Pipelining. CS701 High Performance Computing

Student Presentation 1
Two 20-minute presentations:
  Burks, Goldstine, von Neumann. Preliminary Discussion of the Logical Design of an Electronic Computing Instrument.
  Patterson and Ditzel. The Case for the Reduced Instruction Set Computer. ACM SIGARCH Computer Architecture News, 8(6), 1980.

Implementation of RISC ISA - Stages
  Instruction Fetch (IF)
  Instruction Decode/Register Fetch (ID): fixed-field decoding
  Execution/Effective Address (EX)
  Memory Access (MEM)
  Write Back (WB)

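MIPS Datapath
[Figure: MIPS datapath divided into five stages: Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, Write Back. Labeled components: PC, adder (+4) producing NPC, instruction memory (IM), IR, register file (Regs; rs, rt, rd), A, B, sign extend (16 to 32 bits) producing Imm, multiplexers, ALU, Zero?/Cond, ALU Output, data memory (DM), LMD.]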

MIPS Pipeline
[Figure: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Appendix C. Morgan Kaufmann, 2013.]

MIPS Pipeline Events
[Table: events by pipeline stage. The IF and ID rows apply to any instruction; the EX row has separate columns for ALU instructions, load or store instructions, and branch instructions.]

MIPS Pipeline Events
[Table: events by pipeline stage, rows EX, MEM, and WB, with columns for any instruction, ALU instructions, load or store instructions, and branch instructions.]

MIPS Pipeline
[Figure: pipeline timing diagram showing instructions i1, i2, i3, i4, ... against time in clock cycles 1 to 9.]
Example:
  When will i10000 complete? What is the average number of clock cycles spent per instruction?
  If the processor were not pipelined, when would i10000 complete? What is the average number of clock cycles spent per instruction?
  Which is faster?
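The example questions can be explored with a small sketch. It assumes an ideal five-stage pipeline (IF, ID, EX, MEM, WB) in which one instruction enters per clock cycle and nothing stalls, and an unpipelined machine that spends all five cycles on each instruction before starting the next. The function names are illustrative and do not come from the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # assumed five-stage pipeline


def completion_cycle(i, k=len(STAGES)):
    """Cycle in which instruction i (1-based) finishes in an ideal k-stage pipeline."""
    return k + (i - 1)


def unpipelined_completion_cycle(i, k=len(STAGES)):
    """Cycle in which instruction i finishes if each instruction takes k cycles in turn."""
    return k * i


def timing_diagram(n, stages=STAGES):
    """Return one text row per instruction, one 4-character column per clock cycle."""
    total_cycles = completion_cycle(n, len(stages))
    rows = []
    for i in range(1, n + 1):
        cells = ["    "] * total_cycles
        for offset, name in enumerate(stages):
            # Instruction i occupies stage number `offset` in cycle i + offset.
            cells[(i - 1) + offset] = name.ljust(4)
        rows.append(f"i{i:<3}" + "".join(cells).rstrip())
    return rows


if __name__ == "__main__":
    print("\n".join(timing_diagram(4)))
    n = 10000
    print(completion_cycle(n), completion_cycle(n) / n)                          # 10004 1.0004
    print(unpipelined_completion_cycle(n), unpipelined_completion_cycle(n) / n)  # 50000 5.0
```

Under these assumptions i10000 completes in cycle 10004 on the pipelined machine, at an average of about one cycle per instruction, and in cycle 50000 on the unpipelined one, at five cycles per instruction.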

Speedup of the Pipeline
The speedup of a k-stage pipelined processor over an unpipelined processor:

$$S_k = \frac{T_1}{T_k} = \frac{n k}{k + (n - 1)}$$

n: number of instructions in the program.
k: number of pipeline stages.
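A short sketch of the speedup formula shows how it behaves as the program grows. It assumes every stage takes one clock cycle and that the unpipelined processor needs k cycles per instruction. The function name `pipeline_speedup` is illustrative.

```python
def pipeline_speedup(n, k):
    """S_k = n*k / (k + (n - 1)) for n instructions on a k-stage pipeline."""
    return (n * k) / (k + (n - 1))


if __name__ == "__main__":
    for n in (1, 10, 100, 10000):
        print(n, round(pipeline_speedup(n, 5), 3))
```

For a single instruction the speedup is 1, since the pipeline still needs k cycles to fill. As n grows the speedup approaches k but never reaches it.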

Pipeline Performance
An unpipelined processor has a 1 ns clock cycle. ALU operations and branches take 4 cycles, and memory operations take 5 cycles. The relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, because of clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. What is the speedup?
Average instruction execution time = Clock cycle × Average CPI
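One way to work this exercise is sketched below. It assumes that the pipelined processor keeps the 1 ns stage time, pays the extra 0.2 ns on every clock, and sustains an ideal CPI of 1 with no stalls. Those assumptions and the variable names are not stated on the slide.

```python
# (relative frequency, cycles on the unpipelined processor) per operation class
MIX = {
    "alu": (0.40, 4),
    "branch": (0.20, 4),
    "memory": (0.40, 5),
}


def average_cpi(mix):
    """Frequency-weighted average cycles per instruction."""
    return sum(freq * cycles for freq, cycles in mix.values())


def average_instruction_time_ns(clock_ns, cpi):
    """Average instruction execution time = clock cycle * average CPI."""
    return clock_ns * cpi


unpipelined_ns = average_instruction_time_ns(1.0, average_cpi(MIX))  # about 4.4 ns
pipelined_ns = average_instruction_time_ns(1.0 + 0.2, 1.0)           # 1.2 ns, assuming CPI = 1
speedup = unpipelined_ns / pipelined_ns                              # about 3.67

if __name__ == "__main__":
    print(f"unpipelined: {unpipelined_ns:.2f} ns, pipelined: {pipelined_ns:.2f} ns, "
          f"speedup: {speedup:.2f}")
```

Under these assumptions the pipelined processor is roughly 3.7 times faster, rather than 5 times, because of the clock overhead and because the unpipelined average CPI is below 5.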

Processor Performance
Benchmarks: kernels (e.g. matrix multiply), toy programs (e.g. sorting), synthetic benchmarks (e.g. Dhrystone)
Desktop benchmarks: SPECint, SPECfp, SPECpower
  CINT2006: perlbench, bzip2, gcc, sjeng, libquantum, h264ref, etc.
  CFP2006: bwaves, gamess, zeusmp, leslie3d, povray, calculix, lbm, wrf, sphinx3
www.spec.org

Other Benchmark Suites
SPLASH Benchmarks: parallel application suite
  Kernels and applications: FFT, BARNES simulation, LU decomposition, etc.
PARSEC Benchmarks
  blackscholes, bodytrack, canneal, dedup, fluidanimate, freqmine, raytrace, streamcluster, vips, x264

Increasing Performance
Parallelism: multiple processors, disks, memory banks, pipelining, multiple functional units
Focus on the common case: Amdahl's Law

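Amdahl's Law
[Figure: program execution timelines, original and enhanced, each containing two FP arithmetic blocks.]
What is the overall speedup from enhancing the performance of a single block?

$$\text{Speedup}_{\text{overall}} = \frac{\text{ExecutionTime}_{\text{original}}}{\text{ExecutionTime}_{\text{enhanced}}}$$

Speedup_enhanced: speedup of the enhanced block itself (always > 1)
Fraction_enhanced: fraction of the original execution time that can use the enhancement (always < 1)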

Amdahl's Law

$$\text{ExecutionTime}_{\text{new}} = \text{ExecutionTime}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$$

In the example, FP arithmetic was used for 50% of the program. The new FP arithmetic block was 3 times faster than the previous version. What is the new performance number?
Objective: make the program 10 times faster. Say 25% of the program is waiting in I/O and cannot be enhanced. How much should the speedup of the enhanced computer be?
The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
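Both questions can be checked with a minimal sketch of Amdahl's Law. It reads "25% waiting in I/O" as meaning that the remaining 75% of execution time can be enhanced, which is an interpretation of the slide. The function names are illustrative.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S), i.e. ExecutionTime_old / ExecutionTime_new."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def required_speedup(fraction_enhanced, target_overall):
    """Speedup the enhanced part needs to reach target_overall, or None if unreachable."""
    # Solve target = 1 / ((1 - F) + F / S) for S.
    time_left_for_enhanced_part = 1.0 / target_overall - (1.0 - fraction_enhanced)
    if time_left_for_enhanced_part <= 0:
        return None  # the unenhanced part alone already exceeds the time budget
    return fraction_enhanced / time_left_for_enhanced_part


if __name__ == "__main__":
    print(overall_speedup(0.5, 3))      # FP example: about 1.5
    print(required_speedup(0.75, 10))   # None: a 10x overall speedup is unreachable
    print(1.0 / (1.0 - 0.75))           # upper bound with 25% unenhanced: 4.0
```

With half the program sped up 3 times, the overall speedup is about 1.5. With 25% of the time fixed, no enhancement can deliver more than a 4 times overall speedup, so the 10 times objective cannot be met.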

MIPS Pipeline Events
Instruction Issue: when an instruction moves into the EX stage after completing the ID stage.
Instruction Commit: when an instruction is guaranteed to complete; the instruction updates the state of the processor.
Branch Delay: the clock cycles needed to ascertain whether NPC is to be used or the address obtained from the effective address calculation.

Branch Delay
  ADD R2, R3, R4
  J   loop
  SUB R5, R5, R4
  ADD R6, R8, R2
  XOR R1, R3, R3
[Figure: pipeline timing diagram for ADD, J, SUB, ADD, XOR, and the jump successor over clock cycles 1 to 11.]

Branch Delay
Instruction sequence: ADD R2, R3, R4; J loop; SUB R5, R5, R4; ADD R6, R8, R2; XOR R1, R3, R3
What is the CPI? What is the throughput of this pipeline?
[Figure: pipeline timing diagram for ADD, J, SUB, ADD, XOR, and the jump successor over clock cycles 1 to 11, with stage labels IF, ID, EX, MEM, WB and repeated IF entries.]
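The two questions can be framed with the usual stall-cycle model, sketched below. The numbers are assumptions chosen for illustration and are not read off the slide's diagram: one control transfer per five instructions, a one-cycle branch delay, and a 1 GHz clock. The function names are hypothetical.

```python
def cpi_with_branch_delay(branch_fraction, branch_delay_cycles, ideal_cpi=1.0):
    """CPI = ideal CPI + (fraction of instructions that branch) * (delay cycles per branch)."""
    return ideal_cpi + branch_fraction * branch_delay_cycles


def throughput_ips(clock_hz, cpi):
    """Instructions completed per second = clock rate / CPI."""
    return clock_hz / cpi


if __name__ == "__main__":
    # Assumed values: 1 jump among 5 instructions, 1 delay cycle, 1 GHz clock.
    cpi = cpi_with_branch_delay(branch_fraction=1 / 5, branch_delay_cycles=1)
    print(cpi)
    print(throughput_ips(1e9, cpi))
```

A longer branch delay or a higher branch frequency raises the CPI and lowers the throughput proportionally.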

Decreasing Branch Delay