PIPELINING AND PROCESSOR PERFORMANCE


Slides by: Pedro Tomás
Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 1, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
Course: Advanced Computer Architectures / Arquitecturas Avançadas de Computadores (AAC)

2 Outline
- Revision
- Single cycle processor
- Multi cycle processor
- Processor pipelining
- Instruction flow in a pipeline processor
- Execution conflicts
- Evaluating processor performance

3 Revision of RISC architectures: Single cycle processor

4 Revision of RISC architectures: Multi-cycle processor
[Figure: multi-cycle processor datapath. Recoverable labels: per-stage enable signals (&OF, EX, MEM, WB); PC and next-PC logic with jump control, condition and flags; instruction memory; decoder; register file (RF) with asynchronous read ports and synchronous write ports; operand multiplexers; ALU with flags; data memory with write enable.]

5 Instruction flow: Single cycle processor
- Each instruction takes 1 cycle to execute
- Clock period limited by the worst-case path of the whole processor

Example:

    for (i=0, aux=0; i<100; i++) {
        if (v[i] > aux) {
            aux = v[i];
        }
    }

              LI   R2,100
              SLL  R2,R2,2
              LI   R1,4
              MOVE R3,R0
    LOOP_NXT: LW   R4,100(R1)
              ADDI R1,R1,4
              SUB  R5,R4,R3
              BLEZ R5,LOOP_END
              MOVE R3,R4
    LOOP_END: BNE  R2,R1,LOOP_NXT

Note: some of the instructions used (e.g., LI and MOVE) do not actually exist in the MIPS64 instruction set. They should be replaced with an equivalent instruction such as OR DR,R0,operand, but they are kept here to make the assembly code easier to read.

6 Instruction flow: Single cycle processor
- Each instruction takes 1 cycle to execute
- Clock period limited by the worst-case path of the whole processor

Timing diagram (each instruction goes through all of its stages, ending with EX, MEM and WB, within a single cycle):

    Cycle 1: LI   R2,100
    Cycle 2: SLL  R2,R2,2
    Cycle 3: LI   R1,4
    Cycle 4: MOVE R3,R0

The remaining instructions, from LOOP_NXT: LW R4,100(R1) to LOOP_END: BNE R2,R1,LOOP_NXT, follow at one instruction per cycle.

7 Instruction flow: Multi cycle processor
- Each instruction takes 5 cycles to execute
- Clock period limited by the slowest stage (the worst-case path among all stages)
- The working frequency is higher, but the instruction throughput is lower, since each instruction occupies the processor for 5 cycles
[Timing diagram: the same instruction sequence (LI R2,100 through LOOP_END: BNE R2,R1,LOOP_NXT), with each instruction taking 5 consecutive cycles.]

8 Instruction flow: Pipeline processor
- In each clock cycle, the processor executes one stage of up to 5 different instructions simultaneously
- The instruction throughput increases by up to 5x
[Timing diagram: the same instruction sequence with overlapped execution; the figure highlights the pipeline contents during clock cycle 7.]
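The three organisations can be compared with a small back-of-the-envelope model. This is only a sketch: the stage names (IF, ID, EX, MEM, WB) and the stage delays are assumed values chosen for illustration, not figures from the slides, and the pipeline is taken to be ideal (no stalls). It also assumes that the 5x figure is measured against the multi-cycle design at the same clock period.

```python
# Hypothetical stage delays in nanoseconds (assumed values, not from the slides).
STAGE_DELAYS_NS = {"IF": 2.0, "ID": 1.0, "EX": 2.0, "MEM": 2.0, "WB": 1.0}


def single_cycle_time(n_instructions, delays):
    # One long cycle per instruction: the clock period is the sum of all stage delays.
    return n_instructions * sum(delays.values())


def multi_cycle_time(n_instructions, delays):
    # One short cycle per stage: the clock period is set by the slowest stage,
    # and every instruction needs len(delays) cycles.
    return n_instructions * len(delays) * max(delays.values())


def pipelined_time(n_instructions, delays):
    # Ideal pipeline without stalls: (depth - 1) cycles to fill the pipeline,
    # then one instruction completes per cycle.
    return (len(delays) - 1 + n_instructions) * max(delays.values())


n = 1000
t_single = single_cycle_time(n, STAGE_DELAYS_NS)
t_multi = multi_cycle_time(n, STAGE_DELAYS_NS)
t_pipe = pipelined_time(n, STAGE_DELAYS_NS)
print(f"single cycle: {t_single} ns")
print(f"multi cycle:  {t_multi} ns")
print(f"pipelined:    {t_pipe} ns")
print(f"pipeline speedup over multi cycle: {t_multi / t_pipe:.2f}x")
```

With these assumed delays the ideal pipeline approaches, but never quite reaches, a 5x gain over the multi-cycle design, because of the cycles spent filling the pipeline.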

10 Instruction flow: Pipeline processor
- The instruction throughput can potentially increase by 5x
- Much higher performance, but pipelining generates conflicts that must be solved to guarantee correct behaviour

Register reads and writes annotated in the timing diagram ("Read Rx" marks the stage before EX, in which the register is read; "Rx is valid" marks the point at which the result has been written):

              LI   R2,100          R2 is valid
              SLL  R2,R2,2         Read R2
              LI   R1,4            R1 is valid
              MOVE R3,R0           R3 is valid
    LOOP_NXT: LW   R4,100(R1)      Read R1; R4 is valid
              ADDI R1,R1,4
              SUB  R5,R4,R3        Read R4,R3; R5 is valid
              BLEZ R5,LOOP_END     Read R5
              MOVE R3,R4
    LOOP_END: BNE  R2,R1,LOOP_NXT
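The conflicts on this slide are read-after-write dependencies: an instruction reads a register that a nearby earlier instruction has not yet written. A minimal sketch of how they could be found is given below. The read and write sets are my own reading of the listing, the scan follows the static program order only (it ignores the loop back-edge), and the `window` of 3 instructions is an assumed hazard distance rather than a property stated in the slides.

```python
# Each entry: (instruction text, destination register or None, source registers).
# Read/write sets are inferred from the listing; they are not given in the slides.
PROGRAM = [
    ("LI R2,100", "R2", []),
    ("SLL R2,R2,2", "R2", ["R2"]),
    ("LI R1,4", "R1", []),
    ("MOVE R3,R0", "R3", ["R0"]),
    ("LW R4,100(R1)", "R4", ["R1"]),
    ("ADDI R1,R1,4", "R1", ["R1"]),
    ("SUB R5,R4,R3", "R5", ["R4", "R3"]),
    ("BLEZ R5,LOOP_END", None, ["R5"]),
    ("MOVE R3,R4", "R3", ["R4"]),
    ("BNE R2,R1,LOOP_NXT", None, ["R2", "R1"]),
]


def raw_dependencies(program, window=3):
    """Return (producer, consumer, register) triples for every source register
    whose most recent writer is at most `window` instructions earlier."""
    found = []
    for j, (consumer, _, sources) in enumerate(program):
        for src in sources:
            # Walk backwards to the nearest earlier writer inside the window.
            for i in range(j - 1, max(j - window, 0) - 1, -1):
                producer, dest, _ = program[i]
                if dest == src:
                    found.append((producer, consumer, src))
                    break
    return found


for producer, consumer, reg in raw_dependencies(PROGRAM):
    print(f"{consumer:<20} needs {reg} from {producer}")
```

A larger `window` models a pipeline in which results become visible later; a smaller one models earlier visibility of results.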

11 Instruction flow: Solving conflicts from pipelining
- The conflicts can be solved by delaying instruction issue whenever necessary
- Whenever a conflict is found, the instruction pipeline is stalled
- The real throughput gain from pipelining is therefore smaller than 5x
[Timing diagram: the same annotated instruction sequence, with dependent instructions such as SUB R5,R4,R3 delayed until the registers they read are valid.]

12 Instruction flow: Solving conflicts from pipelining
- The additional logic needed to detect and solve conflicts increases the clock period
- The performance increase is therefore even smaller than expected
[Timing diagram: the same instruction sequence with stalls inserted; SUB R5,R4,R3 and BLEZ R5,LOOP_END are marked as stalled.]
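Both effects (stall cycles and a longer clock period) can be folded into a simple estimate of the real gain. The sketch below assumes the baseline is a multi-cycle processor that needs `depth` cycles per instruction; the stall rate and the clock period increase used in the example are made-up values, not measurements from the slides.

```python
def pipeline_speedup(depth, stalls_per_instruction, clock_period_increase=0.0):
    """Estimated speedup of a pipeline over a multi-cycle design.

    depth: cycles per instruction in the multi-cycle baseline (5 here).
    stalls_per_instruction: average stall cycles added per instruction.
    clock_period_increase: relative growth of the clock period caused by the
        conflict detection logic (0.10 means a 10% longer period).
    """
    pipelined_cpi = 1.0 + stalls_per_instruction
    relative_period = 1.0 + clock_period_increase
    return depth / (pipelined_cpi * relative_period)


print(pipeline_speedup(5, 0.0))        # ideal pipeline
print(pipeline_speedup(5, 0.5))        # with stalls (assumed rate)
print(pipeline_speedup(5, 0.5, 0.10))  # with stalls and a slower clock (assumed)
```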

13 Processor performance
The processor performance depends on a number of factors:
- Clock frequency
- Instruction Set Architecture (e.g., RISC vs CISC)
- ISA implementation (e.g., single cycle vs multi cycle vs pipeline)
- Benchmarks (programs) used
- Compiler optimizations
- Memory bandwidth and latency
What is the best metric to assess processor performance?

14 Measuring processor performance
- Frequency (GHz)
  - Does not take into account architectural differences (e.g., ISA)
- MIPS (million instructions per second)
  $$\text{MIPS} = \frac{\#\text{Instructions}}{\text{Execution Time}\ [\mu\text{s}]} = \frac{\text{Clock Frequency}}{\text{CPI} \times 10^{6}}$$
  - Valid only when using the exact same program, compiler and OS
  - Requires that both processors use the same ISA: 1 CISC instruction is equivalent to several RISC instructions, but takes longer to execute
- MFLOPS (million floating point operations per second)
  - Has the same problems as the MIPS metric
  - Valid only for floating point intensive programs (e.g., it does not make sense for H.264 video compression)
CPI = Cycles Per Instruction
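A short sketch shows why MIPS can mislead across different ISAs. The two machines below are hypothetical: machine A needs twice as many (simpler) instructions as machine B for the same task, and all instruction counts, CPI values and the clock frequency are assumed for illustration.

```python
def mips_from_counts(instruction_count, execution_time_us):
    """MIPS = #Instructions / Execution Time [us]."""
    return instruction_count / execution_time_us


def mips_from_clock(clock_hz, cpi):
    """MIPS = Clock Frequency / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


def execution_time_us(instruction_count, cpi, clock_hz):
    """Execution time in microseconds: instructions * CPI / frequency."""
    return instruction_count * cpi / clock_hz * 1e6


CLOCK_HZ = 2e9  # assumed: both machines run at 2 GHz
MACHINES = {
    "A": {"instructions": 2e9, "cpi": 1.0},  # assumed: more, simpler instructions
    "B": {"instructions": 1e9, "cpi": 1.5},  # assumed: fewer, slower instructions
}

for name, m in MACHINES.items():
    t_us = execution_time_us(m["instructions"], m["cpi"], CLOCK_HZ)
    rating = mips_from_clock(CLOCK_HZ, m["cpi"])
    print(f"machine {name}: {rating:.0f} MIPS, {t_us / 1e6:.2f} s")
```

Under these assumptions machine A has the higher MIPS rating, yet machine B finishes the task sooner, which is why the metric is only meaningful when the program, compiler and ISA are the same.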

15 Measuring processor performance
- Use time to measure processor performance
- Requires the implementation (or at least simulation) of the proposed processor

$$\text{Performance}_P = \frac{1}{\text{Execution Time}_P}$$

What does it mean to say "processor $P_A$ is $x$ times faster than processor $P_B$"?

$$x = \frac{\text{Performance}_{P_A}}{\text{Performance}_{P_B}} = \frac{\text{Execution Time}_{P_B}}{\text{Execution Time}_{P_A}}$$

$x$ is also called the speedup of processor $P_A$ versus processor $P_B$.
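As a quick numeric check of these definitions, the sketch below computes the speedup from two execution times. The times of 10 s and 15 s are assumed values for illustration.

```python
def performance(execution_time):
    """Performance is the inverse of execution time."""
    return 1.0 / execution_time


def speedup(time_a, time_b):
    """How many times faster processor A is than processor B."""
    return performance(time_a) / performance(time_b)


# Assumed measurements: A runs the task in 10 s, B in 15 s.
print(f"A is {speedup(10.0, 15.0):.2f}x faster than B")
```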

16 Measuring processor performance: Task selection
What are the best benchmarks?
- Real applications of interest to the user (different users, different benchmarks)
- Representative programs, e.g., SPEC CPU 2006
- Synthetic programs, e.g., Dhrystone
Trade-off: real applications are more realistic and fit for real systems; synthetic programs are simpler and easier to test in simulation.
Different metrics: throughput (tasks/second) vs latency (seconds/task)

17 Measuring processor performance: SPEC CPU 2006 (integer)
Benchmark | Lang. | Application Area | Description
400.perlbench | C | Programming Language | Derived from Perl V. The workload includes SpamAssassin, MHonArc (an indexer), and specdiff (SPEC's tool that checks benchmark outputs).
401.bzip2 | C | Compression | Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
403.gcc | C | C Compiler | Based on gcc Version 3.2, generates code for Opteron.
429.mcf | C | Combinatorial Optimization | Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
445.gobmk | C | Artificial Intelligence: Go | Plays the game of Go, a simply described but deeply complex game.
456.hmmer | C | Search Gene Sequence | Protein sequence analysis using profile hidden Markov models (profile HMMs).
458.sjeng | C | Artificial Intelligence: chess | A highly-ranked chess program that also plays several chess variants.
462.libquantum | C | Physics / Quantum Computing | Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
464.h264ref | C | Video Compression | A reference implementation of H.264/AVC, encodes a videostream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2.
471.omnetpp | C++ | Discrete Event Simulation | Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
473.astar | C++ | Path-finding Algorithms | Pathfinding library for 2D maps, including the well known A* algorithm.
483.xalancbmk | C++ | XML Processing | A modified version of Xalan-C++, which transforms XML documents to other document types.

18 Measuring processor performance: SPEC CPU 2006 (floating point)
Benchmark | Lang. | Application Area | Description
410.bwaves | Fortran | Fluid Dynamics | Computes 3D transonic transient laminar viscous flow.
416.gamess | Fortran | Quantum Chemistry | Gamess implements a wide range of quantum chemical computations.
433.milc | C | Quantum Chromodynamics | A gauge field generating program for lattice gauge theory programs.
434.zeusmp | Fortran | Physics / CFD | Computational fluid dynamics code for simulating astrophysical phenomena.
435.gromacs | C, Fortran | Biochemistry / Molecular Dynamics | Molecular dynamics, i.e. simulate Newtonian equations of motion for hundreds to millions of particles. The test case simulates protein Lysozyme in a solution.
436.cactusADM | C, Fortran | Physics / General Relativity | Solves the Einstein evolution equations using a staggered-leapfrog method.
437.leslie3d | Fortran | Fluid Dynamics | Large-Eddy Simulations with Linear-Eddy Model in 3D.
444.namd | C++ | Biology / Molecular Dynamics | Simulates large biomolecular systems.
447.dealII | C++ | Finite Element Analysis | Program library targeted at adaptive finite elements and error estimation.
450.soplex | C++ | Linear Programming, Optimization | Solves a linear program using a simplex algorithm and sparse linear algebra.
453.povray | C++ | Image Ray-tracing | Image rendering of a 1280x1024 anti-aliased landscape.
454.calculix | C, Fortran | Structural Mechanics | Finite element code for linear and nonlinear 3D structural applications.
459.GemsFDTD | Fortran | Computational Electromagnetics | Solves the Maxwell equations in 3D using the finite-difference time-domain (FDTD) method.
465.tonto | Fortran | Quantum Chemistry | An open source quantum chemistry package.
470.lbm | C | Fluid Dynamics | Simulates incompressible fluids in 3D.
481.wrf | C, Fortran | Weather | Weather modeling from scales of meters to thousands of kilometers.
482.sphinx3 | C | Speech recognition | A widely-known speech recognition system from Carnegie Mellon University.

19 Measuring processor performance: Averaging performance
- Normal: all benchmarks have the same weight
- Weighted ($W_i$): benchmarks are weighted by frequency or relevance

Arithmetic mean (normal and weighted):
$$\frac{1}{N}\sum_{i=1}^{N} T_i \qquad\qquad \frac{\sum_{i=1}^{N} W_i T_i}{\sum_{i=1}^{N} W_i}$$

Harmonic mean (less sensitive to large outliers; increases the influence of small values), normal and weighted:
$$\frac{1}{\frac{1}{N}\sum_{i=1}^{N}\frac{1}{T_i}} = \frac{N}{\sum_{i=1}^{N}\frac{1}{T_i}} \qquad\qquad \frac{\sum_{i=1}^{N} W_i}{\sum_{i=1}^{N}\frac{W_i}{T_i}}$$

Alternative: instead of the execution time $T_i$, use the speedup relative to a standard reference machine:
$$\text{Speedup}_i = \frac{T_i^{\text{Ref}}}{T_i}$$
SPEC uses a SPARCstation (Sun Sparc10) as reference.
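The averaging formulas translate directly into code. The sketch below is a minimal illustration; the benchmark times, the weights and the reference times are made-up values.

```python
def arithmetic_mean(times, weights=None):
    """Weighted arithmetic mean; equal weights when none are given."""
    weights = weights or [1.0] * len(times)
    return sum(w * t for w, t in zip(weights, times)) / sum(weights)


def harmonic_mean(times, weights=None):
    """Weighted harmonic mean; equal weights when none are given."""
    weights = weights or [1.0] * len(times)
    return sum(weights) / sum(w / t for w, t in zip(weights, times))


def speedups(reference_times, times):
    """Per-benchmark speedup relative to a reference machine."""
    return [ref / t for ref, t in zip(reference_times, times)]


times = [1.0, 2.0, 4.0]         # assumed execution times, in seconds
weights = [2.0, 1.0, 1.0]       # assumed relevance of each benchmark
reference = [10.0, 10.0, 10.0]  # assumed times on the reference machine

print(arithmetic_mean(times), arithmetic_mean(times, weights))
print(harmonic_mean(times), harmonic_mean(times, weights))
print(speedups(reference, times))
```

For the same data the harmonic mean is lower than the arithmetic mean, which reflects the larger influence it gives to small values.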

20 Measuring processor performance: Amdahl's Law
Consider that we improve processor performance by better designing some part of it, e.g., by improving floating point calculations by 3x. What is the actual improvement?

$$\text{Speedup} = \frac{T}{T'} = \frac{T}{T'(\text{FP}) + T'(\text{Non FP})}$$

The Non-FP instructions have the same execution time:
$$T'(\text{Non FP}) = T(\text{Non FP})$$

FP instructions execution time:
$$T'(\text{FP}) = \frac{T(\text{FP})}{\text{Speedup}(\text{FP})} = \frac{1}{3}\,T(\text{FP})$$

$T$: execution time in the original processor
$T'$: execution time in the improved processor

21 Measuring processor performance: Amdahl's Law
Consider that we improve processor performance by better designing some part of it, e.g., by improving floating point calculations by 3x. What is the actual improvement?

$$\text{Speedup} = \frac{T}{T'} = \frac{T}{\frac{T(\text{FP})}{\text{Speedup}(\text{FP})} + T(\text{Non FP})} = \frac{1}{\frac{T(\text{FP})/T}{\text{Speedup}(\text{FP})} + T(\text{Non FP})/T}$$

Let us consider that, in the original processor, the fraction of time spent executing floating point instructions is
$$\alpha(\text{FP}) = \frac{T(\text{FP})}{T} = 0.25$$

$$\text{Speedup} = \frac{1}{\frac{\alpha(\text{FP})}{\text{Speedup}(\text{FP})} + 1 - \alpha(\text{FP})} = \frac{1}{\frac{0.25}{3} + 0.75} = 1.2$$

$T$: execution time in the original processor
$T'$: execution time in the improved processor
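The worked example can be checked numerically with a few lines of code. The function name and parameter names below are my own; the values 0.25 and 3 are the ones used on the slide.

```python
def amdahl_speedup(fraction_improved, speedup_improved):
    """Overall speedup when only a fraction of the execution time is improved."""
    unchanged = 1.0 - fraction_improved
    return 1.0 / (unchanged + fraction_improved / speedup_improved)


# Slide example: 25% of the time is floating point, and FP becomes 3x faster.
print(f"{amdahl_speedup(0.25, 3.0):.2f}")
```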

22 Measuring processor performance: Amdahl's Law (corollary)
Consider that we improve processor performance by better designing some part of it, e.g., by improving floating point calculations by 3x. What is the maximum improvement possible?

$$\text{Speedup} = \frac{1}{\frac{\alpha(\text{FP})}{\text{Speedup}(\text{FP})} + 1 - \alpha(\text{FP})} = 1.2$$

Consider that $\text{Speedup}(\text{FP}) \to +\infty$:

$$\text{Maximum achievable Speedup} = \frac{1}{1 - \alpha(\text{FP})} = \frac{1}{0.75} \approx 1.33$$

$T$: execution time in the original processor
$T'$: execution time in the improved processor

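23 Measuring processor performance: Amdahl's Law (summary)
Execution time:
$$T_{\text{Improved machine}} = T_{\text{Reference machine}} \times \left[ \left(1 - \text{Fraction}_{\text{Improved}}\right) + \frac{\text{Fraction}_{\text{Improved}}}{\text{Speedup}_{\text{Improved}}} \right]$$

Actual speedup:
$$\text{Speedup}_{\text{Improved machine}} = \frac{1}{\left(1 - \text{Fraction}_{\text{Improved}}\right) + \frac{\text{Fraction}_{\text{Improved}}}{\text{Speedup}_{\text{Improved}}}}$$

Maximum achievable speedup ($\text{Speedup}_{\text{Improved}} \to +\infty$):
$$\text{Maximum Speedup}_{\text{Improved machine}} = \frac{1}{1 - \text{Fraction}_{\text{Improved}}}$$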
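The limit in the summary can be illustrated by sweeping the speedup of the improved part. This is a sketch only: the reference time of 100 s and the list of speedups are assumed values, while the fraction of 0.25 is the one used in the earlier slides.

```python
def improved_time(reference_time, fraction_improved, speedup_improved):
    """Execution time on the improved machine."""
    unchanged = 1.0 - fraction_improved
    return reference_time * (unchanged + fraction_improved / speedup_improved)


def max_speedup(fraction_improved):
    """Upper bound on the overall speedup as the improved part becomes infinitely fast."""
    return 1.0 / (1.0 - fraction_improved)


REFERENCE_TIME_S = 100.0  # assumed reference execution time
FRACTION = 0.25           # fraction of time that benefits from the improvement

for s in (3.0, 10.0, 100.0, 1000.0):
    t = improved_time(REFERENCE_TIME_S, FRACTION, s)
    print(f"part speedup {s:>6.0f}x -> overall {REFERENCE_TIME_S / t:.4f}x")
print(f"limit: {max_speedup(FRACTION):.4f}x")
```

However large the speedup of the improved part becomes, the overall speedup stays below the limit set by the fraction of time that is left unchanged.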

24 Next lesson
More on processor pipelining:
- Conflict identification
- Solving conflicts
