Pipelining. CS701 High Performance Computing

Student Presentation 1
Two 20-minute presentations:
- Burks, Goldstine, von Neumann. Preliminary Discussion of the Logical Design of an Electronic Computing Instrument.
- Patterson and Ditzel. The Case for the Reduced Instruction Set Computer. ACM SIGARCH Computer Architecture News, 8(6), 1980.

Implementation of RISC ISA - Stages
- Instruction Fetch (IF)
- Instruction Decode/Register Fetch (ID): fixed-field decoding
- Execution/Effective Address (EX)
- Memory Access (MEM)
- Write Back (WB)

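MIPS Datapath
[Figure: MIPS datapath divided into five stages: Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, Write Back. Labeled components: PC, adder (+4), NPC, instruction memory (IM), IR (fields rs, rt, rd), register file (Regs), A, B, Imm, sign extend (16 to 32 bits), multiplexers (MUX), Zero? test, Cond, ALU, ALU Output, data memory (DM), LMD.]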

MIPS Pipeline Hennessy & Patterson, CA-QA, Appendix C, 5ed. MK, 2013

MIPS Pipeline Events
[Table: pipeline events by stage. Stages IF and ID apply to any instruction; stage EX is listed separately for ALU instructions, load or store instructions, and branch instructions.]

MIPS Pipeline Events (continued)
[Table: pipeline events for stages EX, MEM, and WB, with columns for any instruction, ALU instructions, load or store instructions, and branch instructions.]

MIPS Pipeline
[Figure: pipeline timing diagram with instructions i1, i2, i3, i4, ... on the vertical axis and time in clock cycles (1 to 9) on the horizontal axis.]
Example:
- When will i10000 complete? What is the average number of clock cycles spent per instruction?
- If the processor were not pipelined, when would i10000 complete? What is the average number of clock cycles spent per instruction?
- Which is faster?
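The example questions above can be checked with a short sketch. It assumes an ideal five-stage pipeline with no stalls, and it assumes that the unpipelined processor spends five cycles on every instruction; neither assumption is stated on the slide, and the function names are illustrative only.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # the five MIPS stages


def pipelined_completion_cycle(i, k=len(STAGES)):
    """Cycle in which instruction i (1-based) finishes in an ideal, stall-free k-stage pipeline."""
    # The first instruction needs k cycles; each later one finishes one cycle after its predecessor.
    return k + (i - 1)


def unpipelined_completion_cycle(i, k=len(STAGES)):
    """Cycle in which instruction i finishes if every instruction takes k cycles back to back (assumption)."""
    return k * i


def timing_diagram(n):
    """Text diagram: one row per instruction, one column per clock cycle."""
    rows = []
    for j in range(n):
        cells = ["."] * j + STAGES  # instruction j+1 enters IF in cycle j+1
        rows.append(f"i{j + 1}: " + " ".join(f"{c:<3}" for c in cells).rstrip())
    return "\n".join(rows)


print(timing_diagram(4))
n = 10000
for name, fn in [("pipelined", pipelined_completion_cycle),
                 ("unpipelined", unpipelined_completion_cycle)]:
    done = fn(n)
    print(f"{name}: i{n} completes in cycle {done}, average CPI = {done / n}")
```

Under these assumptions i10000 completes in cycle 10004 when pipelined and in cycle 50000 when not, so the average CPI is close to 1 in the first case and exactly 5 in the second.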

Speedup of the Pipeline
The speedup of a k-stage pipelined processor over an unpipelined processor:
$$S_k = \frac{T_1}{T_k} = \frac{nk}{k + (n - 1)}$$
n: number of instructions in the program
k: number of pipeline stages
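A minimal sketch of the speedup formula, useful for seeing how the speedup approaches k as n grows. The function name is illustrative, and the formula is taken as given on the slide (equal stage times, no stalls).

```python
def pipeline_speedup(n, k):
    """Speedup of a k-stage pipeline over an unpipelined processor for n instructions."""
    t_unpipelined = n * k        # T_1: every instruction takes k cycles
    t_pipelined = k + (n - 1)    # T_k: k cycles to fill, then one instruction per cycle
    return t_unpipelined / t_pipelined


for n in (1, 10, 100, 10_000, 1_000_000):
    print(n, round(pipeline_speedup(n, k=5), 4))
```

For a single instruction the speedup is 1 (pipelining does not help), and for very long programs it tends toward k without ever reaching it.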

Pipeline Performance
An unpipelined processor has a 1 ns clock cycle. ALU operations and branches take 4 cycles, and memory operations take 5 cycles. The relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. What is the speedup?
Average instruction execution time = Clock cycle × Average CPI
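One way to work the problem above is sketched below. It assumes the three frequencies apply in the order ALU operations, branches, memory operations, and that the pipelined processor achieves an ideal CPI of 1; both are assumptions rather than statements from the slide.

```python
# (relative frequency, cycles on the unpipelined processor); ordering is an assumption
mix = {
    "ALU": (0.40, 4),
    "branch": (0.20, 4),
    "memory": (0.40, 5),
}

clock_unpipelined_ns = 1.0
pipeline_overhead_ns = 0.2   # clock skew and setup
pipelined_cpi = 1.0          # assumed ideal: one instruction completes per cycle

# Average instruction execution time = clock cycle * average CPI
avg_cpi_unpipelined = sum(freq * cycles for freq, cycles in mix.values())
time_unpipelined_ns = clock_unpipelined_ns * avg_cpi_unpipelined

clock_pipelined_ns = clock_unpipelined_ns + pipeline_overhead_ns
time_pipelined_ns = clock_pipelined_ns * pipelined_cpi

speedup = time_unpipelined_ns / time_pipelined_ns
print(f"average CPI (unpipelined): {avg_cpi_unpipelined:.2f}")
print(f"time per instruction: {time_unpipelined_ns:.2f} ns vs {time_pipelined_ns:.2f} ns")
print(f"speedup: {speedup:.2f}")
```

With these assumptions the unpipelined average is 4.4 ns per instruction against 1.2 ns pipelined, a speedup of about 3.7 rather than the ideal 5.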

Processor Performance
- Benchmarks: kernels (e.g. matrix multiply), toy programs (e.g. sorting), synthetic benchmarks (e.g. Dhrystone)
- Desktop benchmarks: SPECint, SPECfp, SPECpower
- CINT2006: perlbench, bzip2, gcc, sjeng, libquantum, h264ref, etc.
- CFP2006: bwaves, gamess, zeusmp, leslie3d, povray, calculix, lbm, wrf, sphinx3
www.spec.org

Other Benchmark Suites
- SPLASH benchmarks: a parallel application suite of kernels and applications (FFT, BARNES simulation, LU decomposition, etc.)
- PARSEC benchmarks: blackscholes, bodytrack, canneal, dedup, fluidanimate, freqmine, raytrace, streamcluster, vips, x264

Increasing Performance
- Parallelism: multiple processors, disks, memory banks, pipelining, multiple functional units
- Focus on the common case: Amdahl's Law

Amdahl's Law
[Figure: program execution timelines, original and enhanced; the FP arithmetic blocks are shorter in the enhanced version.]
What is the overall speedup from enhancing the performance of a single block?
$$\text{Speedup}_{\text{enhanced}} = \frac{\text{Execution Time}_{\text{original}}}{\text{Execution Time}_{\text{enhancement}}}$$
- $\text{Speedup}_{\text{enhanced}}$: always > 1
- $\text{Fraction}_{\text{enhanced}}$: always < 1

Amdahl's Law
$$\text{Execution Time}_{\text{new}} = \text{Execution Time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$$
- In the example, FP arithmetic accounted for 50% of the program. The new FP arithmetic block is 3 times faster than the previous version. What is the new performance number?
- Objective: make the program 10 times faster. Say 25% of the program is waiting on I/O and cannot be enhanced. How much should the speedup of the enhanced computer be?
The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
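The two exercises above can be explored with a small sketch of Amdahl's Law. The function names are illustrative, and the second function simply solves the formula for the enhanced speedup; it returns None when the target cannot be reached.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = ExecutionTime_old / ExecutionTime_new from Amdahl's Law."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def required_enhanced_speedup(target_overall, fraction_enhanced):
    """Speedup needed on the enhanced fraction to hit target_overall, or None if unreachable."""
    # Solve (1 - f) + f / s = 1 / target for s.
    remaining = 1.0 / target_overall - (1.0 - fraction_enhanced)
    if remaining <= 0:
        return None  # even an infinitely fast enhancement falls short
    return fraction_enhanced / remaining


# Exercise 1: FP arithmetic is 50% of the program and becomes 3 times faster.
print(overall_speedup(0.5, 3))

# Exercise 2: 25% of the program (I/O) cannot be enhanced, so at most 75% can be.
print(required_enhanced_speedup(10, 0.75))
print("upper bound on overall speedup:", 1.0 / (1.0 - 0.75))
```

The first exercise gives an overall speedup of 1.5. In the second, the unenhanced 25% caps the overall speedup at 4, so a tenfold improvement is out of reach no matter how fast the enhanced part becomes.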

MIPS Pipeline Events
- Instruction Issue: when an instruction moves into the EX stage after completing the ID stage.
- Instruction Commit: when an instruction is guaranteed to complete; the instruction updates the state of the processor.
- Branch Delay: the number of clock cycles needed to determine whether the next PC is NPC or the target address produced by the effective address calculation.

Branch Delay
    ADD R2, R3, R4
    J   loop
    SUB R5, R5, R4
    ADD R6, R8, R2
    XOR R1, R3, R3
[Figure: pipeline timing diagram over clock cycles 1 to 11, with rows ADD, J, SUB, ADD, XOR, and Jump Successor.]

Branch Delay
    ADD R2, R3, R4
    J   loop
    SUB R5, R5, R4
    ADD R6, R8, R2
    XOR R1, R3, R3
What is the CPI? What is the throughput of this pipeline?
[Figure: pipeline timing diagram over clock cycles 1 to 11, with rows ADD, J, SUB, ADD, XOR, and Jump Successor, and cells labeled by stage (IF, ID, EX, MEM, WB).]
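The CPI and throughput questions above depend on how many cycles the jump costs, which has to be read off the timing diagram. The sketch below therefore takes the stall count as a parameter; the value used in the example call is hypothetical and is not the answer to the slide's question.

```python
def total_cycles(n_instructions, k_stages, stall_cycles):
    """Cycles to finish n instructions on a k-stage pipeline with the given number of stall cycles."""
    return k_stages + (n_instructions - 1) + stall_cycles


def cpi(n_instructions, k_stages, stall_cycles):
    """Average clock cycles per instruction, including pipeline fill and stalls."""
    return total_cycles(n_instructions, k_stages, stall_cycles) / n_instructions


def throughput(n_instructions, k_stages, stall_cycles):
    """Instructions completed per clock cycle."""
    return n_instructions / total_cycles(n_instructions, k_stages, stall_cycles)


# Hypothetical example: 5 instructions, 5 stages, 1 stall cycle caused by the jump.
n, k, stalls = 5, 5, 1
print(total_cycles(n, k, stalls), cpi(n, k, stalls), throughput(n, k, stalls))
```

For long instruction streams the fill time becomes negligible and the CPI tends toward 1 plus the average number of stall cycles per instruction.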

Decreasing Branch Delay