CS 33. Architecture and Optimization (2) CS33 Intro to Computer Systems XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.


Modern CPU Design
[diagram: the instruction control unit (fetch control, instruction decode, instruction cache, retirement unit with register file) issues operations to the execution unit's functional units (integer/branch, general integer, FP add, FP mult/div, load, store), which exchange addresses and data with the data cache and send operation results and register updates back; branch predictions are checked against actual outcomes]

Superscalar Processor
Definition: a superscalar processor can issue and execute multiple instructions in one cycle. Instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically; they may be executed out of order.
Benefit: without programming effort, superscalar processors can take advantage of the instruction-level parallelism that most programs have.
Most CPUs since about 1998 are superscalar; Intel's have been since the Pentium Pro (1995).

Multiple Operations per Instruction
addq %rax, %rdx — a single operation
addq %rax, 8(%rdx) — three operations:
  read the value from memory
  add to it the contents of %rax
  store the result back in memory

Instruction-Level Parallelism
addq 8(%rax), %rax
addq %rbx, %rdx
These can be executed simultaneously: they are completely independent.
addq 8(%rax), %rbx
addq %rbx, %rdx
These can also be executed simultaneously, but some coordination is required: both update %rbx-related state, so their results must be kept in order.

Out-of-Order Execution
movss (%rbp), %xmm0
mulss (%rax,%rdx,4), %xmm0
movss %xmm0, (%rbp)
addq  %r8, %r9
imulq %rcx, %r12
addq  $1, %rdx
The last three instructions can be executed without waiting for the first three to finish.

Speculative Execution
80489f3: movl  $0x1,%ecx
80489f8: xorq  %rdx,%rdx
80489fa: cmpq  %rsi,%rdx
80489fc: jnl   8048a25
80489fe: movl  %esi,%edi
8048a00: imull (%rax,%rdx,4),%ecx
While the branch at 80489fc is unresolved, the processor may go ahead and execute the instructions that follow it.

Haswell CPU Functional Units
1) Integer arithmetic, floating-point multiplication, integer and floating-point division, branches
2) Integer arithmetic, floating-point addition, integer and floating-point multiplication
3) Load, address computation
4) Load, address computation
5) Store
6) Integer arithmetic
7) Integer arithmetic, branches
8) Store, address computation

Haswell CPU Instruction Characteristics
Instruction                 Latency   Cycles/Issue   Capacity
Integer Add                    1           1             4
Integer Multiply               3           1             1
Integer/Long Divide          3-30        3-30            1
Single/Double FP Add           3           1             1
Single/Double FP Multiply      5           1             2
Single/Double FP Divide      3-15        3-15            1

Haswell CPU Performance Bounds
                Integer        Floating Point
                +      *        +      *
Latency       1.00   3.00     3.00   5.00
Throughput    0.50   1.00     1.00   0.50

x86-64 Compilation of Combine4
Inner loop (case: integer multiply):
.L519:                            # Loop:
  imull (%rax,%rdx,4), %ecx       # t = t * d[i]
  addq  $1, %rdx                  # i++
  cmpq  %rdx, %rbp                # compare length:i
  jg    .L519                     # if >, goto Loop

Method             Integer        Double FP
Operation          Add   Mult     Add   Mult
Combine4           1.27  3.01     3.01  5.01
Latency bound      1.00  3.00     3.00  5.00
Throughput bound   0.50  1.00     1.00  0.50

Inner Loop
mulss (%rax,%rdx,4), %xmm0
addq  $1,%rdx
cmpq  %rdx,%rbp
jg    loop
[diagram: the mul, add, cmp, and jg operations reading and writing %rax, %rbp, %rdx, and %xmm0 across one iteration]

Data-Flow Graphs of Inner Loop
[diagram: abstracting the loop to its data flow — data[i] and the old %xmm0 feed a mul producing the new %xmm0, while the old %rdx feeds the update producing the new %rdx; the cmp/jg pair hangs off %rdx and %rbp but is not on the data path]

Relative Execution Times
[diagram: the reduced data-flow graph — data[i] and %xmm0 feed mul to produce the next %xmm0, with the %rdx update alongside; the mul dominates the iteration time]

Data Flow Over Multiple Iterations
[diagram: the critical path is a chain of n dependent mul operations, consuming data[0] through data[n-1], each waiting for the previous product]

Pipelined Data-Flow Over Multiple Iterations
[diagram, shown in three animation frames: successive mul operations overlap in the pipeline, each starting before the previous one has finished]

Combine4 = Serial Computation (OP = *)
Computation (length = 8):
((((((((1 * d[0]) * d[1]) * d[2]) * d[3]) * d[4]) * d[5]) * d[6]) * d[7])
Sequential dependence: performance is determined by the latency of OP.
[diagram: a chain of eight multiplies, each combining the running product with the next d[i]]

Loop Unrolling
void unroll2x(vec_ptr_t v, data_t *dest) {
    int length = vec_length(v);
    int limit = length - 1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    int i;
    /* combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x = (x OP d[i]) OP d[i+1];
    }
    /* finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}
Perform 2x more useful work per iteration.

Quiz 1
Does loop unrolling speed things up by allowing more parallelism?
a) yes
b) no

Effect of Loop Unrolling
Method             Integer        Double FP
Operation          Add   Mult     Add   Mult
Combine4           1.27  3.01     3.01  5.01
Unroll 2x          1.01  3.01     3.01  5.01
Latency bound      1.00  3.00     3.00  5.00
Throughput bound   0.50  1.00     1.00  0.50
Helps integer add: reduces loop overhead.
The others don't improve. Why? There is still a sequential dependency:
x = (x OP d[i]) OP d[i+1];

Loop Unrolling with Reassociation
void unroll2xra(vec_ptr_t v, data_t *dest) {
    int length = vec_length(v);
    int limit = length - 1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    int i;
    /* combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x = x OP (d[i] OP d[i+1]);
    }
    /* finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}
Can this change the result of the computation? Yes, for FP. Why?
Compare to before: x = (x OP d[i]) OP d[i+1];

Effect of Reassociation
Method                  Integer        Double FP
Operation               Add   Mult     Add   Mult
Combine4                1.27  3.01     3.01  5.01
Unroll 2x               1.01  3.01     3.01  5.01
Unroll 2x, reassociate  1.01  1.51     1.51  2.51
Latency bound           1.00  3.00     3.00  5.00
Throughput bound        0.50  1.00     1.00  0.50
Nearly 2x speedup for int *, FP +, FP *. Reason: x = x OP (d[i] OP d[i+1]) breaks the sequential dependency. Why is that? (next slide)

Reassociated Computation
x = x OP (d[i] OP d[i+1]);
What changed: ops in the next iteration can be started early (no dependency).
[diagram: the pairwise products d[2i] * d[2i+1] are computed off the critical path; only their combination into x is sequential]
Overall performance: with N elements and D cycles of latency per op, this should take (N/2 + 1)*D cycles: CPE = D/2. The measured CPE is slightly worse for integer addition.

Loop Unrolling with Separate Accumulators
void unroll2xp2x(vec_ptr_t v, data_t *dest) {
    int length = vec_length(v);
    int limit = length - 1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT;
    data_t x1 = IDENT;
    int i;
    /* combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
    }
    /* finish any remaining elements */
    for (; i < length; i++) {
        x0 = x0 OP d[i];
    }
    *dest = x0 OP x1;
}
A different form of reassociation.

Effect of Separate Accumulators
Method                  Integer        Double FP
Operation               Add   Mult     Add   Mult
Combine4                1.27  3.01     3.01  5.01
Unroll 2x               1.01  3.01     3.01  5.01
Unroll 2x, reassociate  1.01  1.51     1.51  2.01
Unroll 2x, parallel 2x  0.81  1.51     1.51  2.51
Latency bound           1.00  3.00     3.00  5.00
Throughput bound        0.50  1.00     1.00  0.50
2x speedup (over unroll 2x) for int *, FP +, FP *: x0 = x0 OP d[i]; x1 = x1 OP d[i+1]; breaks the sequential dependency in a cleaner, more obvious way.

Separate Accumulators
x0 = x0 OP d[i]; x1 = x1 OP d[i+1];
What changed: there are now two independent streams of operations, one combining the even-indexed elements, one the odd-indexed.
[diagram: two parallel chains of multiplies, one over d[0], d[2], d[4], d[6] and one over d[1], d[3], d[5], d[7], combined at the end]
Overall performance: with N elements and D cycles of latency per op, this should take (N/2 + 1)*D cycles: CPE = D/2. Integer addition improved, but is not yet at the predicted value.
What now?

Quiz 2
With 3 accumulators there will be 3 independent streams of instructions; with 4 accumulators, 4 independent streams; and so on. Thus with n accumulators we can have a speedup of O(n), as long as n is no greater than the number of available registers.
a) true
b) false

Performance
[graph: CPE vs. unrolling factor k (1 through 10) for double *, double +, long *, and long +; the curves flatten out at the throughput bounds]
K-way loop unrolling with K accumulators is limited by the number and throughput of the functional units.

Achievable Performance
Method             Integer        Double FP
Operation          Add   Mult     Add   Mult
Achievable scalar  0.54  1.01     1.01  0.52
Latency bound      1.00  3.00     3.00  5.00
Throughput bound   0.50  1.00     1.00  0.50

Using Vector Instructions
Method                   Integer        Double FP
Operation                Add   Mult     Add   Mult
Achievable scalar        0.54  1.01     1.01  0.52
Latency bound            1.00  3.00     3.00  5.00
Throughput bound         0.50  1.00     1.00  0.50
Achievable vector        0.05  0.24     0.25  0.16
Vector throughput bound  0.06  0.12     0.25  0.12
Make use of SSE instructions: parallel operations on multiple data elements.

What About Branches?
Challenge: the instruction control unit must work well ahead of the execution unit to generate enough operations to keep the EU busy.
80489f3: movl  $0x1,%ecx
80489f8: xorq  %rdx,%rdx
80489fa: cmpq  %rsi,%rdx
80489fc: jnl   8048a25        <- executing
80489fe: movl  %esi,%edi
8048a00: imull (%rax,%rdx,4),%ecx
How to continue? When it encounters a conditional branch, it cannot reliably determine where to continue fetching.

Modern CPU Design
[diagram repeated from earlier: instruction control unit feeding the functional units, with branch prediction checked against outcomes]

Branch Outcomes
When the fetcher encounters a conditional branch, it cannot determine where to continue fetching:
  branch taken: transfer control to the branch target
  branch not taken: continue with the next instruction in sequence
It cannot resolve this until the outcome is determined by the branch/integer unit.
80489f3: movl  $0x1,%ecx
80489f8: xorq  %rdx,%rdx
80489fa: cmpq  %rsi,%rdx
80489fc: jnl   8048a25
Branch not taken:
80489fe: movl  %esi,%edi
8048a00: imull (%rax,%rdx,4),%ecx
Branch taken:
8048a25: cmpq  %rdi,%rdx
8048a27: jl    8048a20
8048a29: movl  0xc(%rbp),%eax
8048a2c: leal  0xffffffe8(%rbp),%esp
8048a2f: movl  %ecx,(%rax)

Branch Prediction
Idea: guess which way the branch will go, and begin executing instructions at the predicted position — but don't actually modify register or memory data.
80489f3: movl  $0x1,%ecx
80489f8: xorq  %rdx,%rdx
80489fa: cmpq  %rsi,%rdx
80489fc: jnl   8048a25        <- predict taken
...
Begin execution:
8048a25: cmpq  %rdi,%rdx
8048a27: jl    8048a20
8048a29: movl  0xc(%rbp),%eax
8048a2c: leal  0xffffffe8(%rbp),%esp
8048a2f: movl  %ecx,(%rax)

Branch Prediction Through Loop
Assume vector length = 100.
i = 98:
80488b1: movl (%rcx,%rdx,4),%eax
80488b4: addl %eax,(%rdi)
80488b6: incl %edx
80488b7: cmpl %esi,%edx
80488b9: jl   80488b1         <- predict taken (OK)
i = 99: (same five instructions)  <- predict taken (oops)
i = 100: (same five instructions) <- executed; reads an invalid location
i = 101: (same five instructions) <- fetched

Branch Misprediction Invalidation
Assume vector length = 100.
i = 98:
80488b1: movl (%rcx,%rdx,4),%eax
80488b4: addl %eax,(%rdi)
80488b6: incl %edx
80488b7: cmpl %esi,%edx
80488b9: jl   80488b1         <- predict taken (OK)
i = 99: (same five instructions)  <- predict taken (oops)
i = 100, i = 101: the speculatively issued instructions are invalidated.

Branch Misprediction Recovery
i = 99:
80488b1: movl (%rcx,%rdx,4),%eax
80488b4: addl %eax,(%rdi)
80488b6: incl %edx
80488b7: cmpl %esi,%edx
80488b9: jl   80488b1         <- definitely not taken
80488bb: leal 0xffffffe8(%rbp),%esp
80488be: popl %ebx
80488bf: popl %esi
80488c0: popl %edi
Performance cost: multiple clock cycles on a modern processor; mispredictions can be a major performance limiter.

Conditional Moves
void minmax1(long *a, long *b, long n) {
    long i;
    for (i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            long t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}
Compiled code uses a conditional branch: 13.5 CPE for random data, 2.5-3.5 CPE for predictable data.

void minmax2(long *a, long *b, long n) {
    long i;
    for (i = 0; i < n; i++) {
        long min = a[i] < b[i] ? a[i] : b[i];
        long max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}
Compiled code uses conditional move instructions: 4.0 CPE regardless of the data's pattern.

Latency of Loads
typedef struct ELE {
    struct ELE *next;
    long data;
} list_ele, *list_ptr;

int list_len(list_ptr ls) {
    long len = 0;
    while (ls) {
        len++;
        ls = ls->next;
    }
    return len;
}

# len in %rax, ls in %rdi
.L11:                        # loop:
  addq  $1, %rax             # increment len
  movq  (%rdi), %rdi         # ls = ls->next
  testq %rdi, %rdi           # test ls
  jne   .L11                 # if != 0, go to loop
4 CPE

Clearing an Array
#define ITERS 100000000
void clear_array() {
    long dest[100];
    int iter;
    for (iter = 0; iter < ITERS; iter++) {
        long i;
        for (i = 0; i < 100; i++)
            dest[i] = 0;
    }
}
1 CPE

Store/Load Interaction
void write_read(long *src, long *dest, long n) {
    long cnt = n;
    long val = 0;
    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }
}

Store/Load Interaction
long a[] = {-10, 17};

Example A: write_read(&a[0], &a[1], 3) — CPE 1.3
           cnt    a           val
Initial     3    -10  17       0
Iter. 1     2    -10   0      -9
Iter. 2     1    -10  -9      -9
Iter. 3     0    -10  -9      -9

Example B: write_read(&a[0], &a[0], 3) — CPE 7.3
           cnt    a           val
Initial     3    -10  17       0
Iter. 1     2      0  17       1
Iter. 2     1      1  17       2
Iter. 3     0      2  17       3

Some Details of Load and Store
[diagram: the load unit and store unit sit between the execution units and the data cache; the store unit keeps a store buffer of pending (address, data) pairs, and a load whose address matches a buffered store's address can be satisfied from the buffer rather than the data cache]

Inner-Loop Data Flow of write_read
movq %rax,(%rcx)    # *dest = val;
movq (%rbx),%rax    # val = *src
addq $1,%rax        # val++
subq $1,%rdx        # cnt--
jne  loop
[diagram: the store is split into s_addr and s_data operations; together with the load, the add, the sub, and the jne, they read and write %rax, %rbx, %rcx, and %rdx across one iteration]

Inner-Loop Data Flow of write_read
[diagram, simplified: (1) s_addr computes the store address, (2) s_data buffers the value being stored, (3) the load must compare its address against the buffered store; the s_data/load chain and the sub chain are the values carried to the next iteration]

Data Flow Over Multiple Iterations
[diagram: two candidate critical paths across iterations — one through the s_data/load chain (governing when src and dest match, Example B), one through the sub chain (governing when they don't, Example A)]

Getting High Performance
Use a good compiler and flags.
Don't do anything stupid:
  watch out for hidden algorithmic inefficiencies
  write compiler-friendly code — watch out for optimization blockers: procedure calls and memory references
  look carefully at the innermost loops (where most of the work is done)
Tune the code for the machine:
  exploit instruction-level parallelism
  avoid unpredictable branches
  make the code cache friendly (covered soon)

Hyper-Threading
[diagram: two instruction-control units (each with its own fetch control, instruction decode, instruction cache, retirement unit, and register file) share a single set of functional units and one data cache]

Chip with Multiple Cores
[diagram: two complete cores, each with its own instruction control, functional units, and data cache, plus more cache and other logic shared on the same chip]