Optimization part 1 1

Changelog 1 Changes made in this version not seen in first lecture: 29 Feb 2018: loop unrolling performance: remove bogus instruction cache overhead remark 29 Feb 2018: spatial locality in Akj: correct reference to B k+1,j to A k+1,j

last time 2
what things in C code map to the same cache set?
    key idea: addresses a multiple of (bytes per way) apart from each other map to the same set
    finding conflict misses in C: how overloaded is each cache set?
cache blocking for matrix-like code: maximize work per cache miss

some logistics 3
exam next week: everything up to and including this lecture
yes, I know office hours were very slow
    I'd like to think about how to help:
    group office hours? better tools? different priorities on the queue?

view as an explicit cache 4
imagine we explicitly moved things into cache

original loop:
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i) {
        loadIntoCache(&A[i*N+k]);
        for (int j = 0; j < N; ++j) {
            loadIntoCache(&B[i*N+j]);
            loadIntoCache(&A[k*N+j]);
            B[i*N+j] += A[i*N+k] * A[k*N+j];
        }
    }

view as an explicit cache 5
imagine we explicitly moved things into cache

blocking in k:
for (int kk = 0; kk < N; kk += 2)
    for (int i = 0; i < N; ++i) {
        loadIntoCache(&A[i*N+kk]);
        loadIntoCache(&A[i*N+kk+1]);
        for (int j = 0; j < N; ++j) {
            loadIntoCache(&B[i*N+j]);
            loadIntoCache(&A[kk*N+j]);
            loadIntoCache(&A[(kk+1)*N+j]);
            for (int k = kk; k < kk + 2; ++k)
                B[i*N+j] += A[i*N+k] * A[k*N+j];
        }
    }

calculation counting with explicit cache 6
before: load 2 values for one multiply+add (2 loads per multiply+add)
after: load 3 values for two multiply+adds (1.5 loads per multiply+add)

simple blocking: temporal locality in Bij 7
for (int k = 0; k < N; k += 2)
    for (int i = 0; i < N; i += 2)
        /* load a block around A_{ik} */
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            B_{i+0,j} += A_{i+0,k+0} * A_{k+0,j}
            B_{i+0,j} += A_{i+0,k+1} * A_{k+1,j}
            B_{i+1,j} += A_{i+1,k+0} * A_{k+0,j}
            B_{i+1,j} += A_{i+1,k+1} * A_{k+1,j}
        }

before: each B_{i,j} accessed once, then not again for N^2 iterations
after: each B_{i,j} accessed twice in a row, then not again for N^2 iterations (next k)

simple blocking: temporal locality in Akj 8
(same blocked loop as slide 7)

before blocking: each A_{k,j} accessed once, then not again for N iterations
after blocking: each A_{k,j} accessed twice in a row, then not again for N iterations (next i)

simple blocking: temporal locality in Aik 9
(same blocked loop as slide 7)

before: each A_{i,k} accessed N times in a row, then never again
after: each A_{i,k} accessed N times, but other parts of the A block accessed in between
slightly less temporal locality

simple blocking: spatial locality in Bij 10
(same blocked loop as slide 7)

before blocking: perfect spatial locality (B_{i,j} and B_{i,j+1} adjacent)
after blocking: slightly less spatial locality
    B_{i,j} and B_{i+1,j} far apart (N elements)
    but B_{i,j+1} still accessed the iteration after B_{i,j} (adjacent)

simple blocking: spatial locality in Akj 11
(same blocked loop as slide 7)

before: perfect spatial locality (A_{k,j} and A_{k,j+1} adjacent)
after: slightly less spatial locality
    A_{k,j} and A_{k+1,j} far apart (N elements)
    but A_{k,j+1} still accessed the iteration after A_{k,j} (adjacent)

simple blocking: spatial locality in Aik 12
(same blocked loop as slide 7)

before: very poor spatial locality (A_{i,k} and A_{i+1,k} far apart)
after: some spatial locality
    A_{i,k} and A_{i+1,k} still far apart (N elements)
    but A_{i,k} now accessed together with A_{i,k+1}

generalizing cache blocking 13
for (int kk = 0; kk < N; kk += K) {
    for (int ii = 0; ii < N; ii += I) {
        /* with I-by-K block of A hopefully cached: */
        for (int jj = 0; jj < N; jj += J) {
            /* with K-by-J block of A, I-by-J block of B cached: */
            for i in ii to ii+I:
                for j in jj to jj+J:
                    for k in kk to kk+K:
                        B[i * N + j] += A[i * N + k] * A[k * N + j];

B_{ij} used K times for one miss: N^2/K misses
A_{ik} used J times for one miss: N^2/J misses
A_{kj} used I times for one miss: N^2/I misses
catch: I*K + K*J + I*J elements must fit in cache
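
For concreteness, the pseudocode above as compilable C. This is a sketch: square_blocked, the block sizes BI/BJ/BK, and the MIN macro are names and values invented here for illustration, and the MIN bounds handle N not being a multiple of the block sizes:

#define BI 16   /* illustrative block sizes; tune for the target cache */
#define BJ 16
#define BK 16
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* B[i][j] += A[i][k] * A[k][j], blocked in all three dimensions */
void square_blocked(float *B, const float *A, int N) {
    for (int kk = 0; kk < N; kk += BK)
        for (int ii = 0; ii < N; ii += BI)
            for (int jj = 0; jj < N; jj += BJ)
                /* one BI-by-BJ block of B, using a BI-by-BK
                   and a BK-by-BJ block of A */
                for (int i = ii; i < MIN(ii + BI, N); ++i)
                    for (int j = jj; j < MIN(jj + BJ, N); ++j)
                        for (int k = kk; k < MIN(kk + BK, N); ++k)
                            B[i * N + j] += A[i * N + k] * A[k * N + j];
}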

cache blocking overview 14
reorder calculations, typically to work in square-ish chunks of the input
goal: maximum calculations per load into cache
    typically: use every value several times after loading it
versus naive loop code:
    some values loaded, then used once
    some values loaded, then used all possible times

cache blocking and miss rate 15
[graph: read misses per multiply or add (0.00 to 0.09) versus N (0 to 500), blocked versus unblocked]

what about performance? 16
[two graphs, unblocked versus blocked: cycles per multiply/add for a less optimized loop (0.0 to 2.0, N up to 500) and for an optimized loop (0.0 to 0.5, N up to 1000)]

performance for big sizes 17
[graph: cycles per multiply or add (0.0 to 1.0) versus N (0 to 10000), unblocked versus blocked; annotation marks the region where the matrix fits in the L3 cache]

optimized loop??? 18
performance difference wasn't visible at small sizes
until I optimized arithmetic in the loop (mostly by supplying better options to GCC):
1: reducing the number of loads
2: doing adds/multiplies/etc. with fewer instructions
3: simplifying address computations
but how can that make cache blocking better???
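
As a sketch of what item 3 (simplifying address computations) can mean in C, here is the sum loop from the next few slides rewritten to step a pointer instead of recomputing an address from an index every iteration; sum_ptr is a name invented here, and this is roughly the shape GCC's -O2 output takes below:

long sum_ptr(const long *A, int N) {
    long result = 0;
    if (N <= 0) return result;
    const long *end = A + N;            /* compute the end address once */
    for (const long *p = A; p != end; ++p)
        result += *p;                   /* one load + one add per element */
    return result;
}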

overlapping loads and arithmetic 19
[timeline diagram: loads executing in parallel with a stream of multiplies and adds]
speed of load might not matter if the multiplies/adds are slower

optimization and bottlenecks 20
arithmetic/loop efficiency was the bottleneck
after fixing this, cache performance was the bottleneck
common theme when optimizing: X may not matter until Y is optimized

example assembly (unoptimized) 21
long sum(long *A, int N) {
    long result = 0;
    for (int i = 0; i < N; ++i)
        result += A[i];
    return result;
}

sum:
    ...
the_loop:
    ...
    leaq    0(,%rax,8), %rdx    // offset = i * 8
    movq    -24(%rbp), %rax     // get A from stack
    addq    %rdx, %rax          // add offset
    movq    (%rax), %rax        // get *(A+offset)
    addq    %rax, -8(%rbp)      // add to sum, on stack
    addl    $1, -12(%rbp)       // increment i
condition:
    movl    -12(%rbp), %eax
    cmpl    -28(%rbp), %eax
    jl      the_loop

example assembly (gcc 5.4 -Os) 22
long sum(long *A, int N) {
    long result = 0;
    for (int i = 0; i < N; ++i)
        result += A[i];
    return result;
}

sum:
    xorl    %edx, %edx
    xorl    %eax, %eax
the_loop:
    cmpl    %edx, %esi
    jle     done
    addq    (%rdi,%rdx,8), %rax
    incq    %rdx
    jmp     the_loop
done:
    ret

example assembly (gcc 5.4 -O2) 23
long sum(long *A, int N) {
    long result = 0;
    for (int i = 0; i < N; ++i)
        result += A[i];
    return result;
}

sum:
    testl   %esi, %esi
    jle     return_zero
    leal    -1(%rsi), %eax
    leaq    8(%rdi,%rax,8), %rdx    // rdx = end of A
    xorl    %eax, %eax
the_loop:
    addq    (%rdi), %rax            // add to sum
    addq    $8, %rdi                // advance pointer
    cmpq    %rdx, %rdi
    jne     the_loop
    rep ret
return_zero:
    ...

optimizing compilers 24
these usually make your code fast
often not done by default (GCC performs almost no optimization unless given -O1/-O2/-O3)
compilers and humans are good at different kinds of optimizations

compiler limitations 25
needs to generate code that does the same thing
    even in corner cases that obviously don't matter
often doesn't look into a method
    needs to assume it might do anything
can't predict what inputs/values will be
    e.g. lots of loop iterations or few?
can't understand code size versus speed tradeoffs

aliasing 27
void twiddle(long *px, long *py) {
    *px += *py;
    *px += *py;
}

the compiler cannot generate this:
twiddle: // BROKEN
    // %rdi = px, %rsi = py
    movq    (%rsi), %rax    // rax <- *py
    addq    %rax, %rax      // rax <- 2 * *py
    addq    %rax, (%rdi)    // *px <- *px + 2 * *py
    ret

aliasing problem 28
void twiddle(long *px, long *py) {
    *px += *py;
    *px += *py;
    // NOT the same as *px += 2 * *py;
}
...
long x = 1;
twiddle(&x, &x); // result should be 4, not 3

twiddle: // BROKEN
    // %rdi = px, %rsi = py
    movq    (%rsi), %rax    // rax <- *py
    addq    %rax, %rax      // rax <- 2 * *py
    addq    %rax, (%rdi)    // *px <- *px + 2 * *py
    ret
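
A complete, runnable version of this example for experimenting; only twiddle comes from the slide, the main function is added here:

#include <stdio.h>

void twiddle(long *px, long *py) {
    *px += *py;
    *px += *py;
}

int main(void) {
    long x = 1;
    twiddle(&x, &x);        /* px and py alias */
    printf("%ld\n", x);     /* prints 4; the BROKEN version would compute 3 */
    return 0;
}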

non-contrived aliasing 29
void sumrows1(int *result, int *matrix, int N) {
    for (int row = 0; row < N; ++row) {
        result[row] = 0;
        for (int col = 0; col < N; ++col)
            result[row] += matrix[row * N + col];
    }
}

void sumrows2(int *result, int *matrix, int N) {
    for (int row = 0; row < N; ++row) {
        int sum = 0;
        for (int col = 0; col < N; ++col)
            sum += matrix[row * N + col];
        result[row] = sum;
    }
}
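
To see why the compiler must treat these differently, here is a hypothetical overlapping call (invented here, not from the slides). With result pointing into matrix, sumrows1 and sumrows2 return different answers, so the compiler cannot substitute one for the other:

#include <stdio.h>

/* sumrows1 and sumrows2 exactly as defined above */

int main(void) {
    int m1[4] = {1, 2, 3, 4}, m2[4] = {1, 2, 3, 4};
    sumrows1(m1, m1, 2);    /* result overlaps matrix */
    sumrows2(m2, m2, 2);
    /* sumrows1's result[0] = 0 clears matrix[0] before row 0 sums it,
       so the two versions produce different row sums: */
    printf("%d %d vs. %d %d\n", m1[0], m1[1], m2[0], m2[1]);  /* 2 7 vs. 3 7 */
    return 0;
}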

aliasing and performance (1) / GCC 5.4 -O2 30
[graph: cycles/count (0.0 to 3.0) versus N (0 to 1000), comparing the sumrows versions]

aliasing and performance (2) / GCC 5.4 -O3 31
[graph: cycles/count (0.0 to 3.0) versus N (0 to 1000), comparing the sumrows versions]

automatic register reuse 32
compiler would need to generate an overlap check:

if (result >= matrix + N * N || result + N <= matrix) {
    /* no overlap: */
    for (int row = 0; row < N; ++row) {
        int sum = 0; /* kept in register */
        for (int col = 0; col < N; ++col)
            sum += matrix[row * N + col];
        result[row] = sum;
    }
} else {
    for (int row = 0; row < N; ++row) {
        result[row] = 0;
        for (int col = 0; col < N; ++col)
            result[row] += matrix[row * N + col];
    }
}

aliasing and cache optimizations 33
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i * N + k] * A[k * N + j];

for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            B[i*N+j] += A[i * N + k] * A[k * N + j];

what if B = A? B = &A[10]?
compiler can't generate the same code for both loop orders

aliasing problems with cache blocking 34
for (int k = 0; k < N; k++) {
    for (int i = 0; i < N; i += 2) {
        for (int j = 0; j < N; j += 2) {
            B[(i+0)*N + j+0] += A[i*N+k]     * A[k*N+j];
            B[(i+1)*N + j+0] += A[(i+1)*N+k] * A[k*N+j];
            B[(i+0)*N + j+1] += A[i*N+k]     * A[k*N+j+1];
            B[(i+1)*N + j+1] += A[(i+1)*N+k] * A[k*N+j+1];
        }
    }
}

can the compiler keep A[i*N+k] in a register?

register blocking 35
for (int k = 0; k < N; ++k) {
    for (int i = 0; i < N; i += 2) {
        float Ai0k = A[(i+0)*N + k];
        float Ai1k = A[(i+1)*N + k];
        for (int j = 0; j < N; j += 2) {
            float Akj0 = A[k*N + j+0];
            float Akj1 = A[k*N + j+1];
            B[(i+0)*N + j+0] += Ai0k * Akj0;
            B[(i+1)*N + j+0] += Ai1k * Akj0;
            B[(i+0)*N + j+1] += Ai0k * Akj1;
            B[(i+1)*N + j+1] += Ai1k * Akj1;
        }
    }
}

avoiding redundant loads summary 36
move the repeated load outside of the loop:
    create a local variable to hold the loaded value
    this effectively tells the compiler the value is not aliased

aside: the restrict hint 37
C has a keyword restrict for pointers
    "I promise this pointer doesn't alias another"
    (if it does: undefined behavior)
maybe this will help the compiler do the optimization itself?

void square(float * restrict B, float * restrict A) {
    ...
}
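
For example (a sketch, with sumrows3 as a name invented here): restrict-qualifying the sumrows parameters promises the compiler that result and matrix never overlap, so it may keep the running sum in a register without generating an overlap check:

void sumrows3(int * restrict result, const int * restrict matrix, int N) {
    for (int row = 0; row < N; ++row) {
        result[row] = 0;    /* compiler may now keep this in a register */
        for (int col = 0; col < N; ++col)
            result[row] += matrix[row * N + col];
    }
}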

compiler limitations 38
needs to generate code that does the same thing
    even in corner cases that obviously don't matter
often doesn't look into a method
    needs to assume it might do anything
can't predict what inputs/values will be
    e.g. lots of loop iterations or few?
can't understand code size versus speed tradeoffs

loop with a function call 39
int addwithlimit(int x, int y) {
    int total = x + y;
    if (total > 10000)
        return 10000;
    else
        return total;
}
...
int sum(int *array, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum = addwithlimit(sum, array[i]);
    return sum;
}

function call assembly 40
movl    (%rbx), %esi    // mov array[i]
movl    %eax, %edi      // mov sum
call    addwithlimit

extra instructions executed: two moves, a call, and a ret

manual inlining 41
int sum(int *array, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum = sum + array[i];
        if (sum > 10000)
            sum = 10000;
    }
    return sum;
}

inlining pro/con 42
avoids call, ret, extra move instructions
allows compiler to use more registers
    no caller-saved register problems
but not always faster:
    worse for the instruction cache (more copies of function body code)

compiler inlining 43
compilers will inline, but will usually avoid making code much bigger
    heuristic: inline if function is small enough
    heuristic: inline if called exactly once
will usually not inline across .o files
some compilers allow hints to say "please inline/do not inline this function"
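
For instance, GCC accepts attribute-based hints; always_inline and noinline are real GCC attributes, while the function bodies below are just illustrations:

/* ask GCC to inline this wherever it is called, even without optimization */
static inline __attribute__((always_inline))
int addwithlimit(int x, int y) {
    int total = x + y;
    return total > 10000 ? 10000 : total;
}

/* ask GCC never to inline this function */
__attribute__((noinline))
int helper(int x) {
    return x + 1;   /* placeholder body */
}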

remove redundant operations (1) 44
char number_of_as(const char *str) {
    int count = 0;
    for (int i = 0; i < strlen(str); ++i) {
        if (str[i] == 'a')
            count++;
    }
    return count;
}

remove redundant operations (1, fix) 45
int number_of_as(const char *str) {
    int count = 0;
    int length = strlen(str);
    for (int i = 0; i < length; ++i) {
        if (str[i] == 'a')
            count++;
    }
    return count;
}

call strlen once, not once per character: a Big-Oh improvement!

remove redundant operations (2) 46
void shiftarray(int *source, int *dest, int N, int amount) {
    for (int i = 0; i < N; ++i) {
        if (i + amount < N)
            dest[i] = source[i + amount];
        else
            dest[i] = source[N - 1];
    }
}

compares i + amount to N many times

remove redundant operations (2, fix) 47
void shiftarray(int *source, int *dest, int N, int amount) {
    int i;
    for (i = 0; i + amount < N; ++i) {
        dest[i] = source[i + amount];
    }
    for (; i < N; ++i) {
        dest[i] = source[N - 1];
    }
}

eliminates the repeated comparisons

compiler limitations 48
needs to generate code that does the same thing
    even in corner cases that obviously don't matter
often doesn't look into a method
    needs to assume it might do anything
can't predict what inputs/values will be
    e.g. lots of loop iterations or few?
can't understand code size versus speed tradeoffs

loop unrolling (ASM) 49
loop:
    cmpl    %edx, %esi
    jle     endofloop
    addq    (%rdi,%rdx,8), %rax
    incq    %rdx
    jmp     loop
endofloop:

unrolled twice:
loop:
    cmpl    %edx, %esi
    jle     endofloop
    addq    (%rdi,%rdx,8), %rax
    addq    8(%rdi,%rdx,8), %rax
    addq    $2, %rdx
    jmp     loop
    // plus handle leftover?
endofloop:

loop unrolling (C) 50
for (int i = 0; i < N; ++i)
    sum += A[i];

int i;
for (i = 0; i + 1 < N; i += 2) {
    sum += A[i];
    sum += A[i+1];
}
// handle leftover, if needed
if (i < N)
    sum += A[i];

more loop unrolling (C) 51
int i;
for (i = 0; i + 4 <= N; i += 4) {
    sum += A[i];
    sum += A[i+1];
    sum += A[i+2];
    sum += A[i+3];
}
// handle leftover, if needed
for (; i < N; i += 1)
    sum += A[i];

automatic loop unrolling 52
loop unrolling is easy for compilers
but often not done, or not done very much. why not?
    slower if small number of iterations
    larger code: could exceed instruction cache space

loop unrolling performance 53
on my laptop, with 992 elements (fits in L1 cache):

times unrolled   cycles/element   instructions/element
      1               1.33                4.02
      2               1.03                2.52
      4               1.02                1.77
      8               1.01                1.39
     16               1.01                1.21
     32               1.01                1.15

1.01 cycles/element: latency bound
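
For reference, a sketch of how numbers like these might be gathered on x86-64 with GCC or Clang; __rdtsc (from x86intrin.h) reads the timestamp counter, and a serious benchmark would add warmup, repeated trials, and serializing instructions:

#include <stdio.h>
#include <x86intrin.h>      /* __rdtsc */

#define N 992               /* fits in L1 cache, as on the slide */
#define TRIALS 1000

long sum4(const long *A, int n) {   /* unrolled 4 times */
    long sum = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        sum += A[i];
        sum += A[i+1];
        sum += A[i+2];
        sum += A[i+3];
    }
    for (; i < n; ++i)
        sum += A[i];
    return sum;
}

int main(void) {
    static long A[N];
    for (int i = 0; i < N; ++i) A[i] = i;
    volatile long sink;     /* keep the calls from being optimized away */
    unsigned long long start = __rdtsc();
    for (int t = 0; t < TRIALS; ++t)
        sink = sum4(A, N);
    unsigned long long elapsed = __rdtsc() - start;
    (void)sink;
    printf("%.2f cycles/element\n", (double)elapsed / ((double)TRIALS * N));
    return 0;
}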