void twiddle1(int *xp, int *yp) { void twiddle2(int *xp, int *yp) {


Bartholomew Hill
5 years ago

1 Optimization

2

    void twiddle1(int *xp, int *yp) {
        *xp += *yp;
        *xp += *yp;
    }

    void twiddle2(int *xp, int *yp) {
        *xp += 2 * *yp;
    }

3

    int main(void) {
        int x = 3;
        int y = 3;
        twiddle1(&x, &y);
        x = 3;
        y = 3;
        twiddle2(&x, &y);
        x = 3;
        twiddle1(&x, &x);   /* both pointers refer to x */
        x = 3;
        twiddle2(&x, &x);   /* both pointers refer to x */
        return 0;
    }
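The calls in main differ only in whether the two pointers refer to the same variable. The sketch below works out the results; `run_distinct` and `run_aliased` are hypothetical helpers added for illustration and are not part of the slides.

```c
/* Adds *yp to *xp twice, rereading *yp each time. */
void twiddle1(int *xp, int *yp) {
    *xp += *yp;
    *xp += *yp;
}

/* Adds 2 * *yp to *xp in a single step. */
void twiddle2(int *xp, int *yp) {
    *xp += 2 * *yp;
}

/* Hypothetical helper: apply fn to two distinct variables, return the new x. */
int run_distinct(void (*fn)(int *, int *), int x0, int y0) {
    int x = x0, y = y0;
    fn(&x, &y);
    return x;
}

/* Hypothetical helper: apply fn with both pointers aimed at one variable. */
int run_aliased(void (*fn)(int *, int *), int x0) {
    int x = x0;
    fn(&x, &x);
    return x;
}
```

With distinct arguments both functions turn x = 3, y = 3 into x = 9. With aliased arguments twiddle1 doubles x twice (3, 6, 12) while twiddle2 triples it (3, 9), so a compiler cannot replace one with the other unless it knows the pointers are distinct.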

4

    $ mv example.s example_unop.s
    $ gcc -O1 -S example.c
    $ mv example.s example_o1.s
    $ gcc -O2 -S example.c
    $ mv example.s example_o2.s
    $ gcc -O3 -S example.c
    $ mv example.s example_o3.s

5

    $ wc example_*

[wc output for example_o1.s, example_o2.s, example_o3.s, example_unop.s, and total; the counts were not preserved.]

6 [Assembly listings for main, side by side. One keeps x and y in stack slots at -8(%rbp) and -4(%rbp), reloads them before every printf, and calls twiddle1. The other keeps them at 12(%rsp) and 8(%rsp), passes their addresses with leaq/movq, calls twiddle1, and prints through __printf_chk.]

7 [Assembly listings for main, side by side. One still calls twiddle1 and reloads 12(%rsp) and 8(%rsp) afterwards. In the other the call is gone: before the call the constants $3 and $3 are loaded into %ecx and %edx, and after it the constants $3 and $9, each followed directly by __printf_chk.]

8 Side Effects

Functions can modify values outside their own scope, which makes their behavior hard to predict.

    int f() {
        return counter++;
    }

GCC does an OK job of dealing with these cases, but not a great one. This is why it makes sense to write code that is easier for GCC to optimize.
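A minimal sketch of why a side effect such as counter++ blocks an otherwise obvious rewrite. The names `func1`, `func2`, `reset_counter`, and `get_counter` are assumptions made for this illustration; only `f` and `counter` come from the slide.

```c
long counter = 0;

/* Returns the current count and increments it: a side effect on global state. */
long f(void) {
    return counter++;
}

/* Calls f four times, so counter advances by 4. */
long func1(void) {
    return f() + f() + f() + f();
}

/* Calls f once, so counter advances by 1. */
long func2(void) {
    return 4 * f();
}

/* Hypothetical helpers for experimenting with the two versions. */
void reset_counter(void) { counter = 0; }
long get_counter(void)   { return counter; }
```

Starting from counter = 0, func1 returns 0 + 1 + 2 + 3 = 6 and leaves counter at 4, while func2 returns 0 and leaves counter at 1. The compiler therefore may not turn func1 into func2 unless it can see that f has no side effects.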

9 Code to sum a vector. Who is faster?

    void psum1(float a[], float p[], long int n) {
        long int i;
        p[0] = a[0];
        for (i = 1; i < n; i++)
            p[i] = p[i-1] + a[i];
    }

    void psum2(float a[], float p[], long int n) {
        long int i;
        p[0] = a[0];
        for (i = 1; i < n-1; i += 2) {
            float mid_val = p[i-1] + a[i];
            p[i] = mid_val;
            p[i+1] = mid_val + a[i+1];
        }
        /* For even n, finish the remaining element */
        if (i < n)
            p[i] = p[i-1] + a[i];
    }
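Before comparing speed it is worth confirming that the two versions compute the same running sums. The sketch below repeats both functions and adds a hypothetical checker, `psums_agree`, which is not from the slides; it uses small integer-valued inputs so that every float sum is exact.

```c
/* Running sum, one element per iteration. */
void psum1(float a[], float p[], long int n) {
    long int i;
    p[0] = a[0];
    for (i = 1; i < n; i++)
        p[i] = p[i-1] + a[i];
}

/* Running sum, two elements per iteration. */
void psum2(float a[], float p[], long int n) {
    long int i;
    p[0] = a[0];
    for (i = 1; i < n-1; i += 2) {
        float mid_val = p[i-1] + a[i];
        p[i] = mid_val;
        p[i+1] = mid_val + a[i+1];
    }
    /* For even n, one element is left over. */
    if (i < n)
        p[i] = p[i-1] + a[i];
}

/* Hypothetical check: returns 1 if both versions give identical output
   for a = 1, 2, ..., n, where 1 <= n <= 64; returns 0 otherwise. */
int psums_agree(long int n) {
    float a[64], p1[64], p2[64];
    long int i;
    if (n < 1 || n > 64)
        return 0;
    for (i = 0; i < n; i++)
        a[i] = (float)(i + 1);
    psum1(a, p1, n);
    psum2(a, p2, n);
    for (i = 0; i < n; i++)
        if (p1[i] != p2[i])
            return 0;
    return 1;
}
```

Trying both odd and even n exercises the loop body and the trailing if in psum2.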

10 Code to sum a vector. Who is faster?

[Plot: cycles versus number of elements. psum1 slope = 10.0; the psum2 slope was not preserved.]

11 Combine (comments?)

    typedef struct {
        long int len;
        data_t *data;
    } vec_rec, *vec_ptr;

    #define IDENT 0
    #define OP +
    //#define IDENT 1
    //#define OP *

    void combine1(vec_ptr v, data_t *dest) {
        long int i;
        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {
            data_t val;
            get_vec_element(v, i, &val);
            *dest = *dest OP val;
        }
    }

    int get_vec_element(vec_ptr v, long int index, data_t *dest) {
        /* Bounds check on every access */
        if (index < 0 || index >= v->len)
            return 0;
        *dest = v->data[index];
        return 1;
    }

12 Combine (comments?)

    void combine1(vec_ptr v, data_t *dest) {
        long int i;
        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {
            data_t val;
            get_vec_element(v, i, &val);
            *dest = *dest OP val;
        }
    }

[Table of measurements for integer (+, *) and floating point (+, F*, D*). Rows: combine1 (unoptimized), combine1 (-O). The measured values were not preserved.]

13 Move length, reducing calls

    void combine2(vec_ptr v, data_t *dest) {
        long int i;
        long int len = vec_length(v);   /* computed once, outside the loop */
        *dest = IDENT;
        for (i = 0; i < len; i++) {
            data_t val;
            get_vec_element(v, i, &val);
            *dest = *dest OP val;
        }
    }

[Table of measurements for integer (+, *) and floating point (+, F*, D*). Rows: combine1 (unoptimized), combine1 (-O), combine2 (move length). The measured values were not preserved.]

14 Eliminating Calls

    void combine3(vec_ptr v, data_t *dest) {
        long int i;
        long int len = vec_length(v);
        data_t *data = get_vec_start(v);   /* direct access to the array */
        *dest = IDENT;
        for (i = 0; i < len; i++) {
            *dest = *dest OP data[i];
        }
    }

[Table of measurements for integer (+, *) and floating point (+, F*, D*). Rows: combine1 (unoptimized), combine1 (-O), combine2 (move length), combine3 (direct access). The measured values were not preserved.]
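The combine functions rely on a vector type and helpers that the slides only partly show. The following is a minimal self-contained sketch for experimenting with them. It assumes `data_t` is `long` and supplies its own `new_vec`, `free_vec`, `vec_length`, and `get_vec_start`; those bodies, and the checker `combine1_matches_combine3`, are assumptions for illustration, not the course's actual implementation.

```c
#include <stdlib.h>

typedef long data_t;            /* assumption: element type */
#define IDENT 0
#define OP +

typedef struct {
    long int len;
    data_t *data;
} vec_rec, *vec_ptr;

/* Assumed helper: allocate a zero-filled vector, or return NULL on failure. */
vec_ptr new_vec(long int len) {
    vec_ptr v = malloc(sizeof(vec_rec));
    if (!v) return NULL;
    v->len = len;
    v->data = NULL;
    if (len > 0) {
        v->data = calloc((size_t)len, sizeof(data_t));
        if (!v->data) { free(v); return NULL; }
    }
    return v;
}

/* Assumed helper: release a vector created by new_vec. */
void free_vec(vec_ptr v) {
    if (v) { free(v->data); free(v); }
}

long int vec_length(vec_ptr v)  { return v->len; }
data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Bounds-checked element read; returns 0 if index is out of range. */
int get_vec_element(vec_ptr v, long int index, data_t *dest) {
    if (index < 0 || index >= v->len) return 0;
    *dest = v->data[index];
    return 1;
}

/* Baseline: two function calls per iteration. */
void combine1(vec_ptr v, data_t *dest) {
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

/* Length hoisted and elements read directly: no calls inside the loop. */
void combine3(vec_ptr v, data_t *dest) {
    long int i;
    long int len = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < len; i++)
        *dest = *dest OP data[i];
}

/* Hypothetical check: fill with 1..n and compare both results to n(n+1)/2. */
int combine1_matches_combine3(long int n) {
    vec_ptr v = new_vec(n);
    data_t r1, r3;
    long int i;
    if (!v) return 0;
    for (i = 0; i < n; i++) v->data[i] = i + 1;
    combine1(v, &r1);
    combine3(v, &r3);
    free_vec(v);
    return r1 == r3 && r1 == n * (n + 1) / 2;
}
```

Dropping the per-element bounds check is what makes combine3 cheaper, so it is only safe when the loop bounds are already known to be valid, as they are here.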

15 Accumulate in local.

    void combine4(vec_ptr v, data_t *dest) {
        long int i;
        long int len = vec_length(v);
        data_t *data = get_vec_start(v);
        data_t acc = IDENT;              /* accumulate in a local variable */
        for (i = 0; i < len; i++) {
            acc = acc OP data[i];
        }
        *dest = acc;                     /* single store at the end */
    }

[Table of measurements for integer (+, *) and floating point (+, F*, D*). Rows: combine1 (unoptimized), combine1 (-O), combine2 (move length), combine3 (direct access), combine4 (accum in local). The measured values were not preserved.]
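Accumulating in a local is a change the programmer usually has to make by hand, because the version that writes through dest on every iteration behaves differently when dest points into the data. The sketch below shows this on bare arrays with IDENT 1 and OP * (the alternative definitions commented out on the Combine slide). The `_array` variants and the driver `last_after` are simplifications invented for this example.

```c
typedef long data_t;            /* assumption: element type */
#define IDENT 1
#define OP *

/* combine3-style loop: reads and writes *dest on every iteration. */
void combine3_array(data_t *data, long int len, data_t *dest) {
    long int i;
    *dest = IDENT;
    for (i = 0; i < len; i++)
        *dest = *dest OP data[i];
}

/* combine4-style loop: accumulates in a local, stores once at the end. */
void combine4_array(data_t *data, long int len, data_t *dest) {
    long int i;
    data_t acc = IDENT;
    for (i = 0; i < len; i++)
        acc = acc OP data[i];
    *dest = acc;
}

/* Hypothetical driver: run fn on {2, 3, 5} with dest aliased to the last
   element, and return what that element holds afterwards. */
data_t last_after(void (*fn)(data_t *, long int, data_t *)) {
    data_t d[3] = {2, 3, 5};
    fn(d, 3, &d[2]);
    return d[2];
}
```

With the aliased destination, the combine3-style loop first overwrites the last element with 1 and ends at 2 * 3 * 6 = 36, while the combine4-style loop reads the original 5 and ends at 2 * 3 * 5 = 30. Since the two can differ, the compiler has to keep the memory traffic in combine3.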

16 Obvious Optimizations.

[Table of measurements for integer (+, *) and floating point (+, F*, D*). Rows: combine1 (unoptimized), combine1 (-O), combine2 (move length), combine3 (direct access), combine4 (accum in local). The measured values were not preserved.]

17 Non-obvious Optimizations.

[Same table as on slide 16.]

18 Machine Specific Optimizations.

[Same table as on slide 16.]

19 Modern CPU Design

[Block diagram. Instruction Control: fetch control, instruction decode, instruction cache, retirement unit, register file. Execution: functional units (integer/branch, general integer, FP add, FP mult/div, load, store) and data cache. Labeled connections: Address, Instructions, Operations, Register Updates, Prediction OK?, Operation Results, Addr., Data.]

20 Superscalar Processor

Definition: A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.

Benefit: Without programming effort, a superscalar processor can take advantage of the instruction-level parallelism that most programs have.

Most CPUs since about 1998 are superscalar. Intel: since Pentium Pro.

21 Pentium 4 Nocona CPU

Multiple instructions can execute in parallel:
  1 load, with address computation
  1 store, with address computation
  2 simple integer (one may be branch)
  1 complex integer (multiply/divide)
  1 FP/SSE3 unit
  1 FP move (does all conversions)

Some instructions take > 1 cycle, but can be pipelined.

    Instruction                  Latency   Cycles/Issue
    Load / Store                 5         1
    Integer Multiply             10        1
    Integer/Long Divide          36/106    36/106
    Single/Double FP Multiply    7         2
    Single/Double FP Add         5         2
    Single/Double FP Divide      32/46     32/46

22 Latency versus Throughput

Last slide: Integer Multiply has latency 10 and cycles/issue 1.

[Diagram: a pipeline of 10 steps (Step 1, Step 2, ..., Step 10), 1 cycle each.]

Consequence:
  How fast can 10 independent int mults be executed?
      t1 = t2*t3; t4 = t5*t6; ...
  How fast can 10 sequentially dependent int mults be executed?
      t1 = t2*t3; t4 = t5*t1; t6 = t7*t4; ...

Major problem for fast execution: keep the pipelines filled.
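The two questions can be worked out with a simple idealized model: one fully pipelined unit, no other stalls, and operands ready at the start. Under those assumptions (which are this sketch's, not measurements from the slides), independent operations overlap in the pipeline while dependent ones must each wait for the previous result. The function names are hypothetical.

```c
/* Idealized cycles for n independent operations on one pipelined unit:
   the first result appears after `latency` cycles, and each further
   operation can start `issue` cycles after the previous one. */
long cycles_independent(long n, long latency, long issue) {
    if (n <= 0) return 0;
    return latency + (n - 1) * issue;
}

/* Idealized cycles for n operations where each one needs the previous
   result: nothing overlaps, so every operation pays the full latency. */
long cycles_dependent(long n, long latency) {
    if (n <= 0) return 0;
    return n * latency;
}
```

For the integer multiply above (latency 10, cycles/issue 1), this model gives 10 + 9 = 19 cycles for 10 independent multiplications and 10 * 10 = 100 cycles for 10 dependent ones.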

23 Hard Bounds

Latency and throughput of instructions:

    Instruction                  Latency   Cycles/Issue
    Load / Store                 5         1
    Integer Multiply             10        1
    Integer/Long Divide          36/106    36/106
    Single/Double FP Multiply    7         2
    Single/Double FP Add         5         2
    Single/Double FP Divide      32/46     32/46

How many cycles at least if
  the function requires n int mults?
  the function requires n float adds?
  the function requires n float ops (adds and mults)?

24 Nocona vs. Core 2

Nocona (3.2 GHz)

    Instruction                  Latency   Cycles/Issue
    Load / Store                 5         1
    Integer Multiply             10        1
    Integer/Long Divide          36/106    36/106
    Single/Double FP Multiply    7         2
    Single/Double FP Add         5         2
    Single/Double FP Divide      32/46     32/46

Core 2 (2.7 GHz) (recent Intel microprocessors)

    Instruction                  Latency   Cycles/Issue
    Load / Store                 5         1
    Integer Multiply             3         1
    Integer/Long Divide          18/50     18/50
    Single/Double FP Multiply    4/5       1
    Single/Double FP Add         3         1
    Single/Double FP Divide      18/32     18/32
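Because the two processors run at different clock rates, cycle counts alone do not give the time per operation. A small sketch of the conversion follows; `cycles_to_ns` is a hypothetical helper, and the only inputs used are the clock rates and latencies listed in the tables.

```c
/* Convert a cycle count to nanoseconds for a clock given in GHz.
   One cycle lasts 1/ghz nanoseconds. */
double cycles_to_ns(double cycles, double ghz) {
    return cycles / ghz;
}
```

An integer multiply takes 10 cycles at 3.2 GHz on Nocona, about 3.1 ns, and 3 cycles at 2.7 GHz on Core 2, about 1.1 ns, so the lower-clocked Core 2 still finishes the operation sooner.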

25 Instruction Control

[Diagram: Instruction Control block with fetch control, instruction decode, instruction cache, retirement unit, and register file. Labeled connections: Address, Instructions, Operations.]

Grabs instruction bytes from memory
  Based on current PC + predicted targets for predicted branches
  Hardware dynamically guesses whether branches are taken/not taken and (possibly) the branch target

Translates instructions into micro-operations (for CISC-style CPUs)
  Micro-op = primitive step required to perform an instruction
  Typical instruction requires 1-3 operations

Converts register references into tags
  Tag = abstract identifier linking the destination of one operation with the sources of later operations

26 Translating into Micro-Operations

    imulq %rax, 8(%rbx,%rdx,4)

Goal: Each operation utilizes a single functional unit.
Requires: load, integer arithmetic, store.

    load  8(%rbx,%rdx,4)        -> temp1
    imulq %rax, temp1           -> temp2
    store temp2, 8(%rbx,%rdx,4)

The exact form and format of the operations is a trade secret.

27 Traditional View of Instruction Execution

    addq  %rax, %rbx   # I1
    andq  %rbx, %rdx   # I2
    imulq %rcx, %rbx   # I3
    xorq  %rbx, %rdi   # I4

[Diagram: registers rax, rbx, rdx, rcx, rdi feeding the operations I1 (+), I2 (&), I3 (*), I4 (^) in program order.]

Imperative View
  Registers are fixed storage locations
  Individual instructions read and write them
  Instructions must be executed in the specified sequence to guarantee proper program behavior

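28 Dataflow View of Instruction Execution

    addq  %rax, %rbx   # I1
    andq  %rbx, %rdx   # I2
    imulq %rcx, %rbx   # I3
    xorq  %rbx, %rdi   # I4

[Diagram: dataflow graph. I1 (+) reads rax.0 and rbx.0 and produces rbx.1. I2 (&) reads rbx.1 and rdx.0 and produces rdx.1. I3 (*) reads rcx.0 and rbx.1 and produces rbx.2. I4 (^) reads rbx.2 and rdi.0 and produces rdi.1. I2 and I3 sit side by side.]

Functional View
  View each write as creating a new instance of the value
  Operations can be performed as soon as their operands are available
  No need to execute in the original sequence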
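The version tags in the dataflow view (rbx.0, rbx.1, ...) can be produced mechanically: keep a counter per register, read the current version of each source, and bump the counter on each write. The sketch below is a toy model of that bookkeeping, not a description of how any real processor implements renaming. All type and function names are invented for the example, and the third instruction is modeled simply as a two-operand multiply.

```c
#include <stdio.h>

enum { RAX, RBX, RCX, RDX, RDI, NREGS };
const char *reg_name[NREGS] = {"rax", "rbx", "rcx", "rdx", "rdi"};

/* Two-operand form: dst = dst op src. */
typedef struct { char op; int src; int dst; } insn;

/* Versions read and written by one instruction. */
typedef struct { int src_ver; int dst_in_ver; int dst_out_ver; } renamed;

/* The four instructions from the slide, I1..I4. */
const insn slide_code[4] = {
    {'+', RAX, RBX},   /* I1: add       */
    {'&', RBX, RDX},   /* I2: and       */
    {'*', RCX, RBX},   /* I3: multiply  */
    {'^', RBX, RDI},   /* I4: xor       */
};

/* Walk the code in program order. Each instruction reads the current
   version of both operands, then creates a new version of dst. */
void rename_regs(const insn *code, int n, renamed *out) {
    int version[NREGS] = {0};
    int i;
    for (i = 0; i < n; i++) {
        out[i].src_ver = version[code[i].src];
        out[i].dst_in_ver = version[code[i].dst];
        out[i].dst_out_ver = ++version[code[i].dst];
    }
}

/* Write one renamed instruction as text, e.g. "rbx.1 = rbx.0 + rax.0". */
void format_renamed(const insn *c, const renamed *r, char *buf, size_t size) {
    snprintf(buf, size, "%s.%d = %s.%d %c %s.%d",
             reg_name[c->dst], r->dst_out_ver,
             reg_name[c->dst], r->dst_in_ver,
             c->op,
             reg_name[c->src], r->src_ver);
}
```

Run over slide_code, this yields rbx.1 for I1, rdx.1 for I2, rbx.2 for I3, and rdi.1 for I4. I2 and I3 both depend only on rbx.1 and version-0 registers, so neither has to wait for the other, while I4 needs rbx.2 and must follow I3.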

CS 33. Architecture and Optimization (2) CS33 Intro to Computer Systems XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.
