Martin Kruliš, v


1 Martin Kruliš

2 Optimizations in General
  Code And Compilation
  Memory Considerations
  Parallelism
  Profiling And Optimization Examples

3 Premature optimization is the root of all evil. -- D. Knuth
  Our goal is to find possible bottlenecks
    The rest of the system does not require optimization
  Optimization has many levels
    Network traffic is more expensive than local storage
    HDD is much slower than RAM
    Today's Agenda:
      RAM is much slower than CPU
      CPU speed depends on many details of your code

4 The Optimization Cycle
    Run the profiler
      Identify the slowest part of the code
    Try to optimize
      Choose a better algorithm
      Write better code or modify the compiler configuration
      Redesign your memory access patterns
      Consider parallelization
    Re-run the profiler
      Evaluate the impact of your optimizations
      Start over again
  [Diagram: cycle Profile -> Identify -> Optimize -> Evaluate, repeated while the code is (still) too slow]

5 CPUs of The Day
    Very complex (especially instruction scheduling)
    Difficult to write efficient code by hand
  Compilers of The Day
    Usually better than programmers
    Sometimes they need assistance
      Neat code
      Compiler hints
      Compiler configuration
      Inline assembler

6 Procedural And Object-based Languages
    Compiled Languages
      C/C++ - best choice for HPC so far
      Java/C# - slower, two-stage compilation, GC issues
      Pascal, Fortran, ... - obsolete
    Interpreted Languages
      Perl, Python, PHP, ...
      Too slow for high-performance applications
  Non-procedural Languages
    Only for very specific purposes

7 Choosing The Compiler
    MSVC, GCC, ICC, ...
    ICC is usually slightly better than the others
  Proper Configuration
    Use all switches that turn optimizations on (-O2)
    Turn off debugging symbols and features
    Optimize for the target architecture
    Activate SSE instructions, if possible
    Locally activate optimizations for certain functions
      #pragma optimize

8 Pipeline Concept
    Each instruction is executed in many steps
      Decode, fetch operands, execute, write results, ...
    Some of these steps may be processed concurrently

      I1  decode  fetch   execute write
      I2          decode  fetch   execute write
      I3                  decode  fetch   execute write
      I4                          decode  fetch   execute write
      I5                                  decode  fetch   execute write

9 Branching Problem
    Occurs in conditional jump or loop instructions
    The CPU does not know which instruction to load next until the jump instruction is executed
    The pipeline stalls
  Solutions
    CPU level - branch prediction
      The processor tries to guess the right branch
    Code level - avoiding branches
      Sometimes we can use another instruction instead
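
The "code level" solution can be illustrated with a small sketch. The function names and the thresholded-sum task are hypothetical, chosen only for illustration. The second version replaces the conditional jump in the loop body with a mask computed from the comparison. Note that an optimizing compiler may already emit branch-free code for the first version, so any benefit should be confirmed with a profiler.

```cpp
#include <cstddef>
#include <cstdint>

// Branchy version: the loop body contains a data-dependent conditional jump.
std::int64_t sum_above_branchy(const std::int32_t* data, std::size_t n,
                               std::int32_t threshold) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] > threshold) {
            sum += data[i];
        }
    }
    return sum;
}

// Branch-free version: the comparison yields 0 or 1, which is negated into
// an all-zeros or all-ones mask; the mask then selects either 0 or data[i].
std::int64_t sum_above_branchless(const std::int32_t* data, std::size_t n,
                                  std::int32_t threshold) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::int64_t mask = -static_cast<std::int64_t>(data[i] > threshold);
        sum += static_cast<std::int64_t>(data[i]) & mask;
    }
    return sum;
}
```

Both functions return the same result; the second simply trades the jump for a few arithmetic instructions, which tends to pay off only when the branch is hard to predict.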

10 Loop Unrolling
     Repeat code inside a loop
     Reduces the number of iterations, and thus the loop overhead
     Makes better use of the pipeline
     Exposes operation independence (if any)
   The compiler attempts to unroll loops automatically
     Sometimes requires a hint or an option flag

   // original loop
   for (int i = 0; i < N; ++i) {
       fnc(data[i]);
   }

   // unrolled four times (assumes N is a multiple of 4)
   for (int i = 0; i < N; i += 4) {
       fnc(data[i]);
       fnc(data[i+1]);
       fnc(data[i+2]);
       fnc(data[i+3]);
   }
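
A minimal sketch of manual unrolling for an arbitrary element count, assuming a simple summation as the loop body (the function name `sum_unrolled` is hypothetical). It uses four independent accumulators so that the additions do not form a single dependency chain, and a short remainder loop for the elements left over when the count is not a multiple of 4.

```cpp
#include <cstddef>
#include <cstdint>

// Sums n values with the loop unrolled four times.
std::int64_t sum_unrolled(const std::int32_t* data, std::size_t n) {
    // Four independent accumulators: the four additions in one iteration
    // do not depend on each other.
    std::int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    std::int64_t sum = s0 + s1 + s2 + s3;
    // Remainder loop: handles the last n % 4 elements.
    for (; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

Compilers often perform this transformation themselves at higher optimization levels, so a hand-unrolled loop is worth keeping only if measurement shows a gain.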

11 Tricks With Bits
     Some problems may be solved by a sequence of elaborate binary tricks instead of a direct algorithm
     Many of them are well known and documented
   E.g., finding the smallest power of 2 that is greater than or equal to x (32-bit unsigned x):

   // direct algorithm
   --x;
   unsigned int res = 1;
   while (x > 0) { x = x >> 1; res = res << 1; }
   return res;

   // bit trick
   --x;
   x |= x >> 1;
   x |= x >> 2;
   x |= x >> 4;
   x |= x >> 8;
   x |= x >> 16;
   return ++x;

   [bithack demo]
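
The two variants can be wrapped into functions and checked against each other. This is a sketch: the function names are hypothetical, and the precondition 1 <= x <= 2^31 is an assumption added here so that the result always fits into 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Direct algorithm: count how many times x - 1 can be shifted right
// and shift the result left the same number of times.
std::uint32_t next_pow2_loop(std::uint32_t x) {
    assert(x >= 1 && x <= 0x80000000u);  // assumed precondition
    --x;
    std::uint32_t res = 1;
    while (x > 0) {
        x >>= 1;
        res <<= 1;
    }
    return res;
}

// Bit trick: copy the highest set bit of x - 1 into all lower positions,
// then add one to reach the next power of two.
std::uint32_t next_pow2_bits(std::uint32_t x) {
    assert(x >= 1 && x <= 0x80000000u);  // assumed precondition
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return ++x;
}
```

The bit-trick version runs a fixed sequence of instructions with no loop and no data-dependent branch, which is what makes it attractive on a pipelined CPU.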

12 Speculative Execution
     Instructions need not be executed in exactly the order in which they were written
     Multiple instructions are executed simultaneously
   Implications
     Loosening data dependencies
     Considering the occupancy of CPU units
     Interleaving different types of instructions

   a = (b + c) * d - (e + f) * g

   tmp1 = b + c
   tmp2 = e + f
   tmp1 = tmp1 * d
   tmp2 = tmp2 * g
   a = tmp1 - tmp2

13 Function Call
     Non-trivial cost (saving registers, stack frames, ...)
   Calling Conventions
     There are many different calling conventions
     They define where the arguments are, where the result will be, who is responsible for the clean-up, ...
     fastcall (not standardized, depends on the compiler)
       Usually tries to put as much as possible into registers
   Inline Functions
     The function body is copied into the calling code instead of being invoked
     The keyword inline hints to the compiler that the function should be inlined
     Avoid recursion in inline functions

14 Including Assembler
     Most C/C++ dialects allow portions of inline asm code to be included
     Each compiler has its own dialect

     asm ( "bswap %%eax;" : "=a"(idx) : "a"(idx) );    // GCC inline assembler

   Problems
     Your assembly code is not optimized by the compiler
     It is difficult to optimize code around your asm block
       The compiler usually generates better code than a human
     Platform (and compiler) independence is lost
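
For contrast with the `bswap` asm block, here is a sketch of the same byte swap written in portable C++ (the function name `byte_swap32` is hypothetical). Optimizing compilers often recognize this shift-and-mask pattern and may emit a single byte-swap instruction for it, though that depends on the compiler and its settings and should be checked in the generated code.

```cpp
#include <cstdint>

// Reverses the byte order of a 32-bit value using only shifts and masks,
// so it compiles on any platform and stays visible to the optimizer.
std::uint32_t byte_swap32(std::uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}
```

Unlike the asm block, this version can be inlined, constant-folded, and scheduled together with the surrounding code.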

15 Caches
     Reduce latency of memory operations
     Work with cache lines
       Aligned memory blocks of 64B
     Usually 2 or 3 levels
       The closer to the CPU, the smaller and faster
     Associativity
       Direct mapped
       n-way set associative
       Fully associative
     CPU has data prefetching techniques
   [Diagram: cache hierarchy with L1 data and L1 instruction caches, L2 cache, L3 cache]
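
The effect of cache lines on access patterns can be sketched with two traversals of the same matrix. The function names are hypothetical and the matrix is assumed to be stored row by row in one contiguous vector. Both functions compute the same sum, but the first touches consecutive addresses (each loaded cache line is fully used), while the second jumps by a whole row between accesses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-by-row traversal: consecutive accesses fall into the same cache line.
std::int64_t sum_row_major(const std::vector<int>& m,
                           std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    std::int64_t sum = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            sum += m[r * cols + c];
        }
    }
    return sum;
}

// Column-by-column traversal: each access is cols elements away from the
// previous one, so for large matrices nearly every access hits a new line.
std::int64_t sum_column_major(const std::vector<int>& m,
                              std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            sum += m[r * cols + c];
        }
    }
    return sum;
}
```

The difference only becomes visible once the matrix no longer fits into the cache, so timing both functions on growing sizes is a simple way to observe the cache levels.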

16 [Plot annotations: L1 spills out, L2 spills out, L3 spills out]

17 Virtual Address Translation
     Each process has its own virtual memory address space
     The OS employs a paging mechanism to map memory
   [Diagram: a virtual address consists of a page address and an offset; address translation maps each 4kB page of virtual memory to a 4kB frame of physical memory]

18 Page Tables
     Address translation mechanism used by IA32 (x86)
     Each translation requires 2 additional memory reads
   [Diagram: the virtual address is split into 10 bits + 10 bits + 12 bits; CR3 points to the level 1 table, its entry points to the level 2 table, and that entry plus the 12-bit offset gives the physical address]
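
The 10 + 10 + 12 bit split can be sketched as plain bit arithmetic. The struct and function names are hypothetical; the code only decomposes an address and does not perform a real translation, which is done by the hardware.

```cpp
#include <cstdint>

// The three parts of a 32-bit virtual address under two-level paging.
struct SplitAddress {
    std::uint32_t level1_index;  // top 10 bits: index into the level 1 table
    std::uint32_t level2_index;  // middle 10 bits: index into the level 2 table
    std::uint32_t offset;        // low 12 bits: offset inside the 4kB page
};

SplitAddress split_virtual_address(std::uint32_t address) {
    SplitAddress parts;
    parts.level1_index = address >> 22;
    parts.level2_index = (address >> 12) & 0x3FFu;
    parts.offset = address & 0xFFFu;
    return parts;
}
```

For example, the address 0x12345678 splits into level 1 index 0x48, level 2 index 0x345, and offset 0x678. Two addresses that differ only in the offset share both table entries, which is why staying within few pages keeps translation cheap.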

19 Virtual Address Translation
     Very expensive
     Even worse in case of 64-bit applications
       64-bit mode uses 4 layers of page tables
   Translation Lookaside Buffer (TLB)
     Associative cache for address translation
     Two levels, TLB L1 size is ~ tens of records
   We should restrict the number of pages that are used simultaneously to minimize the impact of address translation

20 Motivation
     Single core vs. dual core

21 Instruction Level
     SSE instructions (many versions, depending on architecture)
     Processor has special 128-bit registers
       Which can contain 2 doubles, 4 floats, 4 integers, 16 bytes, ...
     SSE instructions process every item in the register
     Much faster operations on float numbers than FPU
   SSE instructions can be generated
     Automatically by the compiler
     By hand in an asm block, or using special libraries (xmmintrin.h)
   [Diagram: addps adds the four floats in xmm0 and the four floats in xmm1 element-wise]
   [SSE demo]
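
A minimal sketch of the hand-written route through `xmmintrin.h`. The function name `add_arrays` and the macro `EXAMPLE_HAVE_SSE` are hypothetical. The intrinsic `_mm_add_ps` corresponds to the `addps` instruction and adds four floats at once; the scalar loop handles the leftover elements and also serves as a fallback on targets without SSE.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define EXAMPLE_HAVE_SSE 1
#endif

// Computes out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#ifdef EXAMPLE_HAVE_SSE
    // Process four floats per iteration in 128-bit registers.
    // Unaligned loads and stores are used so any pointer is acceptable.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar loop for the remaining elements (or for everything without SSE).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

For a loop this simple the compiler is likely to vectorize the scalar version on its own, so intrinsics are mostly useful where automatic vectorization fails.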

22 Threads
     The only way to exploit multiple CPU cores
     The OS schedules threads onto the available cores
   Work Distribution
     1 job ~ 1 thread
       Overhead of starting and disposing of a thread
     Thread pool
       As many threads as available cores
       Jobs are distributed between threads
         Statically
         Dynamically (task stealing)
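
Static work distribution can be sketched with `std::thread`. The function name `parallel_sum` is hypothetical and this is not a real thread pool: threads are created per call, which is exactly the start-up overhead a pool avoids. The input is cut into equally sized chunks, one per thread, and each thread writes its partial result into its own slot.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Sums the data using num_threads threads (0 = one thread per reported core).
std::int64_t parallel_sum(const std::vector<int>& data, unsigned num_threads = 0) {
    if (num_threads == 0) {
        // hardware_concurrency() may return 0 when the value is unknown.
        num_threads = std::max(1u, std::thread::hardware_concurrency());
    }
    std::vector<std::int64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    // Static distribution: chunk boundaries are fixed before any thread starts.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            std::int64_t local = 0;  // accumulate locally, write once at the end
            for (std::size_t i = begin; i < end; ++i) {
                local += data[i];
            }
            partial[t] = local;
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }

    std::int64_t total = 0;
    for (std::int64_t p : partial) {
        total += p;
    }
    return total;
}
```

Static distribution works well when all chunks cost about the same; when they do not, some threads finish early and idle, which is the case dynamic distribution addresses.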

23 Task Stealing
     Each thread has its own task queue
     Tasks are executed non-preemptively
     Tasks must have an appropriate size
       ~ 10,000 instructions
     When a thread runs out of tasks, it steals a task from a random victim
   [Diagram: a thread pool in which each thread has its own task queue]

24 Thread Issues
     We have little control over thread scheduling
     Work distribution has overheads
     Synchronization might be required
   Amdahl's Law
     $\text{speedup} = \dfrac{1}{(1 - P) + \dfrac{P}{N}}$
       P ~ parallelizable part, N ~ #CPUs
     If only 90% of the algorithm is parallelizable, we can never achieve a speedup better than 10x
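
Amdahl's law is easy to evaluate directly. The function name `amdahl_speedup` is hypothetical; `p` is the parallelizable fraction and `n` the number of CPUs, as on the slide.

```cpp
#include <cassert>

// Theoretical speedup for a parallelizable fraction p on n CPUs.
double amdahl_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.9, ten CPUs give a speedup of only about 5.26, and even a million CPUs stay just below 10, because the serial 10% of the work always remains.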

25 Synchronization
     Implicit synchronization
       We avoid race conditions by careful job scheduling
     Atomic operations
       Simple operations (inc, add, cmpxchg, ...)
       Ensured by the architecture
     Synchronization primitives
       Much slower
       Mutual exclusion (mutex, semaphore, ...)
       Barrier
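
The difference between atomic operations and synchronization primitives can be sketched on a shared counter. The function names are hypothetical. Both versions produce the correct count; the first uses an atomic add, the second protects a plain variable with a mutex.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Each thread increments a shared atomic counter; no lock is needed.
long count_with_atomic(unsigned num_threads, long increments) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments] {
            for (long i = 0; i < increments; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return counter.load();
}

// Each thread increments a plain counter inside a mutex-protected section.
long count_with_mutex(unsigned num_threads, long increments) {
    long counter = 0;
    std::mutex guard;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, &guard, increments] {
            for (long i = 0; i < increments; ++i) {
                std::lock_guard<std::mutex> lock(guard);
                ++counter;
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return counter;
}
```

Timing the two functions under contention is a simple way to check the "much slower" claim on a particular machine; in either case, giving each thread a private counter and combining at the end avoids the shared write altogether.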

26 Keeping Memory Consistent
     Data in a cache may not reflect the main memory if multiple cores are allowed to write to it
   MESI Protocol
     Each processor core employs memory bus snooping
     Each cache line is in one of the following states
       Modified, Exclusive, Shared, Invalid
     Sometimes extended with Owned, Forward, or Recent states

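27 False Sharing
     Caches store data in cache lines (64B blocks)
     False sharing occurs when two cores operate on two distinct variables located on the same cache line
       Every write by one core invalidates the line in the other core's cache
   [Diagram: a 64B cache line holding variable 1 (used by core 1) and variable 2 (used by core 2)]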
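
One common way to avoid false sharing is to align each per-thread variable to its own cache line. The sketch below assumes the 64B line size mentioned on the slide; the names `PaddedCounter` and `count_on_two_threads` are hypothetical.

```cpp
#include <cstddef>
#include <thread>

// Assumed cache line size (64B, as stated on the slide).
constexpr std::size_t kCacheLineSize = 64;

// alignas pads the struct to a full cache line, so two neighbouring
// counters can never share one line.
struct alignas(kCacheLineSize) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLineSize,
              "each counter should occupy exactly one cache line");

struct CounterPair {
    long first;
    long second;
};

// Two threads, each incrementing its own counter on its own cache line.
CounterPair count_on_two_threads(long iterations) {
    PaddedCounter counters[2];
    auto work = [iterations](PaddedCounter& c) {
        for (long i = 0; i < iterations; ++i) {
            ++c.value;
        }
    };
    std::thread t1([&] { work(counters[0]); });
    std::thread t2([&] { work(counters[1]); });
    t1.join();
    t2.join();
    return CounterPair{counters[0].value, counters[1].value};
}
```

The cost of this design is memory: each 8-byte counter now occupies 64 bytes, so padding is best reserved for the few variables that are written heavily by different threads.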

28 Non-Uniform Memory Access/Architecture
     Each core has its own memory
     However, all cores can access all the memory
       Different parts of memory have different latencies
     Multiple cores sharing one memory bus
   [Diagram: 4x4 NUMA]

29 Pthreads
     Low-level library (start/wait-for a thread)
     Basic synchronization primitives
   OpenMP
     C/C++ language extension for parallelization
     Set of compiler pragma directives
     Easy to use for simple cases:
       #pragma omp parallel for
       for (int i = 0; i < N; ++i) { ... }
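
A complete, minimal instance of the `parallel for` pattern, assuming a loop whose iterations are independent (the function name `square_all` is hypothetical). When OpenMP is enabled in the compiler (for example `-fopenmp` with GCC), the iterations are divided among threads; without it the pragma is ignored and the loop runs sequentially with the same result.

```cpp
#include <vector>

// Squares every element in place. Iterations do not depend on each other,
// so they can be distributed among threads without synchronization.
void square_all(std::vector<double>& values) {
    const long n = static_cast<long>(values.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        values[i] = values[i] * values[i];
    }
}
```

A signed loop index is used because older OpenMP versions require one in the loop header.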

30 Threading Building Blocks
     C++ open source parallelization library from Intel
   Task scheduler
     Manages the thread pool
     Employs task stealing and nested task spawning
   Parallel templates
     Parallel-for, parallel-reduce, parallel-scan, ...
     Built on top of the scheduler
   Concurrent data structures
     Concurrent vector and queue with atomic expansion
     Concurrent hash map with fine-grained locking

31 Profiler
     An application that measures program speed and some (hardware) events
     Identifies the slowest parts of the program
       And usually provides us with reasons for the slowness
   Choose The Profiler
     Windows
       Intel VTune Amplifier
     Linux
       gprof - periodically checks the profiled application
       oprofile - uses HW monitoring through the kernel
   [VTune demo]
