Understanding Performance of Shared Memory Programs. Kenjiro Taura, University of Tokyo



2 Today's topics: (1) Introduction; (2) Artificial bottlenecks; (3) Understanding shared memory performance.

3 Last week we saw OpenMP and TBB, two shared memory programming models. Data are (or can be made) shared among concurrent activities, i.e. values assigned by one are visible to others.
OpenMP:
    int a[n];
    #pragma omp parallel for
    for (i = 0; i < n; i++) a[i] = ...;
    ... = a[i];
TBB:
    int a[n];
    parallel_for(0, n, [=,&a] (int i) { a[i] = ...; });
    ... = a[i];

4 Shared memory: on shared memory machines, an assignment like a[i] = ... is done by a mere store instruction. No extra instructions are necessary to move data from one thread to another, so it is efficient and easy to program. [Figure: the array a[0], a[1], a[2], ..., a[n-1]]

5 So, is communication free? No. How does its cost manifest? Answer: extra cache misses.

6 Today's goals: learn the reasons why shared memory parallel programs may not speed up perfectly: (1) some artificial bottlenecks you must be aware of; (2) unavoidable/essential reasons.

7 Some artificial bottlenecks you must be aware of: libraries that serialize; page faults (demand paging); threads on wrong cores; C++ constructors.

8 Libraries that serialize: Linux's standard malloc function scales very poorly. C++ new is similar (it ends up calling malloc). Let's quantify its performance.
OpenMP:
    #pragma omp parallel for
    for (i = 0; i < n; i++) { a[i] = malloc(64); }
TBB:
    parallel_for(0, n, [=] (int i) { a[i] = malloc(64); });
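A minimal sketch of how the measurement on this slide could be set up, assuming a compiler that honors the OpenMP pragma (without OpenMP the loop simply runs serially). The function name `malloc_throughput` and the timing method are assumptions made for illustration, not the code used for the plots.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper: allocates n blocks of `size` bytes inside a
// parallel loop and returns the observed allocations per second.
double malloc_throughput(long n, std::size_t size) {
  assert(n > 0 && size > 0);
  std::vector<void*> a(n, nullptr);
  auto t0 = std::chrono::steady_clock::now();
  #pragma omp parallel for
  for (long i = 0; i < n; i++) {
    a[i] = std::malloc(size);
    if (a[i] != nullptr) static_cast<char*>(a[i])[0] = 1;  // use the block
  }
  auto t1 = std::chrono::steady_clock::now();
  long ok = 0;
  for (long i = 0; i < n; i++) {   // count successes and release memory
    if (a[i] != nullptr) ok++;
    std::free(a[i]);
  }
  double sec = std::chrono::duration<double>(t1 - t0).count();
  return sec > 0.0 ? ok / sec : 0.0;
}
```

Calling `malloc_throughput(n, 64)` under different OMP_NUM_THREADS settings gives one point per thread count. Only the allocation loop is timed; the free loop is deliberately outside the timed region.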

9 Experiments: the experiments in these slides were conducted on a server with Intel Xeon E7540 (Nehalem) processors: 4 chips, 24 physical cores, 48 hardware threads.

10 Scalability of standard malloc. [Plot, iteration 0: total number of malloc(64) calls per second (64 byte allocs/sec) against the number of threads; series: omp scalable=0, tbb]

11 TBB's scalable malloc (7.3): TBB supports a replacement scalable allocator.
    #include <tbb/scalable_allocator.h>
    ...
    tbb::parallel_for(0, n, [=] (int i) { a[i] = scalable_malloc(b); });
    tbb::parallel_for(0, n, [=] (int i) { scalable_free(a[i]); });

12 Scalability of TBB's scalable malloc. [Plot, iteration 0: 64 byte allocs/sec against the number of threads; series: omp scalable=0, tbb scalable=0, tbb with the scalable allocator]

13 One more caveat: performance is much better when you reuse memory once freed.
    for (i = 0; i < n; i++) a[i] = malloc(64);
    for (i = 0; i < n; i++) free(a[i]);
    for (i = 0; i < n; i++) a[i] = malloc(64);
[Plot, iteration 1: 64 byte allocs/sec; series: omp scalable=0, tbb scalable=0, tbb with the scalable allocator]

14 What should you do with malloc? Avoid malloc or move it outside the parallel region. Go ahead if it's easy, but it's not always: you may not know how much you need (the very reason you need malloc), and other libraries may call malloc (e.g. strdup). Any dynamic memory allocation may have a similar problem (e.g. std::vector). In real life, it may be OK to be slow in the first iteration. Bottom line: know what you are doing/measuring.
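One way to "move it outside the parallel region", sketched under the assumption that the total amount of memory is known in advance (which, as the slide says, is not always the case). `Pool` and `carve_blocks` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pool: a single allocation made before the parallel region.
// Slot i plays the role of the block that iteration i would otherwise
// have obtained from malloc(block_size).
struct Pool {
  std::size_t block_size;
  std::vector<unsigned char> storage;
  Pool(std::size_t n, std::size_t bs) : block_size(bs), storage(n * bs) {}
  unsigned char* slot(std::size_t i) { return storage.data() + i * block_size; }
};

// Hands out n blocks in parallel without calling the allocator
// inside the loop.
std::vector<unsigned char*> carve_blocks(Pool& pool, long n) {
  assert(n >= 0);
  assert(static_cast<std::size_t>(n) * pool.block_size <= pool.storage.size());
  std::vector<unsigned char*> a(n);
  #pragma omp parallel for
  for (long i = 0; i < n; i++) {
    a[i] = pool.slot(i);  // pointer arithmetic only
  }
  return a;
}
```

Note that `std::vector<unsigned char>` zero-fills its storage serially when the pool is constructed, so the cost does not vanish; it moves to a place where it is paid once and is easy to measure separately.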

15 Page faults (demand paging): you experience a page fault when you touch a page for the first time in a process. Note: malloc may return pages that have never been touched, or pages returned by a previous free. Page fault handling in the Linux OS appears not very scalable.
OpenMP:
    int * a = (int *)malloc(sizeof(int) * n);
    #pragma omp parallel for
    for (i = 0; i < n; i++) a[i] = i * x;
TBB:
    int * a = (int *)malloc(sizeof(int) * n);
    parallel_for(0, n, [=] (int i) { a[i] = i * x; });

16 Throughput of page faults: here we are seeing the throughput of the OS's page fault (demand paging) handling. [Plot, write iteration 0: MB/sec; series: omp and tbb, each for three array sizes, one of them labeled 16.0 MB]

17 Throughput of writes: performance is completely different the second time; there we are seeing the genuine write throughput of the machine.
OpenMP:
    int * a = (int *)malloc(sizeof(int) * n);
    #pragma omp parallel for
    for (i = 0; i < n; i++) a[i] = i * x;
    #pragma omp parallel for
    for (i = 0; i < n; i++) a[i] = i * x;
TBB:
    int * a = (int *)malloc(sizeof(int) * n);
    parallel_for(0, n, [=] (int i) { a[i] = i * x; });
    parallel_for(0, n, [=] (int i) { a[i] = i * x; });
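The two-pass experiment can be sketched as a small timing harness. This is an illustration under assumptions: `TouchTimes`, `timed_fill` and `time_two_passes` are hypothetical names, and whether the first pass really pays for page faults depends on whether malloc hands back untouched pages, as noted on slide 15.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>

struct TouchTimes {
  double first_sec;   // first pass: may include page-fault handling
  double second_sec;  // second pass: same pages written again
  int last_value;     // a[n-1] after both passes, read back for checking
};

// Writes a[i] = i * x for all i and returns the elapsed seconds.
// n * x is assumed to fit in an int.
static double timed_fill(int* a, long n, int x) {
  auto t0 = std::chrono::steady_clock::now();
  #pragma omp parallel for
  for (long i = 0; i < n; i++) {
    a[i] = static_cast<int>(i) * x;
  }
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}

// Allocates n ints, fills them twice, and reports both timings.
TouchTimes time_two_passes(long n, int x) {
  assert(n > 0);
  int* a = static_cast<int*>(std::malloc(sizeof(int) * n));
  assert(a != nullptr);
  TouchTimes t;
  t.first_sec = timed_fill(a, n, x);
  t.second_sec = timed_fill(a, n, x);
  t.last_value = a[n - 1];
  std::free(a);
  return t;
}
```

Dividing `sizeof(int) * n` by each of the two times gives the MB/sec figures for "iteration 0" and the later iteration; no particular ratio between them is guaranteed.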

18 Throughput of writes. [Plot, a later write iteration: MB/sec; series: omp and tbb, each for three array sizes, one of them labeled 16.0 MB]

19 What should you do with page faults? You cannot avoid one page fault per page. Blame the OS or demand paging? Please contribute before complaining (it exists for good reasons; I bet it won't be easy). The conclusion is similar to the one for malloc: know what you are doing/measuring.

20 Threads on wrong cores: when you use p underlying threads (e.g. set OMP_NUM_THREADS=p for OpenMP or task_scheduler_init(p) in TBB), you probably want to use p CPU cores. When you launch p threads and p < the number of CPU cores, the OS is generally smart enough to run each thread on a distinct CPU core. It depends on the OS and the language implementation, however.

21 Threads on wrong cores, things to worry about: the OS may not be very quick; the language may or may not pin each thread on a separate core; in hyperthreaded CPUs, some virtual cores (aka hardware threads) share a physical core, and the OS/language may not be careful enough about it. On Linux, you may check where threads are running with the sched_getcpu() system call; you may control where threads can run with the sched_setaffinity() system call or the taskset/numactl commands.
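A Linux-only sketch of the two calls named on this slide, assuming glibc and a compiler that defines _GNU_SOURCE for C++ (g++ does by default). The wrapper names `first_allowed_cpu`, `pin_to_cpu` and `current_cpu` are hypothetical; only `sched_getaffinity`, `sched_setaffinity`, `sched_getcpu` and the `CPU_*` macros are real interfaces.

```cpp
#include <cassert>
#include <sched.h>

// Returns the lowest-numbered CPU the calling thread may run on, or -1.
int first_allowed_cpu() {
  cpu_set_t set;
  CPU_ZERO(&set);
  if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
  for (int c = 0; c < CPU_SETSIZE; c++) {
    if (CPU_ISSET(c, &set)) return c;
  }
  return -1;
}

// Restricts the calling thread (pid 0 means "myself") to a single CPU.
// Returns true on success.
bool pin_to_cpu(int cpu) {
  assert(cpu >= 0 && cpu < CPU_SETSIZE);
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Reports the CPU the calling thread is running on right now.
int current_cpu() { return sched_getcpu(); }
```

In an OpenMP program each thread could call `pin_to_cpu` with a CPU derived from its thread number at the start of the parallel region; which CPU numbers belong to the same physical core is machine specific and has to be looked up separately.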

22 C++ constructors: say we have the following C++ program.
    T a[n];
    T * b = new T[n];
    std::vector<T> v(n);
Each of the above calls T's constructor T() n times, when it is defined ... and there is no way to parallelize it! C doesn't provide such automatic initialization: initialization is on us, and so is parallelization!
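One possible workaround, not stated on the slide, is to take the C route from within C++: obtain raw storage without running any constructor, then construct the elements in a parallel loop with placement new. The element type `T` below and the function names are stand-ins chosen for this sketch.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

// Stand-in element type with a user-defined default constructor.
struct T {
  double x;
  T() : x(1.0) {}
};

// Allocates raw storage for n elements (no constructor runs here),
// then constructs each element in place inside a parallel loop.
T* make_array_parallel(long n) {
  assert(n > 0);
  T* a = static_cast<T*>(std::malloc(sizeof(T) * n));
  if (a == nullptr) return nullptr;
  #pragma omp parallel for
  for (long i = 0; i < n; i++) {
    new (&a[i]) T();  // placement new: run T() on existing storage
  }
  return a;
}

// Destroys the elements and releases the storage obtained above.
void destroy_array(T* a, long n) {
  for (long i = 0; i < n; i++) a[i].~T();
  std::free(a);
}
```

The price is manual lifetime management: the array must be released with `destroy_array`, never with `delete[]`.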

23 Shared memory: let's turn our attention to more fundamental reasons, which stem from unavoidable communication due to parallelization. What is the cost of communication between threads? [Figure: the array a[0], a[1], a[2], ..., a[n-1]]


24 Communication = cache miss: each processor maintains its own cache, a memory faster and smaller than main memory. Many memory accesses hit the cache and do not access main memory. If a processor reads a value written by another processor, a cache miss is unavoidable. [Figure: an abstract model, in which a[5] = 3.14 by one processor and p = a[5] by another go through memory, and a more accurate model, in which each access hits or misses the processor's own cache]

25 Measuring the cost of communication: n workers alternately update the same variable (a shared memory version of the ping-pong benchmark).
    next = worker_id;          // 0, 1, 2, ...
    while (1) {
      while (*a < next) ;      // wait for my turn
      *a = next + 1;           // my turn, update it
      next += n;
    }
Whenever a worker gets a new value, it must experience a cache miss.
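A bounded, runnable variant of this loop, written as a sketch with `std::thread` and `std::atomic`. These are assumptions of this example: the slide does not say how the workers are created or how the shared variable is declared. The function name `ping_pong_ns_per_update` and the `rounds` limit are likewise hypothetical additions so that the loop terminates.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// n workers take turns advancing one shared counter, `rounds` times each.
// Returns the average nanoseconds per update; the final counter value is
// stored in *final_value when that pointer is not null.
double ping_pong_ns_per_update(int n, long rounds, long* final_value) {
  assert(n > 0 && rounds > 0);
  std::atomic<long> a(0);
  auto worker = [&a, n, rounds](int worker_id) {
    long next = worker_id;  // 0, 1, 2, ...
    for (long r = 0; r < rounds; r++) {
      while (a.load(std::memory_order_acquire) < next) { }  // wait for my turn
      a.store(next + 1, std::memory_order_release);         // my turn, update it
      next += n;
    }
  };
  auto t0 = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int w = 0; w < n; w++) threads.emplace_back(worker, w);
  for (auto& t : threads) t.join();
  auto t1 = std::chrono::steady_clock::now();
  if (final_value != nullptr) *final_value = a.load();
  double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
  return ns / (static_cast<double>(n) * rounds);
}
```

The measured time includes thread creation and joining, so `rounds` should be large for the per-update figure to be meaningful. The sketch does not pin threads to cores, which the slide's two-worker results depend on.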

26 Cost of communication:
    1 worker: 2 ns / update
    2 workers (worker 0 on CPU #0), latency by CPU for worker 1:
        4, 8, 12, 16, 20, 28, 32, 36, 40, ...: ... ns
        others: 500 ns
    48 workers: 1500 ns
We will explain the results of the 2 workers case later. For now, notice the striking difference between 1 worker (no cache misses) and 2 workers (a cache miss per update).

27 ABC's of caches: a cache is a memory far smaller than main memory (e.g. 16KB, 2MB, 20MB). The cache is divided into fixed size lines (e.g. 64 bytes), and each line holds data at a consecutive address range. Data move between caches and memory in units of a single line. A processor generally wants to keep recently used lines in caches and evict not recently used lines to memory (LRU replacement policy). True LRU, which tracks which line is oldest, is costly, so caches typically approximate LRU. [Figure: a cache made of 64-byte cache lines]

28 Associativity of caches: fully associative: data can occupy any line in the cache. Direct mapped: data have one designated seat (set), determined by their address. K-way set associative: data have K designated seats, determined by their address. Direct mapped is the same as 1-way set associative; fully associative is the same as set associative with a single set that spans all lines.

29 Cache organization example: a 512KB, 4-way set associative cache with a line size of 64 bytes has 2K sets × 4 lines/set × 64 bytes/line = 512KB (8K lines = 2K sets × 4 lines/set). Given an address to bring into the cache: take its bits 6-16 (11 bits) to determine which set it can occupy; check if it's in the cache; if missed, evict the oldest one of the 4 lines in the set. [Figure: an address with the set field set(6:16), and the cache as sets of 64-byte cache lines]
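The address-to-set mapping of this example can be written out directly. The constant and function names below are chosen for this sketch; the numbers are the ones on the slide.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kLineBytes  = 64;          // 2^6: bits 0-5 are the offset
constexpr std::uint64_t kWays       = 4;           // lines per set
constexpr std::uint64_t kCacheBytes = 512 * 1024;  // 512KB
constexpr std::uint64_t kSets = kCacheBytes / (kLineBytes * kWays);

static_assert(kSets == 2048, "2K sets, indexed by 11 bits");

// Set that an address can occupy: bits 6-16 of the address.
std::uint64_t set_index(std::uint64_t addr) {
  return (addr / kLineBytes) % kSets;
}

// Byte position of the address within its cache line: bits 0-5.
std::uint64_t line_offset(std::uint64_t addr) {
  return addr % kLineBytes;
}
```

Addresses that differ by a multiple of 64 × 2048 bytes = 128KB map to the same set, so repeatedly touching five or more such lines cannot keep them all in this 4-way cache.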

30 Four reasons for cache misses: when you access x and miss the cache, it is one of the following. Compulsory: x is accessed for the first time. Capacity: x has not been accessed for a long time. Conflict: x has been evicted by data occupying the same set. Communication: (read miss) x has been modified (invalidated) by another CPU, or (write miss) x has been read (shared) by another CPU and you are trying to modify it.

31 A more detailed picture of a microprocessor. Today's typical microprocessor: hardware threads within a core within a chip; a hierarchy of caches (L1/L2/L3); L1/L2 shared inside a core (among hardware threads); L3/main memory shared inside a chip (among cores). [Figure labels: hardware thread (virtual core, CPU); (physical) core; chip (socket, node, CPU); L1 cache; L2 cache; L3 cache; memory controller]

32 A more detailed picture of a single box: it consists of multiple such chips. [Figure labels: hardware thread (virtual core, CPU); (physical) core; chip (socket, node, CPU); L1 cache; L2 cache; L3 cache; memory controller; interconnect]

33 An example node organization: Intel Xeon E7540 (Nehalem) processor, with 2 hardware threads/core × 6 cores/chip × 4 chips/box. Caches:
    Level   line size   capacity   ways   private/shared
    L1      ...         ... KB     8      private to core
    L2      ...         ... KB     8      private to core
    L3      ...         ... MB     24     shared among cores

34 Implications: the cost of a cache miss depends on who is serving the line: L2 cache; L3 cache; main memory attached to the same chip (local memory); main memory or cache in other chips (remote memory). When you spill data to main memory, you are competing with other cores on the same chip.

35 A general principle: bring data into the cache and compute as much as possible with that data. An important quantity is bytes-per-flop = how much data (bytes) you need to do a unit of computation (flop), or its reciprocal, compute/data. A general high level principle: keep your compute/data ratio high.

36 Compute/data ratio, why is it important? It generally tells you how many times the algorithm can potentially reuse the same data. If it's small, there is only so much data reuse: a large computation will need accordingly large data and thus accordingly many cache misses. If it's large, you may be able to do lots of computation without many memory accesses; such an algorithm is efficient even in serial and efficiently decomposable into threads.

37 Compute/data ratio, examples:
    algorithm          compute         data            compute/data
    graph traversal    O(|V| + |E|)    O(|V| + |E|)    O(1)
    k-means (1 step)   O(nk)           O(n + k)        O(k)
    dense mm           O(n^3)          O(n^2)          O(n)

38 Our running example. Problem: you are given an array double a[n]; version B: compute $\min_{0 \le i < j < n} (a[i] - a[j])^2$. Compute: $O(n^2)$; data: $O(n)$; compute/data: $O(n)$.

39 Notes on setting: to illustrate the point, a single element is artificially made 128 bytes. In essence, we are simulating a machine with very small caches (1/16 of the real size).

40 Serial loop:
    double min_distance_simple(point * a, int n) {
      double min_d2 = 100;
      int i, j;
      for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
          double d = a[i].x - a[j].x;
          double d2 = d * d;
          if (d2 < min_d2) min_d2 = d2;
        }
      }
      return min_d2;
    }

41 compute/data: O(n) means the entire algorithm reuses the same data many (O(n)) times. Yet the access pattern of the doubly nested loop does not reuse data efficiently: if data > cache, there is a cache miss on each line. Problem: how to do the computation while retaining its O(data size) compute/data ratio? Answer: recursion.

42 Recursive decomposition of computation (illustration). [Figure: panels labeled "recursive" and "naive loop"]

43 Recursive decomposition of computation (code), in essence:
    double min_distance_rec(point * a, int m, point * b, int n) {
      if (m is small) {
        // a similar loop over a[0:m] and b[0:n]
        return min_d2;
      } else {
        // (the result is the minimum of the four recursive results)
        min_distance_rec(a,       m/2,   b,       n/2);
        min_distance_rec(a,       m/2,   b + n/2, n - n/2);
        min_distance_rec(a + m/2, m - m/2, b,       n/2);
        min_distance_rec(a + m/2, m - m/2, b + n/2, n - n/2);
      }
    }
When a == b, we omit the second or the third recursion.
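A runnable serial version of this recursion, written as a sketch. The following are assumptions of this example rather than details given on the slide: the cutoff `kThreshold`, the layout of `point` (one double padded to 128 bytes, following the setting on slide 39), the use of infinity as the initial minimum, and the choice to omit the third recursion when a == b.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

// One coordinate padded to 128 bytes per element.
struct point { double x; char pad[120]; };

const int kThreshold = 32;  // assumed cutoff for "m is small"

// Minimum of (a[i].x - b[j].x)^2 over all pairs. When a == b the two
// ranges are the same array and only pairs with i < j are considered.
double min_distance_rec(const point* a, int m, const point* b, int n) {
  const double inf = std::numeric_limits<double>::infinity();
  assert(a != b || m == n);
  if (m <= 0 || n <= 0) return inf;
  if (m <= kThreshold || n <= kThreshold) {
    double min_d2 = inf;
    for (int i = 0; i < m; i++) {
      for (int j = (a == b ? i + 1 : 0); j < n; j++) {
        double d = a[i].x - b[j].x;
        min_d2 = std::min(min_d2, d * d);
      }
    }
    return min_d2;
  }
  int mh = m / 2, nh = n / 2;
  double r = min_distance_rec(a, mh, b, nh);
  r = std::min(r, min_distance_rec(a, mh, b + nh, n - nh));
  if (a != b) {  // for a == b this quadrant mirrors the previous one
    r = std::min(r, min_distance_rec(a + mh, m - mh, b, nh));
  }
  r = std::min(r, min_distance_rec(a + mh, m - mh, b + nh, n - nh));
  return r;
}
```

The whole problem is solved by `min_distance_rec(a, n, a, n)`. Each recursive call works on sub-ranges half the size, so below some depth both sub-ranges fit in cache and are reused many times before being evicted; the four calls are also independent, which is what makes the recursion a natural unit for task parallelism.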

44 Scalability: 128 bytes/element × 50K elements = 6.4MB (does not fit in L2 cache, fits in L3 cache). Three systems: OpenMP, TBB, and MassiveThreads (whose interface is identical to TBB's). [Plot: throughput (M pair/sec); series: mthtask, omp, tbb]

45 Results: recursion is twice as fast as the loop. Both MassiveThreads and OpenMP scaled well. OpenMP: the L3 cache may be enough to sustain the observed performance (need to quantify). TBB scaled poorly, presumably because of some artifacts we have not yet uncovered.
