Multicore Programming


1 Multicore Programming: Parallel Hardware and Performance. 8 Nov 00 (Part ). Peter Sewell, Jaroslav Ševčík, Tim Harris

2 Merge sort
6MB input (-bit integers)
Recurse(left), Recurse(right): ~98% execution time
Merge to scratch array, copy back to input array: ~% execution time
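The four steps named on this slide can be sketched as follows. This is a minimal Python sketch, not the benchmark code behind the measurements; the names `merge_sort` and `scratch` are illustrative assumptions.

```python
def merge_sort(data, scratch=None, lo=0, hi=None):
    """Sort data[lo:hi] in place, using a scratch array of the same length."""
    if scratch is None:
        scratch = [None] * len(data)
    if hi is None:
        hi = len(data)
    if hi - lo <= 1:
        return data
    mid = (lo + hi) // 2
    merge_sort(data, scratch, lo, mid)   # Recurse(left)
    merge_sort(data, scratch, mid, hi)   # Recurse(right)
    # Merge the two sorted halves into the scratch array.
    i, j, k = lo, mid, lo
    while i < mid and j < hi:
        if data[i] <= data[j]:
            scratch[k] = data[i]
            i += 1
        else:
            scratch[k] = data[j]
            j += 1
        k += 1
    # One of the two halves is exhausted; append the remainder of the other.
    scratch[k:hi] = data[i:mid] + data[j:hi]
    # Copy back to the input array.
    data[lo:hi] = scratch[lo:hi]
    return data
```

For example, `merge_sort([5, 2, 9, 1, 5, 6])` sorts the list in place and returns it.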

3 Merge sort, dual-core
6MB input (-bit integers)
Recurse(left), Recurse(right)
Merge to scratch array
Copy back to input array
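A sketch of the dual-core structure, assuming the two recursive calls run concurrently and the final merge stays sequential. The function name `dual_merge_sort` is hypothetical, and the built-in `sorted` stands in for each recursive call. In CPython the global interpreter lock keeps these threads from running Python code in parallel, so this shows the task structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def dual_merge_sort(data):
    """Sort the two halves as two concurrent tasks, then merge sequentially."""
    mid = len(data) // 2
    left, right = data[:mid], data[mid:]
    with ThreadPoolExecutor(max_workers=2) as pool:
        left_future = pool.submit(sorted, left)     # Recurse(left)
        right_future = pool.submit(sorted, right)   # Recurse(right)
        sorted_left = left_future.result()
        sorted_right = right_future.result()
    scratch = list(merge(sorted_left, sorted_right))  # merge to scratch array
    data[:] = scratch                                 # copy back to input array
    return data
```

The sequential merge and copy-back are the part that does not get faster with the second core.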


4 [Plot: T700 dual-core laptop. Axis labels: throughput per core (elements per ms); wall-clock execution time (ms).]

5 [Plot: AMD Phenom. Axis labels: throughput per core (elements per ms); wall-clock execution time (ms).]

6 The power wall
Moore's law: area of transistor is about 50% smaller each generation
  A given chip area accommodates twice as many transistors
$P_{dyn} = \alpha f V^2$
  Shrinking process technology (constant field scaling) allows reducing V to partially counteract increasing f
  V cannot be reduced arbitrarily
    Halving V more than halves max f (for a given transistor)
    Physical limits
$P_{leak} = V (I_{sub} + I_{ox})$
  Reduce $I_{sub}$ (sub-threshold leakage): turn off component, increase threshold voltage (reduces max f)
  Reduce $I_{ox}$ (gate-oxide leakage): increase oxide thickness (but it needs to decrease with process scale)
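The trade-off between f and V can be made concrete with a small calculation. This sketch assumes the usual quadratic dependence of dynamic power on supply voltage; the function name and the scale factors are illustrative, not values from the slides.

```python
def relative_dynamic_power(f_scale, v_scale, alpha_scale=1.0):
    """Dynamic power relative to a baseline, assuming P_dyn ~ alpha * f * V^2."""
    return alpha_scale * f_scale * v_scale ** 2


# Doubling the clock at constant voltage doubles dynamic power.
double_clock = relative_dynamic_power(2.0, 1.0)
# Halving the voltage at constant clock cuts dynamic power to a quarter.
half_voltage = relative_dynamic_power(1.0, 0.5)
```

Because power falls with the square of V, even a modest voltage reduction can pay for a higher clock, which is why the limit on lowering V matters so much.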

7 The memory wall
[Plot: operations per µs, CPU against memory.] Annotation: MHz CPU clock, 00ns to access memory in 98

8 The memory wall
[Plot: operations per µs, CPU against memory.] Annotation: CPU cycles per memory access
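The "cycles per memory access" figure follows from multiplying memory latency by clock frequency. A minimal sketch, with a hypothetical function name and purely illustrative numbers (the values on the slide are not recoverable here):

```python
def cycles_per_memory_access(clock_mhz, latency_ns):
    """CPU cycles that elapse during one memory access.

    cycles = latency (s) * frequency (Hz) = latency_ns * clock_mhz / 1000
    """
    return clock_mhz * latency_ns / 1000.0


# Illustrative only: a 3000 MHz core waiting 100 ns for memory.
stall_cycles = cycles_per_memory_access(3000, 100)
```

As the clock rises faster than latency falls, this number grows, which is the memory wall in one line of arithmetic.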

9 The ILP wall
ILP = Instruction level parallelism
Implicit parallelism between instructions in a single thread
Identified by the hardware
  Speculate past memory accesses
  Speculate past control transfer
Diminishing returns

10 Power wall + ILP wall + memory wall = brick wall
Power wall means we can't just clock processors faster any longer
Memory wall means that many workloads' perf is dominated by memory access times
ILP wall means we can't find extra work to keep functional units busy while waiting for memory accesses

11 Why parallelism? Amdahl's law. Why asymmetric performance? Parallel algorithms. Multi-processing hardware.

12 Amdahl's law
Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with n cores, how many cores do you need to use to get a x speed-up on the overall algorithm?
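The question can be answered by solving Amdahl's law for the core count. The sketch below assumes the standard form, speedup = 1 / ((1 - f) + f / c), and uses exact fractions to avoid rounding at the boundary. The function name and the target speedups in the examples are illustrative assumptions, since the target on the slide is not legible in this copy.

```python
from fractions import Fraction
from math import ceil


def cores_needed(f, target):
    """Smallest core count c with 1 / ((1 - f) + f / c) >= target, or None."""
    f = Fraction(f)
    target = Fraction(target)
    if target <= 1:
        return 1
    # Rearranged: f / c <= 1 / target - (1 - f)
    slack = 1 / target - (1 - f)
    if slack <= 0:
        return None  # target is at or beyond the limit 1 / (1 - f)
    return max(1, ceil(f / slack))


two_x = cores_needed(Fraction(7, 10), 2)     # 4 cores
three_x = cores_needed(Fraction(7, 10), 3)   # 21 cores
four_x = cores_needed(Fraction(7, 10), 4)    # None: not reachable with f = 70%
```

The jump from 4 cores for 2x to 21 cores for 3x shows how quickly the sequential 30% comes to dominate.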

13 Amdahl's law, f=70%
[Plot: speedup against #cores. Series: desired x speedup; speedup achieved (perfect scaling on 70%).]

14 Amdahl's law, f=70%
$\mathrm{MaxSpeedup}(f, c) = \dfrac{1}{(1 - f) + \dfrac{f}{c}}$
f = fraction of code speedup applies to
c = number of cores used

15 Amdahl's law, f=70%
[Plot: speedup against #cores. Series: desired x speedup; speedup achieved (perfect scaling on 70%).]
Limit as $c \to \infty$: $1/(1-f) = 3.33$
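A short numerical check of the limit, assuming the standard form of Amdahl's law. The function names `max_speedup` and `speedup_limit` are illustrative.

```python
def max_speedup(f, c):
    """Amdahl's law: f is the parallelisable fraction, c the number of cores."""
    return 1.0 / ((1.0 - f) + f / c)


def speedup_limit(f):
    """Limit of max_speedup(f, c) as c grows without bound (requires f < 1)."""
    return 1.0 / (1.0 - f)


# Speedup for f = 70% at a few core counts, approaching the limit from below.
curve = [(c, max_speedup(0.7, c)) for c in (1, 2, 4, 8, 16, 1024)]
limit = speedup_limit(0.7)
```

Every point on the curve stays below `limit`, however many cores are added.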


16 Amdahl's law, f=0%
[Plot: speedup against #cores. Series: speedup achieved with perfect scaling; Amdahl's law limit, just .x.]

17 Amdahl's law, f=98%
[Plot: speedup against #cores.]

18 Amdahl's law & multi-core
Suppose that the same h/w budget (space or power) can make us:


19 Perf of big & small cores
[Plot: core perf (relative to big core) against resources dedicated to core, with total perf annotated for each configuration.]
Assumption: perf = α resource
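A sketch of how total performance changes when a fixed budget is split into more, smaller cores. The exact formula on the slide is incomplete in this copy, so the square-root relation below is an assumption (it is the model commonly used in this kind of analysis), and the function names are hypothetical.

```python
from math import sqrt


def core_perf(resource):
    """ASSUMPTION: perf = sqrt(resource), with the big core at resource = 1."""
    return sqrt(resource)


def total_perf(n_cores):
    """Aggregate perf when a budget of 1 is split evenly across n_cores."""
    return n_cores * core_perf(1.0 / n_cores)


one_big = total_perf(1)          # 1.0
four_medium = total_perf(4)      # each core at half speed, 2.0 in total
sixteen_small = total_perf(16)   # each core at quarter speed, 4.0 in total
```

Under this assumption, smaller cores give more aggregate throughput but each one runs sequential code more slowly.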

20 Amdahl's law, f=98%
[Plot: perf (relative to big core) against #cores. Series: big; medium; 6 small.]

21 Amdahl's law, f=7%
[Plot: perf (relative to big core) against #cores. Series: big; medium; small.]

22 Amdahl's law, f=%
[Plot: perf (relative to big core) against #cores. Series: big; medium; 6 small.]
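The comparison in these plots can be sketched by combining Amdahl's law with a per-core performance model. The square-root model, the core counts, and the values of f used below are assumptions for illustration, not figures read from the slides.

```python
from math import sqrt


def symmetric_speedup(f, n_cores):
    """Perf relative to one big core for a chip of n_cores equal cores.

    ASSUMPTION: the budget of one big core is split evenly, and each core
    runs at sqrt(1 / n_cores) of the big core's speed.
    """
    perf = sqrt(1.0 / n_cores)
    serial_time = (1.0 - f) / perf           # sequential part on one core
    parallel_time = f / (perf * n_cores)     # parallel part on all cores
    return 1.0 / (serial_time + parallel_time)


# Highly parallel workload: many small cores win.
high_f = [symmetric_speedup(0.98, n) for n in (1, 4, 16)]
# Mostly sequential workload: the single big core wins.
low_f = [symmetric_speedup(0.10, n) for n in (1, 4, 16)]
```

Which configuration is best depends on f, which is the point the three plots make.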

23 Why parallelism? Amdahl's law. Why asymmetric performance? Parallel algorithms. Multi-processing hardware.

24 Asymmetric chips

25 Amdahl's law, f=7%
[Plot: perf (relative to big core) against #cores. Series: big; medium; 6 small.]

26 Amdahl's law, f=%
[Plot: perf (relative to big core) against #cores. Series: big; medium; small.]

27 Amdahl's law, f=98%
[Plot: perf (relative to big core) against #cores. Series: big; medium + 6 small.]

28 Amdahl's law, f=98%
[Plot: speedup (relative to big core) against #cores. Series: small.]


29 Amdahl's law, f=98%
[Plot: speedup (relative to big core) against #cores. Series: small.]
Leave larger core idle in parallel section

30 Why parallelism? Amdahl's law. Why asymmetric performance? Parallel algorithms. Multi-processing hardware.
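An asymmetric chip runs the sequential part on its larger core and the parallel part on the small cores, with or without the larger core helping. The sketch below assumes a square-root performance model and a hypothetical configuration of one medium core (a quarter of the budget) plus 12 small cores (a sixteenth each); neither is taken from the slides.

```python
from math import sqrt


def asymmetric_speedup(f, big_resource, n_small, small_resource,
                       use_big_in_parallel=True):
    """Perf relative to a core using the whole budget (resource = 1).

    ASSUMPTION: core perf = sqrt(resource).
    """
    big = sqrt(big_resource)
    small = sqrt(small_resource)
    parallel_perf = n_small * small
    if use_big_in_parallel:
        parallel_perf += big
    serial_time = (1.0 - f) / big        # sequential part on the larger core
    parallel_time = f / parallel_perf    # parallel part on the chosen cores
    return 1.0 / (serial_time + parallel_time)


# Hypothetical chip: 1 medium core + 12 small cores, f = 98%.
all_cores = asymmetric_speedup(0.98, 1 / 4, 12, 1 / 16)
big_idle = asymmetric_speedup(0.98, 1 / 4, 12, 1 / 16,
                              use_big_in_parallel=False)
```

Leaving the larger core idle in the parallel section costs some throughput but can simplify scheduling, since all parallel work then runs on identical cores.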

31 Merge sort (-core)

32 Merge sort (-core)

33 Merge sort (8-core)

34 $T_\infty$ (span): critical path length

35 $T_1$ (work): time to run sequentially

36 $T_{serial}$: optimized sequential code

37 A good multi-core parallel algorithm
$T_1 / T_{serial}$ is low
  What we lose on sequential performance we must make up through parallelism
  Resource availability may limit the ability to do that
$T_\infty$ grows slowly with the problem size
  We tackle bigger problems by using more cores, not by running for longer

38 Quick-sort on good input
$T_1 = O(n \log n)$, $T_\infty = O(n)$
[Diagram labels: sequential partition; recurse; recurse; sequential append.]

39 Quick-sort on good input
$T_1 = O(n \log n)$, $T_\infty = O(n)$
[Diagram labels: parallel partition; recurse; recurse; parallel append.]

40 Quick-sort on good input
$T_1 = O(n \log n)$, $T_\infty = O(\log^2 n)$
[Diagram labels: parallel partition; recurse; recurse; parallel append.]
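The work and span figures can be checked with simple recurrences on a perfectly balanced input. This is a unit-cost sketch under stated assumptions: a sequential partition costs n, a parallel partition has span log2(n), and the append cost is folded into the partition. Which steps run in parallel in each variant is a reading of the diagrams, and all function names are hypothetical.

```python
from math import log2


def work(n):
    """T_1: every element is touched once per level, both halves are sorted."""
    return 1 if n <= 1 else n + 2 * work(n // 2)


def span_seq_partition(n):
    """Sequential partition, the two recursive calls run in parallel."""
    return 1 if n <= 1 else n + span_seq_partition(n // 2)


def span_par_partition_seq_recursion(n):
    """Parallel partition, the two recursive calls run one after the other."""
    return 1 if n <= 1 else log2(n) + 2 * span_par_partition_seq_recursion(n // 2)


def span_fully_parallel(n):
    """Parallel partition and parallel recursive calls."""
    return 1 if n <= 1 else log2(n) + span_fully_parallel(n // 2)


n = 1024
summary = (work(n), span_seq_partition(n),
           span_par_partition_seq_recursion(n), span_fully_parallel(n))
```

For n = 1024 the first two spans stay linear in n, while the fully parallel span is the sum 10 + 9 + ... + 1 plus the base case, which grows as log squared.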

41 Scheduling from a DAG
[Plot: execution time against #cores, with the bounds $T_p \ge T_1 / P$ and $T_p \ge T_\infty$.]

42 Scheduling from a DAG
[Plot: speedup ($T_1 / T_p$) against #cores, with the bounds $T_p \ge T_1 / P$ and $T_p \ge T_\infty$.]
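The two bounds combine into a simple envelope for any schedule of the DAG. A minimal sketch with hypothetical function names and illustrative values for work and span:

```python
def time_lower_bound(t1, t_inf, p):
    """No schedule on p cores can beat either T_1 / p or T_inf."""
    return max(t1 / p, t_inf)


def speedup_upper_bound(t1, t_inf, p):
    """Best possible speedup T_1 / T_p on p cores: min(p, T_1 / T_inf)."""
    return t1 / time_lower_bound(t1, t_inf, p)


# Illustrative DAG: work 100, span 10, so parallelism T_1 / T_inf is 10.
few_cores = speedup_upper_bound(100, 10, 4)     # limited by core count
many_cores = speedup_upper_bound(100, 10, 20)   # limited by the span
```

Below T_1 / T_inf cores the work bound dominates; above it, adding cores cannot help.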

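43 In Cilk: $T_p \approx T_1 / p + c_\infty T_\infty$
Given abundant parallelism ($T_\infty \ll T_1$): work-first principle
  Keep $T_1$ close to $T_{serial}$
  Put overhead on $T_\infty$ to keep $T_1$ low
[Plot: speedup ($T_1 / T_p$) against #cores, with the bounds $T_p \ge T_1 / P$ and $T_p \ge T_\infty$ and a curve for c=.]

44 In Cilk: $T_p \approx T_1 / p + c_\infty T_\infty$
Given abundant parallelism ($T_\infty \ll T_1$): work-first principle
  Keep $T_1$ close to $T_{serial}$
  Put overhead on $T_\infty$ to keep $T_1$ low
[Plot: speedup ($T_1 / T_p$) against #cores, with the bounds $T_p \ge T_1 / P$ and $T_p \ge T_\infty$ and curves for c= and c=.]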
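The work-first principle can be illustrated by plugging numbers into the Cilk performance model. The function name and every value below are illustrative assumptions; the constants plotted on the slide are not legible in this copy.

```python
def cilk_time(t1, t_inf, p, c_inf=1.0):
    """Cilk performance model: T_p is roughly T_1 / p + c_inf * T_inf."""
    return t1 / p + c_inf * t_inf


# Baseline: work 1000, span 10, 4 cores.
baseline = cilk_time(1000, 10, 4)              # 260.0
# Doubling the overhead charged to the span costs 10 time units.
span_overhead = cilk_time(1000, 10, 4, 2.0)    # 270.0
# A 10% overhead on the work costs 25 time units.
work_overhead = cilk_time(1100, 10, 4)         # 285.0
```

When parallelism is abundant, the span term is small, so overhead moved onto the span is cheaper than overhead left on the work.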

45 Why parallelism? Amdahl's law. Why asymmetric performance? Parallel algorithms. Multi-processing hardware.

46 Multi-threaded h/w
[Diagram: multi-threaded single core (ALU, ALU); L1 cache (6KB); L2 cache (MB); main memory.]
Multiple threads in a workload with:
  Poor spatial locality
  Frequent memory accesses

47 Multi-threaded h/w
[Diagram: multi-threaded single core (ALU, ALU); L1 cache (6KB); L2 cache (MB); main memory.]
Multiple threads with synergistic resource needs

48 Multi-core h/w: common L2
[Diagram: two cores with two ALUs each; three caches; main memory.]

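49 Multi-core h/w: common L2. Read.
[Diagram: three caches, main memory. No state labels.]

50 Multi-core h/w: common L2. Read.
[Diagram: three caches, main memory. Cache-line state labels shown: S.]

51 Multi-core h/w: common L2. Read.
[Diagram: three caches, main memory. Cache-line state labels shown: S, S.]

52 Multi-core h/w: common L2. Read.
[Diagram: three caches, main memory. Cache-line state labels shown: S, S.]

53 Multi-core h/w: common L2. Read.
[Diagram: three caches, main memory. Cache-line state labels shown: S, S, S.]

54 Multi-core h/w: common L2. Write.
[Diagram: three caches, main memory. Cache-line state labels shown: S, S, S.]

55 Multi-core h/w: common L2. Write.
[Diagram: three caches, main memory. Cache-line state labels shown: S, S.]

56 Multi-core h/w: common L2. Write.
[Diagram: three caches, main memory. Cache-line state labels shown: S.]

57 Multi-core h/w: common L2. Write.
[Diagram: three caches, main memory. Cache-line state labels shown: I.]

58 Multi-core h/w: common L2. Write.
[Diagram: three caches, main memory. Cache-line state labels shown: M, I, E.]

59 Multi-core h/w: common L2. Write.
[Diagram: three caches, main memory. Cache-line state labels shown: M, I, E.]

60 Multi-core h/w: common L2. Write.
[Diagram: three caches, main memory. Cache-line state labels shown: M, E.]

61 Multi-core h/w: common L2. Write.
[Diagram: three caches, main memory. Cache-line state labels shown: M.]

62 Multi-core h/w: common L2. Write.
[Diagram: three caches, main memory. Cache-line state labels shown: I, M.]

63 Multi-core h/w: common L2. Write.
[Diagram: three caches, main memory. Cache-line state labels shown: I, E, M.]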
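The M, E, S and I labels in this sequence are the states of a MESI-style coherence protocol. The sketch below is a simplified textbook model of one cache line held in private caches only; it omits the shared L2 shown on the slides and is not claimed to reproduce their exact sequence. The function names are hypothetical.

```python
def read(states, i):
    """Core i reads the line; returns the new per-cache state list."""
    if states[i] != "I":
        return list(states)              # hit: no state change
    new = list(states)
    holders = [j for j, s in enumerate(states) if j != i and s != "I"]
    for j in holders:
        new[j] = "S"                     # M or E holders downgrade to shared
    new[i] = "S" if holders else "E"     # exclusive if nobody else has a copy
    return new


def write(states, i):
    """Core i writes the line; every other copy is invalidated."""
    new = ["I"] * len(states)
    new[i] = "M"
    return new


line = ["I", "I"]          # two private caches, line initially uncached
line = read(line, 0)       # ['E', 'I']
line = read(line, 1)       # ['S', 'S']
line = write(line, 0)      # ['M', 'I']
line = write(line, 1)      # ['I', 'M']
```

The expensive steps are the ones that change another cache's state: each write to a shared line forces an invalidation, which is why two cores writing the same line slow each other down.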

64 Multi-core h/w: separate L2
[Diagram: four caches, main memory.]

65 Multi-core h/w: additional L3
[Diagram: five caches, main memory.]

66 Multi-threaded multi-core h/w
[Diagram: five caches, main memory.]

67 SMP multiprocessor
[Diagram: four caches, main memory.]

68 NUMA multiprocessor
[Diagram: four nodes joined by an interconnect; each node has four caches and its own memory & directory.]

69 Three kinds of parallel hardware
Multi-threaded cores
  Increase utilization of a core or memory b/w
  Peak ops/cycle fixed
Multiple cores
  Increase ops/cycle
  Don't necessarily scale caches and off-chip resources proportionately
Multi-processor machines
  Increase ops/cycle
  Often scale cache & memory capacities and b/w proportionately


70 [Plot: AMD Phenom. Axis labels: throughput per core (elements per ms); wall-clock execution time (ms).]
