Multicore Programming: Parallel Hardware and Performance (8 Nov). Peter Sewell, Jaroslav Ševčík, Tim Harris
Merge sort, 6MB input (-bit integers): Recurse(left) and Recurse(right) account for ~98% of execution time; merging to the scratch array and copying back to the input array account for the remaining ~2%.
Merge sort, dual-core, 6MB input (-bit integers): Recurse(left) and Recurse(right) now run in parallel, one per core; the merge to the scratch array and the copy back to the input array remain sequential.
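A minimal sketch of the dual-core idea in C with OpenMP tasks (not the code behind these measurements): the two recursive calls run in parallel and the merge and copy-back stay sequential; merge() and the scratch array tmp are assumed helpers.

```c
/* Sketch only: merge sort with the two recursive calls run in parallel
 * (OpenMP tasks). merge() is an assumed sequential helper that merges
 * a[lo..mid) and a[mid..hi) into tmp[lo..hi). Compile with -fopenmp. */
#include <string.h>

void merge(int *a, int *tmp, int lo, int mid, int hi); /* assumed helper */

void msort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    #pragma omp task shared(a, tmp)     /* Recurse(left) on another core */
    msort(a, tmp, lo, mid);
    msort(a, tmp, mid, hi);             /* Recurse(right) on this core   */
    #pragma omp taskwait                /* both halves must be sorted    */
    merge(a, tmp, lo, mid, hi);         /* merge to scratch array        */
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof a[0]); /* copy back */
}
```

The top-level call would sit inside `#pragma omp parallel` and `#pragma omp single`. Since only the ~98% recursive part is parallelised, the sequential merge and copy-back cap the achievable speedup, which is what the following measurements explore.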
Merge sort on a T700 dual-core laptop [plot: throughput per core (elements per ms) vs wall-clock execution time (ms)]
Merge sort on an AMD Phenom 600 [plot: throughput per core (elements per ms) vs wall-clock execution time (ms)]
The power wall
Moore's law: the area of a transistor is about 50% smaller each generation, so a given chip area accommodates twice as many transistors.
Dynamic power: P_dyn = α · f · V²
Shrinking process technology (constant field scaling) allows reducing V to partially counteract increasing f, but V cannot be reduced arbitrarily: halving V more than halves the max f (for a given transistor), and there are physical limits.
Leakage power: P_leak = V · (I_sub + I_ox)
Reduce I_sub (sub-threshold leakage): turn the component off, or increase the threshold voltage (which reduces max f).
Reduce I_ox (gate-oxide leakage): increase the oxide thickness (but it needs to decrease with process scale).
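As a rough worked example with made-up numbers (none of them from the slide), the V² term is why pushing the clock higher, which also tends to require a higher V, blows up the power budget:

```c
/* Illustrative only: dynamic power P_dyn = alpha * f * V^2 for two
 * invented operating points, showing why a higher f (which needs a
 * higher V) costs much more than proportionally in power. */
#include <stdio.h>

static double p_dyn(double alpha, double f, double v) {
    return alpha * f * v * v;
}

int main(void) {
    double alpha  = 1.0;                       /* arbitrary activity/capacitance factor */
    double base   = p_dyn(alpha, 2.0e9, 1.0);  /* 2 GHz at 1.0 V (made up) */
    double pushed = p_dyn(alpha, 4.0e9, 1.2);  /* 4 GHz needing 1.2 V (made up) */
    printf("relative dynamic power at 2x clock: %.2fx\n", pushed / base);
    return 0;
}
```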
The memory wall [plot: operations per µs for CPU vs memory, diverging over time; annotation: a CPU clock in the MHz range and ~00 ns memory access time in the early 1980s]
The memory wall [plot continued: operations per µs for CPU vs memory; cycles per memory access now in the hundreds]
The ILP wall
ILP = instruction-level parallelism: implicit parallelism between instructions in a single thread, identified by the hardware.
Speculate past memory accesses; speculate past control transfers.
Diminishing returns.
Power wall + ILP wall + memory wall = brick wall
The power wall means we can't just clock processors faster any longer.
The memory wall means that many workloads' performance is dominated by memory access times.
The ILP wall means we can't find extra work to keep functional units busy while waiting for memory accesses.
Why parallelism? · Amdahl's law · Why asymmetric performance? · Parallel algorithms · Multi-processing hardware
Amdahl's law: sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with n cores, how many cores do you need to use to reach a given target speed-up on the overall program?
[Plot: Amdahl's law, f=70%: speedup vs #cores (1-16); the speedup achieved with perfect scaling on the 70%, against the desired target speedup]
Amdahl's law, f=70%: MaxSpeedup(f, c) = 1 / ((1 - f) + f / c), where f = fraction of code the speedup applies to and c = number of cores used.
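A small C sketch of this formula; it also answers the sorting question numerically by searching for the smallest core count that reaches a target speedup (the 2x target below is only an example, not taken from the slide):

```c
/* Amdahl's law: MaxSpeedup(f, c) = 1 / ((1 - f) + f / c).
 * Also finds the smallest c achieving a target speedup, if one exists. */
#include <stdio.h>

static double max_speedup(double f, double c) {
    return 1.0 / ((1.0 - f) + f / c);
}

int main(void) {
    double f = 0.70;          /* fraction of the program that scales */
    double target = 2.0;      /* example target speedup (assumed)    */
    printf("limit as c -> infinity: %.2f\n", 1.0 / (1.0 - f));
    for (int c = 1; c <= 64; c++) {
        if (max_speedup(f, c) >= target) {
            printf("need %d cores for %.1fx\n", c, target);
            return 0;
        }
    }
    printf("unreachable: %.1fx exceeds the 1/(1-f) limit\n", target);
    return 0;
}
```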
[Plot: Amdahl's law, f=70%: speedup vs #cores (1-16); the speedup achieved with perfect scaling on the 70% flattens towards the limit as c → ∞ of 1/(1-f) ≈ 3.3]
[Plot: Amdahl's law, f=10%: speedup achieved with perfect scaling vs #cores (1-16); the Amdahl's-law limit is just 1.1x]
[Plot: Amdahl's law, f=98%: speedup vs #cores, shown up to roughly 128 cores; even with f=98% the speedup remains far below linear]
Amdahl's law & multi-core: suppose that the same h/w budget (space or power) can make us either one big core or many smaller ones [diagram: a grid of 16 small cores, numbered 1-16].
Perf of big & small cores. Assumption: perf ∝ √resources. [Plot: core perf (relative to the big core) vs resources dedicated to the core (1/16, 1/8, 1/4, 1/2, 1). 16 small cores: total perf = 16 × 1/4 = 4. 4 medium cores: total perf = 4 × 1/2 = 2.]
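A brief sketch combining this resource model with Amdahl's law for the symmetric designs compared on the next slides, assuming (as above) that a core built from a fraction r of the budget has performance √r relative to the big core; the 1/4/16-core configurations are those suggested by the slides:

```c
/* Sketch: symmetric designs under a fixed budget of one "big core" of
 * resources split into n equal cores, each with perf sqrt(1/n)
 * (assumed perf ~ sqrt(resources)). Perf is relative to 1 big core. */
#include <math.h>
#include <stdio.h>

static double symmetric_perf(double f, int n) {
    double per_core = sqrt(1.0 / n);       /* perf of each smaller core          */
    double seq = (1.0 - f) / per_core;     /* sequential part on one such core   */
    double par = f / (n * per_core);       /* parallel part on all n cores       */
    return 1.0 / (seq + par);
}

int main(void) {
    double f = 0.98;                        /* example parallel fraction */
    int configs[] = {1, 4, 16};             /* big, medium, small cores  */
    for (int i = 0; i < 3; i++)
        printf("%2d cores: perf %.2f (vs 1 big core)\n",
               configs[i], symmetric_perf(f, configs[i]));
    return 0;
}
```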
[Plot: Amdahl's law, f=98%: perf (relative to the big core) vs #cores used, comparing 1 big core, 4 medium cores, and 16 small cores]
[Plot: Amdahl's law, f=7?%: perf (relative to the big core) vs #cores used, comparing 1 big core, 4 medium cores, and 16 small cores]
[Plot: Amdahl's law with a smaller parallel fraction f: perf (relative to the big core) vs #cores used, comparing 1 big core, 4 medium cores, and 16 small cores]
Why parallelism? · Amdahl's law · Why asymmetric performance? · Parallel algorithms · Multi-processing hardware
Asymmetric chips [diagram: the same budget spent on one big core plus the remaining small cores of the 16-core grid]
[Plot: Amdahl's law, f=7?%: perf (relative to the big core) vs #cores, adding an asymmetric configuration (1 big core + the remaining small cores) to the 1 big / 4 medium / 16 small comparison]
[Plot: Amdahl's law with a smaller parallel fraction f: perf (relative to the big core) vs #cores, comparing the asymmetric configuration (1 big core + small cores) with 1 big, 4 medium, and 16 small cores]
[Plot: Amdahl's law, f=98%: perf (relative to the big core) vs #cores, comparing the asymmetric configuration (1 big core + small cores) with 1 big, 4 medium, and 16 small cores]
[Plot: Amdahl's law, f=98%: speedup (relative to the big core) vs #cores, comparing an asymmetric configuration (1 big core + many small cores) with an all-small-core configuration]
[Plot: as above, with a further variant in which the larger core is left idle in the parallel section]
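A sketch of the asymmetric case in the same spirit (a Hill-and-Marty-style model, which is what these slides appear to follow): one big core of relative performance big_perf plus n small cores, with the sequential fraction running on the big core; the big_idle flag covers the variant above where the larger core sits idle during the parallel section. All numeric values are illustrative.

```c
/* Sketch: Amdahl's law for an asymmetric chip: 1 big core (perf big_perf,
 * relative to a small core) plus n_small small cores (perf 1 each).
 * The sequential part runs on the big core; the parallel part runs either
 * on everything or, if big_idle is set, on the small cores only.
 * Speedup here is relative to running everything on 1 small core. */
#include <stdio.h>

static double asym_speedup(double f, double big_perf, int n_small, int big_idle) {
    double seq = (1.0 - f) / big_perf;
    double par_capacity = (big_idle ? 0.0 : big_perf) + n_small;
    return 1.0 / (seq + f / par_capacity);
}

int main(void) {
    double f = 0.98, big_perf = 2.0;  /* illustrative: big core = 2x a small core */
    int n = 12;                       /* illustrative small-core count            */
    printf("big core busy in parallel part: %.2f\n", asym_speedup(f, big_perf, n, 0));
    printf("big core idle in parallel part: %.2f\n", asym_speedup(f, big_perf, n, 1));
    return 0;
}
```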
Why parallelism? · Amdahl's law · Why asymmetric performance? · Parallel algorithms · Multi-processing hardware
Merge sort (…-core) [diagram: recursive decomposition across the cores]
Merge sort (…-core) [diagram]
Merge sort (8-core) [diagram]
T∞ (span): critical path length
T1 (work): time to run sequentially
T_serial: optimized sequential code
A good multi-core parallel algorithm:
T1 / T_serial is low: what we lose on sequential performance we must make up through parallelism, and resource availability may limit the ability to do that.
T∞ grows slowly with the problem size: we tackle bigger problems by using more cores, not by running for longer.
Quick-sort on good input, with a sequential partition and a sequential append around the two recursive calls: T1 = O(n log n), T∞ = O(n).
Quick-sort on good input, with a parallel partition and a parallel append: T1 = O(n log n), T∞ = O(log² n).
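A rough C/OpenMP sketch of the first version, parallelising only the two recursive calls around a sequential partition (partition() is an assumed helper). This spreads the work across cores but leaves T∞ at O(n), because the top-level partition is still a sequential pass over all n elements; getting the span down further needs the partition and append themselves to be parallel, e.g. built from parallel prefix sums.

```c
/* Sketch: quicksort with parallel recursion but a sequential partition.
 * partition() is an assumed helper over a[lo..hi) returning the pivot's
 * final index. Work T1 = O(n log n) on good input; span T_inf stays O(n)
 * because the first partition alone touches all n elements sequentially. */
int partition(int *a, int lo, int hi);   /* assumed sequential helper */

void qsort_par(int *a, int lo, int hi) {
    if (hi - lo < 2) return;
    int p = partition(a, lo, hi);        /* sequential: O(hi - lo) span   */
    #pragma omp task shared(a)
    qsort_par(a, lo, p);                 /* recurse on the left part      */
    qsort_par(a, p + 1, hi);             /* recurse on the right part     */
    #pragma omp taskwait
}
```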
Scheduling from a DAG [plot: execution time vs #cores (1-16); execution time T_p is bounded below by both T1/P and T∞]
Scheduling from a DAG [plot: speedup (T1/T_p) vs #cores (1-16); bounded by the linear T1/P line and flattening at the T1/T∞ limit]
In Cilk: T_p ≤ T1/p + c∞ · T∞, where c∞ is the critical-path overhead.
Given abundant parallelism (T∞ << T1), the work-first principle: keep T1 close to T_serial, and put overhead on T∞ to keep T1 low.
[Plot: speedup (T1/T_p) vs #cores, against the T1/P bound, for two values of c∞]
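A tiny sketch that evaluates this bound for made-up values of T1, T∞ and c∞, showing how the predicted speedup flattens once p approaches the parallelism T1/T∞:

```c
/* Sketch: Cilk-style running-time model T_p <= T1/p + c_inf * T_inf,
 * evaluated for illustrative values (not taken from the slides). */
#include <stdio.h>

int main(void) {
    double t1 = 1000.0, t_inf = 10.0, c_inf = 1.5;   /* made-up numbers */
    printf("parallelism T1/T_inf = %.0f\n", t1 / t_inf);
    for (int p = 1; p <= 16; p *= 2) {
        double tp = t1 / p + c_inf * t_inf;          /* modelled running time */
        printf("p=%2d  T_p <= %.1f  speedup >= %.2f\n", p, tp, t1 / tp);
    }
    return 0;
}
```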
Why parallelism? · Amdahl's law · Why asymmetric performance? · Parallel algorithms · Multi-processing hardware
Multi-threaded h/w [diagram: a multi-threaded single core with two ALUs, a small L1 cache, an MB-scale L2 cache, and main memory]. Multiple threads in a workload with: poor spatial locality; frequent memory accesses.
Multi-threaded h/w [diagram: as before]. Multiple threads with synergistic resource needs.
Multi-core h/w, common L2 [diagram: two cores, each with its own ALUs and a private L1 cache, sharing a common L2 cache and main memory]
Multi-core h/w, common L2: coherence example [sequence of diagrams stepping through reads and writes. Reads install the line in the reading core's L1 and in the shared L2 in shared (S) state, so after both cores read there are S copies in both L1s. A write then invalidates (I) the other L1's copy and leaves the writer's L1 copy modified (M), with the L2 tracking the line as exclusive/modified (E/M); a subsequent write by the other core invalidates the first writer's copy in turn and takes over the line.]
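The invalidations above are also why per-thread data that happens to share a cache line can kill scaling (false sharing). A hedged sketch in C with OpenMP: in the first loop the two threads' counters sit on the same line, so it ping-pongs between the L1 caches; in the second each counter is padded onto its own line. The 64-byte line size is an assumption.

```c
/* Sketch: false sharing. In the first loop the two threads' counters sit
 * in the same cache line, so every increment bounces the line (M <-> I)
 * between the cores' L1 caches; in the second loop each counter is padded
 * onto its own line. Assumes 64-byte lines; compile with -fopenmp. */
#include <omp.h>
#include <stdio.h>

#define ITERS 100000000L

volatile long shared_line[2];                                  /* same line (likely)    */
struct padded { volatile long v; char pad[64 - sizeof(long)]; };
struct padded separate[2];                                     /* one line each (assumed) */

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) shared_line[t]++;     /* falsely shared     */
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) separate[t].v++;      /* padded: no sharing */
    }
    double t2 = omp_get_wtime();
    printf("falsely shared: %.2fs   padded: %.2fs\n", t1 - t0, t2 - t1);
    return 0;
}
```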
Multi-core h/w, separate L2 [diagram: two cores, each with its own private L1 and L2 caches, over main memory]
Multi-core h/w, additional L3 [diagram: two cores with private L1 and L2 caches, a shared L3 cache, and main memory]
Multi-threaded multi-core h/w [diagram: two multi-threaded cores, each with private L1 and L2 caches, a shared L3 cache, and main memory]
SMP multiprocessor [diagram: multiple processor chips, each with its own caches, sharing a single main memory]
NUMA multiprocessor [diagram: multiple nodes, each with several cores and caches plus local memory & directory, connected by an interconnect]
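On such machines it matters where memory is allocated. A hedged sketch using libnuma on Linux (link with -lnuma): the array is placed on one node and the thread runs on that node, so its accesses stay local rather than crossing the interconnect.

```c
/* Sketch: NUMA-aware allocation with libnuma (Linux; link with -lnuma).
 * Memory placed on one node is cheap to access from that node's cores and
 * more expensive from the others, since those accesses cross the interconnect. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {                 /* no NUMA support on this system */
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    size_t bytes = 64 << 20;                    /* 64 MB, illustrative size     */
    int node = 0;                               /* place the data on node 0     */
    long *a = numa_alloc_onnode(bytes, node);
    if (!a) return 1;
    numa_run_on_node(node);                     /* run this thread on that node */
    for (size_t i = 0; i < bytes / sizeof *a; i++)
        a[i] = (long)i;                         /* local accesses, no interconnect hops */
    numa_free(a, bytes);
    return 0;
}
```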
Three kinds of parallel hardware:
Multi-threaded cores: increase utilization of a core or of memory bandwidth; peak ops/cycle fixed.
Multiple cores: increase ops/cycle; don't necessarily scale caches and off-chip resources proportionately.
Multi-processor machines: increase ops/cycle; often scale cache & memory capacities and bandwidth proportionately.
Merge sort on an AMD Phenom 600 (repeated from earlier) [plot: throughput per core (elements per ms) vs wall-clock execution time (ms)]