Multicore from an Application's Perspective. Erik Hagersten, Uppsala Universitet


1 Multicore from an Application's Perspective. Erik Hagersten, Uppsala Universitet

2 Communication in an SMP [Figure: memory locations A and B in shared memory, three threads each behind a private cache ($); access trace: Read A, Read A, Read A, ..., Read A, Write A, Read B, Read A]

3 Communication in a NUMA: Worse [Figure: two NUMA nodes, each with memory (Mem), a cache ($) and a thread, connected by an interconnect with remote-access latency ~300-500 ns; access trace: Read A, Read A, Read A, ..., Read A, Write A]

4 Communication in [some] Multicores [Figure: chip with four simple, fast CPU cores, each with a local cache ($), sharing an on-chip L2$, a memory interface (Mem I/F) to Mem (~ns latency), and an external I/F; two threads on the chip communicate by one reading A and the other writing A]

5 Looks and Smells Like an SMP? [Figure: side-by-side comparison. SMP: CPUs with private L1 and L2 caches under an interconnect and shared memory. Multicore: cores with small private L1 caches sharing one on-chip L2 and one path to memory.] Well, how about: cost of thread communication? Cache capacity per thread? Memory bandwidth per thread? Gotta optimize for locality!

6 Criteria for HPC Algorithms. Past: minimize communication, maximize scalability (large numbers of CPUs). Multicores today: communication is for free [on some multicores], scalability is limited to 32 threads, the caches are tiny, and memory bandwidth is scarce. Data locality is key! (Both for capacity and capability computing!)

7 Case Study: Gauss-Seidel on Multicores. Erik Hagersten, Sweden. Thanks: Dan Wallin (arch), Henrik Löf (sci comp) and Sverker Holmgren (sci comp). From Wallin et al., ICS 2006.

8 Natural Order Gauss-Seidel [Figure: grid sweep; legend: sweep path, previous value, current value, data dependence, iteration numbers 1,2,3,4, cacheline layout]

9 Natural Order Gauss-Seidel. IF (convergence_test) <done> else <iterate again> [Figure: legend as before: sweep path, previous/current values, data dependence, iteration numbers 1,2,3,4, cacheline layout]

10 Natural Order Gauss-Seidel [Figure: legend as before] Data dependence => poor parallelism
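To make the dependence concrete, here is a minimal sketch of one natural-order sweep in code (my illustration, assuming a 5-point Laplace-style stencil on an N x N grid; the slides show no source code). Updating the grid in place is exactly what serializes the sweep:

```c
/* Minimal sketch of one natural-order Gauss-Seidel sweep
 * (assumed 5-point stencil, row-major N x N array u).
 * Each point reads the west and north neighbors that were
 * already updated earlier in this same sweep -- the data
 * dependence that ruins parallelism. */
void gs_natural_sweep(double *u, int N)
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            u[i*N + j] = 0.25 * (u[(i-1)*N + j] + u[(i+1)*N + j] +
                                 u[i*N + j-1]   + u[i*N + j+1]);
}
```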

11 Red-Black Gauss-Seidel [Figure: checkerboard grid; black points carry iteration label 0,5; legend as before: sweep path, previous/current values, data dependence, iteration numbers 1,2,3,4, cacheline layout]

12 Red-Black Gauss-Seidel, step 0,5: update the blacks [Figure: as before; black points now at iteration 0,5]

13 Red-Black Gauss-Seidel, step 1: update all reds. Update all blacks <barrier> Update all reds <barrier> [Figure: legend as before] Great parallelism!!!

14 Red-Black Gauss-Seidel, parallel version: threads 0-4 each take a stripe of the grid. IN PARALLEL { Update all blacks <barrier> Update all reds <barrier> } [Figure: legend as before]
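As one concrete rendering of this pseudocode (the OpenMP choice is mine, not the slides'), the two implicit barriers after the parallel loops correspond to the <barrier> steps above:

```c
/* Sketch of the parallel red-black iteration (OpenMP assumed).
 * Points with (i+j) odd are "black", (i+j) even are "red";
 * all points of one color can be updated independently. */
#include <omp.h>

void rbgs_iteration(double *u, int N)
{
    /* update all blacks; implicit barrier at loop end */
    #pragma omp parallel for
    for (int i = 1; i < N - 1; i++)
        for (int j = (i % 2 == 0) ? 1 : 2; j < N - 1; j += 2)
            u[i*N + j] = 0.25 * (u[(i-1)*N + j] + u[(i+1)*N + j] +
                                 u[i*N + j-1]   + u[i*N + j+1]);

    /* update all reds; implicit barrier at loop end */
    #pragma omp parallel for
    for (int i = 1; i < N - 1; i++)
        for (int j = (i % 2 == 0) ? 2 : 1; j < N - 1; j += 2)
            u[i*N + j] = 0.25 * (u[(i-1)*N + j] + u[(i+1)*N + j] +
                                 u[i*N + j-1]   + u[i*N + j+1]);
}
```

Note that each full iteration now streams the whole grid twice, once per color, which is exactly the drawback the next slide raises.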

15 Any Drawbacks of the Red-Black? Poor cache locality: red-black brings each element into the cache twice per iteration, whereas natural order brings each element in once per iteration. You can do even better: natural order with temporal blocking.

16 G-S, temporal blocking: several iterations per sweep [Figure: legend as before, plus the active region]

17 G-S, temporal blocking: several iterations per sweep [Figure: legend as before, plus the active region] In this case: 4 iterations per sweep (σ = 4). σ = 1 for natural-order G-S; σ = 0,5 for red-black G-S.
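The skewing behind "several iterations per sweep" is easiest to see in 1D; the sketch below is my illustration, not code from the slides. Iteration s+1 trails iteration s by one point inside the moving active region, so every update still sees its left neighbor at the new iteration and its right neighbor at the previous one:

```c
/* Sketch of temporally blocked Gauss-Seidel in 1D: one pass of
 * the "front" applies sigma iterations to every interior point
 * of u[0..n-1]. At each front position, iteration s updates
 * point front-s, so later iterations trail earlier ones and
 * the Gauss-Seidel dependences are respected. Only the active
 * region (about sigma points wide) must stay cache-resident. */
void gs1d_temporal(double *u, int n, int sigma)
{
    for (int front = 1; front < n - 1 + sigma; front++)
        for (int s = 0; s < sigma; s++) {
            int i = front - s;      /* iteration s+1 applied at point i */
            if (i >= 1 && i < n - 1)
                u[i] = 0.5 * (u[i-1] + u[i+1]);
        }
}
```

Each element is now updated σ times per pass over memory, so the bandwidth demand drops roughly by a factor of σ, which is what the miss-ratio graph on slide 20 shows.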

18 G-S 3D, σ=2

19 G-S 3D, σ=2

20 Acumem Graph, 3D, N=129 [Figure: cache miss ratio (percent) vs. cache size (512 KB up to 32 MB), one curve per σ = 0,5, 1, 2, 4, 8, 16] Miss ratio ~ memory bandwidth

21 Parallel G-S, temporal blocked: threads 0-3 sweep in parallel, separated by synchronization flags. [Figure: legend as before, plus active region and sync-flag iteration number] Wait until lefty is done: lots of communication (producer/consumer flag plus sharing of data values).
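A hedged sketch of the "wait until lefty is done" flags (names, granularity, and the C11-atomics choice are mine, not the paper's code): each thread publishes the iteration it has completed, and its right neighbor spins on that flag before advancing.

```c
/* Sketch of producer/consumer synchronization flags between
 * neighboring threads (illustrative only). */
#include <stdatomic.h>

#define MAX_THREADS 32

/* last iteration each thread has completed on its boundary */
atomic_int done_iter[MAX_THREADS];

/* consumer side: thread t waits until its left neighbor has
 * finished iteration s -- "wait until lefty is done" */
void wait_for_lefty(int t, int s)
{
    if (t == 0) return;              /* leftmost thread has no producer */
    while (atomic_load_explicit(&done_iter[t-1],
                                memory_order_acquire) < s)
        ;                            /* spin: this is the communication */
}

/* producer side: publish completion of iteration s */
void signal_done(int t, int s)
{
    atomic_store_explicit(&done_iter[t], s, memory_order_release);
}
```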

22 Parallel G-S 3D [Figure: four threads t0-t3 sweeping the 3D grid]

23 Parallel G-S 3D, cacheline layout (size B bytes), threads t0-t3. Startup cost = (#threads-1)/(Nσ). Communication: one flag synchronization per N²/#threads ops; one communication miss per B·N/#threads bytes.
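To make these formulas concrete (example numbers assumed for illustration, not from the slides): with N = 257, 4 threads, and σ = 4, the startup cost is (4-1)/(257·4) ≈ 0.3% of a sweep, and each thread performs only one flag synchronization per 257²/4 ≈ 16,500 updates, so the synchronization overhead stays small.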

24 Parallel Execution Time [Figure: execution time per step (sec), y-axis 0,5 to 2,5, for red-black (RBGS) vs. temporally blocked natural-order (TBGS) G-S at several thread counts]

25 Performance Comparison with Red-Black (σ = 0,5). N = 257, 32 threads (Sun E15K, UltraSPARC IIIcu = SMP!!) [Figure: performance ratio TBGS/RBGS, y-axis 1,0 to 3,5, vs. number of threads, one curve per σ = 1, 2, 4, 8, 16]

26 Multicore Simulation [Figure: performance ratio TBGS/RBGS, y-axis 0 to 2,5, vs. number of threads, comparing the Sun 15k SMP with a simulated multicore]

27 Using a Gauss-Seidel Smoother in a Multigrid. G-S is an important part of many real apps! Example: G-S as the smoother in multigrid, an iterative algorithm; a more efficient smoother cuts the number of iterations.

28 One-slide summary. Today's algorithms assume expensive communication. The communication cost of [some] multicores is close to zero. Locality is becoming key to performance [again]. Redesign HPC algorithms to face this fact! (For both capacity and capability computing.) We show: 3x performance gain and ~3x less bandwidth. Is it time to revisit more algorithms?
