Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Size: px

Start display at page:

Download "Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris"

Eugene Hines
5 years ago
Views:

Electrical and Computer Engineering National Technical University of Athens

1 Early Experiences on Accelerating Dijkstra s Algorithm Using Transactional Memory Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris Computing Systems Laboratory School of Electrical and Computer Engineering National Technical University of Athens {anastop,knikas,goumas,nkoziris}@cslab.ece.ntua.gr May 29, 29 CSLab National Technical University of Athens

2 Outline 1 Dijkstra s Basics 2 Straightforward Parallelization Scheme 3 Helper-Threading Scheme 4 Experimental Evaluation 5 Conclusions Anastopoulos et al. (NTUA) MTAAP 9 May 29, 29 2 / 27

3 The Basics of Dijkstra s Algorithm SSSP Problem Directed graph G = (V, E), weight function w : E R +, source vertex s v V : compute δ(v) = min{w(p) : s p v} Shortest path estimate d(v) gradually converges to δ(v) through relaxations relax (v,w): d(w) = min{d(w), d(v) + w(v, w)} can we find a better path s p w by going through v? Three partitions of vertices Settled: d(v) = δ(v) Queued: d(v) > δ(v) and d(v) Unreached: d(v) = Anastopoulos et al. (NTUA) MTAAP 9 May 29, 29 3 / 27

4 The Basics of Dijkstra s Algorithm Serial algorithm Input : G = (V, E), w : E R +, source vertex s, min Q Output : shortest distance array d, predecessor array π foreach v V do d[v] inf; π[v] nil; Insert(Q, v); end d[s] ; while Q do u ExtractMin(Q); foreach v adjacent to u do sum d[u] + w(u, v); if d[v] > sum then DecreaseKey(Q, v, sum); d[v] sum; π[v] u; end end 55 B A C D S E 7 Anastopoulos et al. (NTUA) MTAAP 9 May 29, 29 4 / 27

5 The Basics of Dijkstra s Algorithm j i k i: Min-priority queue implemented as binary min-heap maintains all but the settled vertices min-heap property: i : d(parent(i)) d(i) amortizes the cost of multiple ExtractMin s and DecreaseKey s O(( E + V )log V ) time complexity Anastopoulos et al. (NTUA) MTAAP 9 May 29, 29 5 / 27

6 Straightforward Parallelization Fine-grain parallelization at the inner loop level Fine-Grain Multi-Threaded S /* Initialization phase same to the serial code */ while Q do Barrier if tid = then u ExtractMin(Q); Barrier for v adjacent to u in parallel do sum d[u] + w(u, v); if d[v] > sum then Begin-Atomic DecreaseKey(Q, v, sum); End-Atomic d[v] sum; π[v] u; end end Issues: 55 B A C D E speedup bounded by average out-degree concurrent heap updates due to DecreaseKey s barrier synchronization overhead Anastopoulos et al. (NTUA) MTAAP 9 May 29, 29 6 / 27

7 Concurrent Heap Updates with Locks x 3 x j j j: i k i:17 6 k: i: k Coarse-grain synchronization (cgs-lock) enforces atomicity at the level of a DecreaseKey operation one lock for the entire heap serializes DecreaseKey s Fine-grain synchronization (fgs-lock) enforces atomicity at the level of a single swap allows multiple swap sequences to execute in parallel as long as they are temporally non-overlapping separate locks for each parent-child pair Anastopoulos et al. (NTUA) MTAAP 9 May 29, 29 7 / 27

8 Performance of FGMT with Locks Multithreaded speedup cgs-lock perfbar+cgs-lock perfbar+fgs-lock Software barriers dominate total execution time 72% with 2 threads, % with replace with idealized (simulated) zero-latency barriers Fgs-lock scheme more scalable, but still fails to outperform serial locking overhead (2 locks + 2 unlocks per swap) Anastopoulos et al. (NTUA) MTAAP 9 May 29, 29 / 27

9 Concurrent Heap Updates with TM x 3 x j j j: i k i:17 6 k: i: k TM-based Coarse-grain synchronization (cgs-tm) enclose DecreaseKey within a transaction allows multiple swap sequences to execute in parallel as long as they are spatially (and temporally) non-overlapping conflicting transaction stalls and retries or aborts Fine-grain synchronization (fgs-tm) enclose each swap operation within a transaction atomicity as in fgs-lock shorter but more transactions Anastopoulos et al. (NTUA) MTAAP 9 May 29, 29 9 / 27

10 Performance of FGMT with TM Multithreaded speedup perfbar+cgs-lock perfbar+fgs-lock perfbar+cgs-tm perfbar+fgs-tm TM-based schemes offer speedup up to 1.1 less overhead for cgs-tm, yet equally able to exploit available concurrency Anastopoulos et al. (NTUA) MTAAP 9 May 29, 29 1 / 27

11 Helper-Threading Scheme Motivation expose more parallelism to each thread eliminate costly barrier synchronization Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

12 Helper-Threading Scheme Motivation expose more parallelism to each thread eliminate costly barrier synchronization Rationale in serial, relaxations are performed only from the extracted (settled) vertex Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

13 Helper-Threading Scheme Motivation expose more parallelism to each thread eliminate costly barrier synchronization Rationale in serial, relaxations are performed only from the extracted (settled) vertex allow relaxations for out-edges of queued vertices, hoping that some of them might already be settled main thread operates as in the serial algorithm assign the next t vertices in the queue (x 2... x t+1 ) to t helper threads helper thread k relaxes all out-edges of vertex x k i 55 B i-1 A C D 2 7 S E 7 speculation on the status of d(x k ) if already optimal, main thread will be offloaded if not optimal, any suboptimal relaxations will be corrected eventually by main thread Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

14 Execution Pattern Serial FGMT Helper Threads step k step k+1 step k+2 Main step k step k+1 step k+2 Thread 4 Thread 3 Thread 2 Thread 1 extract-min read tid th -min relax edges step k step k+1 step k+2 Main kill kill kill Helper 1 kill Helper 2 kill Helper 3 the main thread stops all helpers at the end of each iteration unfinished work will be corrected, as with mis-speculated distances Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

15 Helper-Threading Scheme Main thread Helper thread while Q do u ExtractMin(Q); done ; foreach v adjacent to u do sum d[u] + w(u, v); Begin-Xact if d[v] > sum then DecreaseKey(Q, v, sum); d[v] sum; π[v] u; End-Xact end Begin-Xact done 1; End-Xact end while Q do while done = 1 do ; x ReadMin(Q, tid) stop foreach y adjacent to x and while stop = do Begin-Xact if done = then sum d[x] + w(x, y) if d[y] > sum then DecreaseKey(Q, y, sum) d[y] sum π[y] x else stop 1 End-Xact end end for a single neighbour, the check for relaxation, updates to the heap, and updates to d,π arrays, are enclosed within a transaction performed all-or-none on a conflict, only one thread commits interruption of helper threads implemented through TM, as well Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

16 Helper-Threading Scheme Main thread 1 while Q do 2 u ExtractMin(Q); 3 done ; 4 foreach v adjacent to u do 5 sum d[u] + w(u, v); 6 Begin-Xact 7 if d[v] > sum then DecreaseKey(Q, v, sum); 9 d[v] sum; 1 π[v] u; 11 End-Xact 12 end 13 Begin-Xact 14 done 1; 15 End-Xact 16 end Why with TM? Helper thread while Q do while done = 1 do ; x ReadMin(Q, tid) stop foreach y adjacent to x and while stop = do Begin-Xact if done = then sum d[x] + w(x, y) if d[y] > sum then DecreaseKey(Q, y, sum) d[y] sum π[y] x else stop 1 End-Xact end end Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

17 Helper-Threading Scheme Main thread 1 while Q do 2 u ExtractMin(Q); 3 done ; 4 foreach v adjacent to u do 5 sum d[u] + w(u, v); 6 Begin-Xact 7 if d[v] > sum then DecreaseKey(Q, v, sum); 9 d[v] sum; 1 π[v] u; 11 End-Xact 12 end 13 Begin-Xact 14 done 1; 15 End-Xact 16 end Helper thread 1 while Q do 2 while done = 1 do ; 3 x ReadMin(Q, tid) 4 stop 5 foreach y adjacent to x and while stop = do 6 Begin-Xact 7 if done = then sum d[x] + w(x, y) 9 if d[y] > sum then 1 DecreaseKey(Q, y, sum) 11 d[y] sum 12 π[y] x 13 else 14 stop 1 15 End-Xact 16 end 17 end Why with TM? composable all dependent atomic sub-operations composed into a large atomic operation, without limiting concurrency optimistic easily programmable Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

18 Experimental Setup Full-system simulation Simics in conjunction with GEMS toolset 2.1 boots unmodified Solaris 1 (UltraSPARC III Cu) LogTM ( Signature Edition ) eager version management eager conflict detection on a conflict, a transaction stalls and either retries or aborts HYBRID conflict resolution policy favors older transactions Hardware platform Software single CMP system (configurations up to 32 cores) private L1 caches (64KB), shared L2 cache (2MB) Pthreads for threading and synchronization Simics magic instructions to simulate idealized barriers Sun Studio 12 C compiler (-xo3) Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

19 Graphs Three graph families Random: G (n, m) model SSCA#2: cliques with varying size (1 C ) connected with probability P R-MAT: power-law degree distributions GTgraph graph generator Fixed #nodes (1K), varying density sparse ( 1K edges) medium ( 1K edges) dense ( 2K edges) Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

20 Speedups ssca2-1x2351 ssca2-1x1153 rmat-1x perfbar+cgs-lock perfbar+cgs-tm perfbar+fgs-lock perfbar+fgs-tm helper perfbar+cgs-lock perfbar+cgs-tm perfbar+fgs-lock perfbar+fgs-tm helper perfbar+cgs-lock perfbar+cgs-tm perfbar+fgs-lock perfbar+fgs-tm helper rmat-1x1 rand-1x1 rand-1x perfbar+cgs-lock perfbar+cgs-tm perfbar+fgs-lock perfbar+fgs-tm helper Helper-Threading perfbar+cgs-lock perfbar+cgs-tm perfbar+fgs-lock perfbar+fgs-tm helper perfbar+cgs-lock perfbar+cgs-tm perfbar+fgs-lock perfbar+fgs-tm helper speedups in 6 out of 9 cases (not all shown), up to 1.46 performance improves with increasing density main thread not obstructed by helpers (<1% abort rate in all cases) FGMT with TM speedups only with perfect barriers optimistic parallelism does exist in concurrent queue updates Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

21 Conclusions FGMT conventional synchronization mechanisms incur unacceptable overhead TM reduces overheads and highlights the existence of parallelism, but still requires very efficient barriers to offer some speedup HT with TM exposes more parallelism and eliminates barrier synchronization noteworthy speedups with minimal code extensions Future work more aggressive parallelization schemes dynamic adaptation of helper threads to algorithm s execution phases explore impact of TM characteristics applicability of HT on other SSSP algorithms ( -stepping, Bellman-Ford) and other similar ( greedy ) applications Anastopoulos et al. (NTUA) MTAAP 9 May 29, 29 1 / 27

22 Thank you! Questions? Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

23 BACKUP SLIDES Anastopoulos et al. (NTUA) MTAAP 9 May 29, 29 2 / 27

24 DecreaseKey Input : min-priority queue Q, vertex u, new key value for vertex u Q[u] value; i u; while (parent(i).key value) do Begin-Xact if swap(u, i) then i parent(i); End-Xact end Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

25 Distribution of DecreaseKey s between main and helper threads (a) 1Kx1K (b) 1Kx2K (c) 1Kx1K % DecreaseKey ops w.r.t. serial execution serial HT-main HT-helper HT-all % DecreaseKey ops w.r.t. serial execution serial HT-main HT-helper HT-all % DecreaseKey ops w.r.t. serial execution serial HT-main HT-helper HT-all Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

26 Overall and main thread transaction abort rates (a) 1Kx1K (b) 1Kx2K (c) 1Kx1K 3 25 overall main thread overall main thread..7 overall main thread Abort rate Abort rate Abort rate Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

27 Breakdown of main thread s total cycles (a) 1Kx1K (b) 1Kx2K (c) 1Kx1K Main thread cycle breakdown 1.6e+7 1.4e+7 1.2e+7 1e+7 e+6 6e+6 4e+6 2e+6 total cycles non-xact cycles xact cycles xact overhead cycles Main thread cycle breakdown 6e+7 5e+7 4e+7 3e+7 2e+7 1e+7 total cycles non-xact cycles xact cycles xact overhead cycles Main thread cycle breakdown 1.6e+ 1.4e+ 1.2e+ 1e+ e+7 6e+7 4e+7 2e+7 total cycles non-xact cycles xact cycles xact overhead cycles Non-transactional (non-xact), transactional (xact) and overhead (xact-overhead) Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

28 Percentage of useful (xact) and wasted (xact-overhead) transactional cycles (a) 1Kx1K (b) 1Kx2K (c) 1Kx1K %Cycle breakdown for average transaction xact cycles xact overhead cycles %Cycle breakdown for average transaction xact cycles xact overhead cycles %Cycle breakdown for average transaction xact cycles xact overhead cycles Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

29 Write-set sizes Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

30 The timeline of execution Cycles of main thread (per 1 iterations) 3.5e+6 3e+6 2.5e+6 2e+6 1.5e+6 1e+6 5 serial HT-2thr HT-4thr HT-best Outer loop iterations Anastopoulos et al. (NTUA) MTAAP 9 May 29, / 27

CSLab. Utilizing Underlying Synchronization Mechanisms for Efficient Support of Different Programming Models. Nikos Anastopoulos.

CSLab. Utilizing Underlying Synchronization Mechanisms for Efficient Support of Different Programming Models. Nikos Anastopoulos. Utilizing Underlying Synchronization Mechanisms for Efficient Support of Different Programming Models Nikos Anastopoulos Computing Systems Laboratory School of Electrical and Computer Engineering National