Implementation of BFS on shared memory (CPU / GPU)

1 Kolganov A.S., MSU

2 Outline: The BFS algorithm; Graph500 && GGraph500; Implementation of BFS on shared memory (CPU / GPU); Predicted scalability

4 Breadth-first search is one of the most important and fundamental graph-processing algorithms; Algorithmic difficulties of BFS: very few computations; irregular memory access.

6 The BFS algorithm is used for ranking supercomputers (TEPS: traversed edges per second); the MTEPS/W metric is used for ranking in the GreenGraph500 list of energy-efficient supercomputers; neither list has been filled yet: Graph500 (201 positions in the list); GreenGraph500 (63 positions in the list).

7 Benchmark steps: generating edges; building a graph from the edges (timed, included in the table); generating 64 random vertices; for each vertex: running the BFS algorithm (timed, included in the rating) and checking the result; printing the resulting information.
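The timed BFS step above can be sketched as a small driver. This is a minimal illustration, not the reference benchmark code: `time_bfs_runs` and the `run_bfs` callback are hypothetical names, and the callback is assumed to return the number of edges it traversed.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <vector>

// TEPS for a single run: traversed edges divided by elapsed seconds.
double teps(std::uint64_t traversed_edges, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(traversed_edges) / seconds;
}

// Hypothetical driver: only the BFS call itself is timed for each start vertex.
// run_bfs(root) is assumed to return the number of traversed edges.
std::vector<double> time_bfs_runs(
        const std::vector<std::uint32_t>& roots,
        const std::function<std::uint64_t(std::uint32_t)>& run_bfs) {
    std::vector<double> results;
    results.reserve(roots.size());
    for (std::uint32_t root : roots) {
        const auto t0 = std::chrono::steady_clock::now();
        const std::uint64_t edges = run_bfs(root);
        const auto t1 = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(t1 - t0).count();
        seconds = std::max(seconds, 1e-9);  // guard against a zero clock reading
        results.push_back(teps(edges, seconds));
    }
    return results;
}
```

For example, `teps(1000, 0.5)` gives 2000 traversed edges per second.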

8 [Figure; labels: nodes, cores, scale]

9 [Figure; labels: nodes, cores, scale; power annotations: 12,6 MW, 7,8 MW, 3,9 MW, 17,8 MW, x2, x2]

10 Big DATA:
Rank MTEPS/W Machine Scale GTEPS Nodes Cores
1 62,93 GraphCREST (CPU) 30 31,
,48 GraphCREST (CPU) 30 28,
,95 GraphCREST (CPU) 32 59,
,28 GraphCREST (CPU) 30 31,
,42 GraphCREST (CPU) 32 55,
Small DATA:
Rank MTEPS/W Machine Scale GTEPS Nodes Cores
1 815,68 TitanX (GPU) ,
,94 Titan (GPU) ,
,92 Colonial (GPU) ,
,42 Monty Pi-thon 26 35,
,15 GraphCREST (ARM) 20 1,
,4 GraphCREST (ARM) 20 0,
,38 EBD 21 1,

11 Big DATA: scale up to 30 (256 GB for int64 and 128 GB for int32)
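The memory figures for scale 30 can be reproduced with a short calculation. This is a sketch under two assumptions that the slides do not state explicitly: the figures refer to an edge list storing two endpoints per edge, and the graph has 16 edges per vertex (consistent with V = 2^26 and E = 2^30 quoted elsewhere in the deck). The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for an edge list with edge_factor * 2^scale edges,
// two endpoints per edge, index_bytes bytes per endpoint.
std::uint64_t edge_list_bytes(unsigned scale, unsigned edge_factor,
                              unsigned index_bytes) {
    assert(scale < 40);  // keeps the shift well inside 64 bits
    const std::uint64_t edges = static_cast<std::uint64_t>(edge_factor) << scale;
    return edges * 2u * index_bytes;
}
```

With these assumptions, `edge_list_bytes(30, 16, 8)` is 2^38 bytes (256 GB) and `edge_list_bytes(30, 16, 4)` is 2^37 bytes (128 GB), matching the slide.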

12 Big DATA: scale up to 30 (256 GB for int64 and 128 GB for int32). Small DATA 12 GB, Tesla K80 24 GB; computing scale 30 needs ~192 GB: <> x 16 = 192 GB, 4 kW peak; <Tesla K80> x 8 = 192 GB, 2.4 kW peak.

14 Phase 1: reconstruction and transformation of the graph; loading it into GPU memory. Phase 2: the main loop of the algorithm, using hybrid BFS (Top-Down + Bottom-Up). The main ideas were taken from GraphCREST: "Fast and Energy-efficient Breadth-First Search on a Single NUMA System", 2014.

15 Transformation to CSR (compressed sparse row). COO arrays: start vertex, final vertex, weights. CSR arrays: adj_ptr, adjacency, weights.
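The COO-to-CSR transformation can be sketched with a counting pass followed by a prefix sum. This is a minimal illustration, not the author's code: the struct and function names are hypothetical, and the weights array is omitted for brevity (it would be scattered with the same cursor as `adjacency`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct CsrGraph {
    std::vector<std::uint32_t> adj_ptr;    // n + 1 offsets into adjacency
    std::vector<std::uint32_t> adjacency;  // final vertex of every edge
};

// Builds CSR from COO arrays (start vertex, final vertex) for n vertices.
CsrGraph coo_to_csr(std::uint32_t n,
                    const std::vector<std::uint32_t>& start,
                    const std::vector<std::uint32_t>& final_v) {
    assert(start.size() == final_v.size());
    CsrGraph g;
    g.adj_ptr.assign(static_cast<std::size_t>(n) + 1, 0);
    g.adjacency.resize(start.size());
    // Pass 1: count the out-degree of every start vertex.
    for (std::uint32_t s : start) {
        assert(s < n);
        ++g.adj_ptr[s + 1];
    }
    // Pass 2: an inclusive prefix sum turns the counts into row offsets.
    for (std::uint32_t v = 0; v < n; ++v) g.adj_ptr[v + 1] += g.adj_ptr[v];
    // Pass 3: scatter each edge into the next free slot of its row.
    std::vector<std::uint32_t> cursor(g.adj_ptr.begin(), g.adj_ptr.end() - 1);
    for (std::size_t e = 0; e < start.size(); ++e)
        g.adjacency[cursor[start[e]]++] = final_v[e];
    return g;
}
```

The neighbors of vertex `v` are then `adjacency[adj_ptr[v]]` up to, but not including, `adjacency[adj_ptr[v + 1]]`.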

16 Global sorting of vertices by degree (number of adjacent vertices)
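A global degree sort amounts to computing a relabeling of the vertices. The sketch below assumes descending order (highest degree gets the smallest new ID); the slide does not state the direction, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Returns new_id, where new_id[old] is the rank of vertex `old` when all
// vertices are ordered by descending degree (ties keep their original order).
std::vector<std::uint32_t> degree_order(const std::vector<std::uint32_t>& adj_ptr) {
    assert(!adj_ptr.empty());
    const std::uint32_t n = static_cast<std::uint32_t>(adj_ptr.size() - 1);
    std::vector<std::uint32_t> order(n);
    std::iota(order.begin(), order.end(), 0u);
    std::stable_sort(order.begin(), order.end(),
        [&](std::uint32_t a, std::uint32_t b) {
            return adj_ptr[a + 1] - adj_ptr[a] > adj_ptr[b + 1] - adj_ptr[b];
        });
    std::vector<std::uint32_t> new_id(n);
    for (std::uint32_t rank = 0; rank < n; ++rank) new_id[order[rank]] = rank;
    return new_id;
}
```

The graph is then rebuilt with every endpoint `v` replaced by `new_id[v]`, so that frequently accessed high-degree vertices end up close together in memory.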

17 Local sorting of each vertex's neighbors by degree [Figure: neighbor lists V1, V2, ..., Vn before and after sorting]
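Local sorting reorders each adjacency list in place. A minimal sketch, assuming descending degree and CSR arrays named as on the CSR slide; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorts the neighbors of every vertex by descending degree of the neighbor.
void sort_neighbors_by_degree(const std::vector<std::uint32_t>& adj_ptr,
                              std::vector<std::uint32_t>& adjacency) {
    assert(!adj_ptr.empty() && adj_ptr.back() == adjacency.size());
    auto degree = [&](std::uint32_t v) { return adj_ptr[v + 1] - adj_ptr[v]; };
    for (std::size_t v = 0; v + 1 < adj_ptr.size(); ++v) {
        std::stable_sort(adjacency.begin() + adj_ptr[v],
                         adjacency.begin() + adj_ptr[v + 1],
                         [&](std::uint32_t a, std::uint32_t b) {
                             return degree(a) > degree(b);
                         });
    }
}
```

A plausible motivation is the Bottom-Up phase: high-degree neighbors are more likely to be in the current front, so placing them first lets the inner loop hit its `break` earlier.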

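18 Level-synchronous Top-Down: the current front (level K) builds the next-iteration front (level K+1).
    for (unsigned i = 0; i < N; ++i) {
        if (levels[i] != lvl - 1) continue;   // expand only vertices of the current front
        for (unsigned k = rind[i]; k < rind[i + 1]; ++k) {
            unsigned v = endv[k];
            if (levels[v] == 0) {             // 0 marks an unvisited vertex
                levels[v] = lvl;
                parents[v] = i;
            }
        }
    }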
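The Top-Down step can be written as a self-contained sequential function. This is a sketch, not the GPU kernel from the talk: it assumes that level 0 marks an unvisited vertex (so the root has level 1) and that only vertices with level `lvl - 1` belong to the current front. The function name and the returned counter are additions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One level-synchronous top-down step; returns the number of vertices
// placed on level `lvl`. rind/endv are the CSR offsets and edge targets.
std::uint32_t top_down_step(const std::vector<std::uint32_t>& rind,
                            const std::vector<std::uint32_t>& endv,
                            std::vector<std::uint32_t>& levels,
                            std::vector<std::uint32_t>& parents,
                            std::uint32_t lvl) {
    assert(lvl >= 2 && rind.size() == levels.size() + 1);
    assert(parents.size() == levels.size());
    std::uint32_t discovered = 0;
    for (std::uint32_t i = 0; i < levels.size(); ++i) {
        if (levels[i] != lvl - 1) continue;  // not in the current front
        for (std::uint32_t k = rind[i]; k < rind[i + 1]; ++k) {
            const std::uint32_t v = endv[k];
            if (levels[v] == 0) {            // first visit: claim the vertex
                levels[v] = lvl;
                parents[v] = i;
                ++discovered;
            }
        }
    }
    return discovered;
}
```

Calling it with `lvl = 2, 3, ...` until it returns 0 performs a complete BFS. In a parallel version, several front vertices may reach the same `v` at once, so the write to `levels[v]` and `parents[v]` needs either an atomic operation or a tolerance for any valid parent winning.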

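19 Level-synchronous Bottom-Up: every unvisited vertex looks for a parent on level K and, if it finds one, joins level K+1.
    for (unsigned i = 0; i < N; ++i) {
        if (levels[i] == 0) {                     // 0 marks an unvisited vertex
            for (unsigned k = rind[i]; k < rind[i + 1]; ++k) {
                unsigned endk = endv[k];
                if (levels[endk] == lvl - 1) {    // neighbor is in the current front
                    parents[i] = endk;
                    levels[i] = lvl;
                    break;                        // one parent is enough
                }
            }
        }
    }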
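A self-contained sequential version of the Bottom-Up step is sketched below, under the same assumptions as the slide's pseudocode (level 0 means unvisited, the root has level 1). The function name and the returned counter are illustrative additions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One level-synchronous bottom-up step; returns the number of vertices
// placed on level `lvl`. rind/endv are the CSR offsets and edge targets.
std::uint32_t bottom_up_step(const std::vector<std::uint32_t>& rind,
                             const std::vector<std::uint32_t>& endv,
                             std::vector<std::uint32_t>& levels,
                             std::vector<std::uint32_t>& parents,
                             std::uint32_t lvl) {
    assert(lvl >= 2 && rind.size() == levels.size() + 1);
    assert(parents.size() == levels.size());
    std::uint32_t discovered = 0;
    for (std::uint32_t i = 0; i < levels.size(); ++i) {
        if (levels[i] != 0) continue;           // already visited
        for (std::uint32_t k = rind[i]; k < rind[i + 1]; ++k) {
            const std::uint32_t endk = endv[k];
            if (levels[endk] == lvl - 1) {      // neighbor is in the current front
                parents[i] = endk;
                levels[i] = lvl;
                ++discovered;
                break;                          // skip the remaining edges
            }
        }
    }
    return discovered;
}
```

Each vertex writes only its own `levels[i]` and `parents[i]`, so this direction needs no atomics when parallelized; a vertex claimed in this step gets level `lvl`, not `lvl - 1`, so it cannot be mistaken for a front vertex within the same step.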

20 Hybrid algorithm: Top-Down + Bottom-Up (direction optimization)
Graph: SCALE 26, V = 2^26 (67,108,864), E = 2^30 (1,073,741,824)
Edges viewed per level (columns: Level, Top-Down, Bottom-Up, Hybrid): 0 2 2,103,840, ,206 1,766,587,029 66, ,918,235 52,677,691 52,677, ,727,195,615 12,820,854 12,820, ,557, , , ,357 21,467 21, ,
Total: Top-Down 2,103,820,036 (approximately 2 x E); Bottom-Up 3,936,072,360 (187% of Top-Down); Hybrid 65,689,… (3.12% of Top-Down)
The hybrid algorithm significantly decreases the number of edges viewed.
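The direction switch can be sketched as follows. The slides do not say which rule decides between the two directions, so the rule used here is purely an assumption for illustration: go Bottom-Up whenever the current front holds more than `n / threshold_divisor` vertices. All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct BfsResult {
    std::vector<std::uint32_t> levels;   // 0 = unvisited, root has level 1
    std::vector<std::uint32_t> parents;
    std::vector<bool> used_bottom_up;    // direction chosen for each level
};

BfsResult hybrid_bfs(const std::vector<std::uint32_t>& rind,
                     const std::vector<std::uint32_t>& endv,
                     std::uint32_t root,
                     std::uint32_t threshold_divisor = 16) {  // assumed value
    assert(!rind.empty() && threshold_divisor > 0);
    const std::uint32_t n = static_cast<std::uint32_t>(rind.size() - 1);
    assert(root < n);
    BfsResult r{std::vector<std::uint32_t>(n, 0),
                std::vector<std::uint32_t>(n, 0), {}};
    r.levels[root] = 1;
    r.parents[root] = root;
    std::uint32_t front_size = 1;
    for (std::uint32_t lvl = 2; front_size > 0; ++lvl) {
        // Assumed switching rule: a large front favors Bottom-Up.
        const bool bottom_up = front_size > n / threshold_divisor;
        std::uint32_t next = 0;
        for (std::uint32_t i = 0; i < n; ++i) {
            if (bottom_up) {
                if (r.levels[i] != 0) continue;        // only unvisited vertices
                for (std::uint32_t k = rind[i]; k < rind[i + 1]; ++k) {
                    if (r.levels[endv[k]] == lvl - 1) {
                        r.parents[i] = endv[k];
                        r.levels[i] = lvl;
                        ++next;
                        break;
                    }
                }
            } else {
                if (r.levels[i] != lvl - 1) continue;  // only front vertices
                for (std::uint32_t k = rind[i]; k < rind[i + 1]; ++k) {
                    const std::uint32_t v = endv[k];
                    if (r.levels[v] == 0) {
                        r.levels[v] = lvl;
                        r.parents[v] = i;
                        ++next;
                    }
                }
            }
        }
        r.used_bottom_up.push_back(bottom_up);
        front_size = next;
    }
    return r;
}
```

The saving reported on the slide comes from the middle levels: when most vertices are already in the front, Bottom-Up stops at the first matching neighbor instead of scanning every edge of the front.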

21 Using CUDA Dynamic Parallelism for load balancing in Top-Down; using vectorization in each thread; using aligned reordering for better memory access in Bottom-Up; using a queue in Bottom-Up at the last iterations.

22 First position in GGraph500, Small DATA: 132 GTEPS, 815 MTEPS/W, SCALE 26; second position in GGraph500, Small DATA: GTX Titan, 114 GTEPS, 540 MTEPS/W, SCALE 25; 15th position in GGraph500, Small DATA: Intel Xeon E GTEPS, 81 MTEPS/W, SCALE 27; reached GPU memory bandwidth at GB/s (50-60% of peak); reached energy consumption at 50% of peak.

24 BFS time on 1 GPU, SCALE 26: ~8.43 ms

25 Total copying GPU->HOST: ~128 MB

26 Total copying back HOST->GPU: ~2000 MB

27 BFS time on 16 GPUs, SCALE 30: ~9 ms. Total copying time, SCALE 30: ~140 ms. Predicted result: 115 GTEPS, ~100 MTEPS/W (the current 1st position has 62,93 MTEPS/W).

28 With NVLink: BFS time on 16 GPUs, SCALE 30: ~9 ms. Total copying time, SCALE 30: ~30 ms. Predicted result: 440 GTEPS, ~300 MTEPS/W.
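The predicted GTEPS figures appear to follow from dividing the number of edges by the sum of BFS time and copying time. The sketch below checks this under two assumptions not stated on the slides: the graph has 16 edges per vertex (so E = 2^34 at SCALE 30), and BFS and copying do not overlap. The function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Predicted GTEPS for a graph with edge_factor * 2^scale edges when one BFS
// costs bfs_ms of computation plus copy_ms of data transfer (not overlapped).
double predicted_gteps(unsigned scale, unsigned edge_factor,
                       double bfs_ms, double copy_ms) {
    assert(bfs_ms + copy_ms > 0.0);
    const double edges =
        std::ldexp(static_cast<double>(edge_factor), static_cast<int>(scale));
    const double seconds = (bfs_ms + copy_ms) * 1e-3;
    return edges / seconds / 1e9;
}
```

Under these assumptions, `predicted_gteps(30, 16, 9, 140)` is about 115 and `predicted_gteps(30, 16, 9, 30)` is about 440, in line with the two slides. Copying dominates both totals, which is why the faster interconnect changes the prediction so much.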

29 Alexander Kolganov, MSU
