Parallel Graph Coloring for Many-core Architectures


1 Parallel Graph Coloring for Many-core Architectures
Mehmet Deveci, Erik Boman, Siva Rajamanickam
Sandia National Laboratories

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

2 Outline
- Performance-portable graph coloring algorithms and implementations that run on multi- and many-core platforms (e.g. Xeon Phi, GPUs), using the Kokkos library
- Study of implementation issues for vertex-based methods
- A new edge-based coloring algorithm
- Empirical comparisons on a range of graphs on both Xeon Phi and GPU
- Demonstrated impact of coloring on real solver applications
Paper: M. Deveci, E.G. Boman, K.D. Devine, and S. Rajamanickam, Parallel Graph Coloring for Manycore Architectures, IPDPS 2016, to appear

3 Problem Definition
- Given a graph $G = (V, E)$ with vertices $v \in V$ and edges $(v_1, v_2) \in E$, where $v_1, v_2 \in V$
- Distance-1 graph coloring assigns colors to vertices so that each vertex has a different color from all of its neighbors: $C : V \to \mathbb{N}$ with $C(v_1) \neq C(v_2)$ for all $(v_1, v_2) \in E$
- The number of distinct colors assigned to the vertices: $|C|$
- The graph coloring problem that minimizes $|C|$ is NP-hard [Zuckerman, 2006]
- Applications: parallel computation, Jacobian computation, register allocation
(Image courtesy of Sariyuce, Saule, Catalyurek, SIAM PP, 2012)
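The distance-1 condition on this slide can be checked mechanically. The following is a minimal Python sketch, assuming the graph is given as an edge list and the coloring as a dict from vertex to color; the function names are illustrative and do not come from the talk.

```python
def is_valid_distance1_coloring(edges, color):
    """Return True if no edge joins two vertices that share a color."""
    return all(color[u] != color[v] for u, v in edges)


def num_colors(color):
    """Number of distinct colors used, i.e. |C| in the slide's notation."""
    return len(set(color.values()))


# A triangle (0, 1, 2) with a pendant vertex 3 attached to vertex 2.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
coloring = {0: 1, 1: 2, 2: 3, 3: 1}
bad_coloring = {0: 1, 1: 1, 2: 3, 3: 1}
```

The triangle forces three colors here, so `coloring` is both valid and optimal for this small graph.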

4 Graph Coloring Heuristics
- Simple greedy heuristics often obtain near-optimal solutions
- First-fit [Matula, 1972], with $O(|V| + |E|)$ complexity
  - Keeps a Forbidden array to store the colors of neighbors
  - Obtains $|C| \le \delta + 1$, where $\delta$ is the maximum degree in the graph
- Parallel implementations
  - Speculative method [Gebremedhin and Manne, 2000], [Bozdag, 2008]
  - [Jones and Plassmann, 1993]: parallelization of [Luby, 1986]
  - Distributed implementations: [Catalyurek, 2012]
  - Hybrid MPI+OpenMP implementations: [Sariyuce, 2012]
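The first-fit heuristic with a Forbidden array can be sketched as follows. This is a sequential illustration under assumptions, not the authors' code: the graph is an adjacency list, colors start at 1, 0 means uncolored, and the name `first_fit_coloring` is made up for this example.

```python
def first_fit_coloring(adj):
    """Greedy first-fit coloring; adj[v] lists the neighbors of vertex v."""
    n = len(adj)
    color = [0] * n  # 0 = uncolored
    max_deg = max((len(nbrs) for nbrs in adj), default=0)
    # forbidden[c] == v means color c is used by some neighbor of v.
    # Stamping with the vertex id avoids clearing the array per vertex.
    forbidden = [-1] * (max_deg + 2)
    for v in range(n):
        for u in adj[v]:
            forbidden[color[u]] = v
        c = 1
        while forbidden[c] == v:  # smallest color not used by a neighbor
            c += 1
        color[v] = c
    return color


triangle = [[1, 2], [0, 2], [0, 1]]
path = [[1], [0, 2], [1]]
```

Each vertex scans its adjacency once and then at most deg(v) + 1 colors, which gives the linear bound quoted on the slide and the $\delta + 1$ bound on the number of colors.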

5 Many-core Coloring Heuristics
- Yet little work has been done on many-core architectures
- Xeon Phi: speculative method (IPGC) [Saule, 2012]
  - Speculatively color vertices in each thread
  - Detect conflicts caused by race conditions and recolor the affected vertices
- GPUs: NVIDIA cuSPARSE [Naumov, 2015]
  - Relaxation of Jones and Plassmann (JP), based on independent sets
  - Highly parallel, runs fast
  - But the number of colors found is usually very high
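The speculate, detect, recolor loop can be emulated sequentially. The sketch below is an assumption-laden illustration rather than IPGC itself: the race is modeled by letting every vertex in a round read the colors from before that round, and the tie-break (the higher-numbered endpoint is recolored) is chosen here for simplicity.

```python
def speculative_coloring(adj):
    """Sequential emulation of speculative coloring; returns (colors, rounds)."""
    n = len(adj)
    color = [0] * n  # 0 = uncolored
    worklist = list(range(n))
    rounds = 0
    while worklist:
        rounds += 1
        snapshot = color[:]  # all "threads" see the state before this round
        # Phase 1: speculative first-fit coloring of every vertex in the work list.
        for v in worklist:
            taken = {snapshot[u] for u in adj[v]}
            c = 1
            while c in taken:
                c += 1
            color[v] = c
        # Phase 2: conflict detection; the higher-numbered endpoint loses.
        conflicts = [v for v in worklist
                     if any(u < v and color[u] == color[v] for u in adj[v])]
        for v in conflicts:
            color[v] = 0
        worklist = conflicts
    return color, rounds


triangle = [[1, 2], [0, 2], [0, 1]]
colors, rounds = speculative_coloring(triangle)
```

The lowest-numbered vertex in the work list always keeps its color, so every round makes progress and the loop terminates.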

6 Vertex-Based Coloring Algorithms
- The minimum unit of work is a vertex, so each vertex is owned by a single thread: IPGC, cuSPARSE
- Some implementation details are often ignored, e.g. the requirement of a thread-private Forbidden array
  - Its $O(\delta)$ size can be a problem on highly irregular graphs, or when the number of threads is high
- Optimization: limit the size of the Forbidden array
  - e.g. with constant size 32 (called VB)
  - Traverse the adjacency multiple times: first for the vertices with colors 1-32, then for colors 33-64, and so on
  - On GPUs this array can be stored in slow local memory
- Use the bits of a single int (called VBBIT)
  - Requires conversion back and forth to the bit representation
  - Stored in registers on the GPU rather than in slow memory
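The windowed Forbidden array and its single-integer form can be illustrated together. This is a hedged sketch of the idea only: the function name `pick_color_vbbit` and the 32-color window constant are assumptions for the example, and a Python int stands in for a 32-bit register.

```python
WINDOW = 32
MASK = (1 << WINDOW) - 1


def pick_color_vbbit(neighbor_colors):
    """Smallest color >= 1 not in neighbor_colors (0 = uncolored neighbor).

    One 32-bit word holds the forbidden colors of the current window, so the
    adjacency is re-traversed once per window of 32 colors.
    """
    offset = 0
    while True:
        forbidden = 0
        for c in neighbor_colors:  # one pass over the adjacency per window
            if offset < c <= offset + WINDOW:
                forbidden |= 1 << (c - offset - 1)
        free = ~forbidden & MASK  # bits of colors that are still available
        if free:
            lowest = free & -free  # isolate the lowest set bit
            return offset + lowest.bit_length()
        offset += WINDOW  # all 32 colors taken: move to the next window
```

For example, `pick_color_vbbit([1, 2, 4])` finds color 3 in the first pass, while a vertex whose neighbors hold colors 1 to 32 needs a second pass and gets color 33.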

7 Edge-Based Coloring (EB)
- Threads traverse edges and have only partial information about the vertices
- 3 phases
  - Assign Colors: vertex-based
  - Detect Conflicts: edge-based
  - Forbid Colors: edge-based
- Per-vertex Forbidden sets

8 Edge-Based Coloring: Round 2
- Threads traverse edges and have only partial information
- 3 phases: Assign Colors, Detect Conflicts, Forbid Colors

9 Edge-Based Coloring: Round 3
- Threads traverse edges and have only partial information
- 3 phases: Assign Colors, Detect Conflicts, Forbid Colors
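The three phases can be emulated sequentially to show how several rounds arise. The sketch below is a simplified illustration, not the authors' implementation: it assumes an edge list without self-loops, uses Python sets as the per-vertex Forbidden sets, and resolves each conflict by uncoloring the higher-numbered endpoint, which is a choice made for this example.

```python
def edge_based_coloring(n, edges):
    """Sequential emulation of edge-based coloring; returns (colors, rounds)."""
    color = [0] * n  # 0 = uncolored
    forbidden = [set() for _ in range(n)]
    rounds = 0
    while 0 in color:
        rounds += 1
        # Phase 1, vertex-based: take the smallest color not in the Forbidden set.
        for v in range(n):
            if color[v] == 0:
                c = 1
                while c in forbidden[v]:
                    c += 1
                color[v] = c
        # Phase 2, edge-based: an edge with equal end colors is a conflict.
        losers = {max(u, v) for u, v in edges if color[u] == color[v]}
        for v in losers:
            color[v] = 0
        # Phase 3, edge-based: a colored end forbids its color on an uncolored end.
        for u, v in edges:
            if color[u] and not color[v]:
                forbidden[v].add(color[u])
            elif color[v] and not color[u]:
                forbidden[u].add(color[v])
    return color, rounds


colors, rounds = edge_based_coloring(3, [(0, 1), (1, 2), (0, 2)])
```

On a triangle this takes three rounds even though nothing runs concurrently, because vertices colored in the same round cannot see each other's choices.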

10 Edge-Based Coloring Problems
- Memory: one Forbidden array per vertex, $O(|V| \delta)$
- Convergence issues: conflicts occur even when there are no race conditions
- High number of edge traversals: $|E| \gg |V|$

11 EB Optimizations: Memory, $O(2|V|)$
- Represent colors with 2 integers
  - The index of the set bit denotes the color, e.g. color 10 is the integer with only its 10th bit set
  - To represent more than 32 colors, use color sets (CS): CS(0) = [1,32], CS(1) = [33,64]
  - Initially CS(v) = 0, Color(v) = 0
- Each vertex's Forbidden array can be represented with a single integer
  - In each iteration, a thread writes to Forbidden with an atomic OR, and only if CS(v_1) = CS(v_2)
- Assign Colors -> O(1) with two's complement
  - If no color is available in the current CS, increment CS and look for colors in the next CS
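The two-integer representation and the constant-time color pick can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, a plain `|=` stands in for the atomic OR, and the choice to clear Forbidden when moving to the next color set is made for this example.

```python
BITS = 32
MASK = (1 << BITS) - 1


def decode(color_set, color_bits):
    """Map (color set, one-hot color word) to a color number; 0 = uncolored."""
    if color_bits == 0:
        return 0
    return color_set * BITS + color_bits.bit_length()


def forbid(forbidden, cs, color_bits, v, u):
    """Record u's color in v's Forbidden word if both use the same color set."""
    if cs[v] == cs[u]:
        forbidden[v] |= color_bits[u]  # stands in for an atomic OR


def assign(forbidden, cs, color_bits, v):
    """Pick the lowest free color of v's color set, or advance to the next set."""
    free = ~forbidden[v] & MASK
    if free == 0:  # current color set exhausted
        cs[v] += 1
        forbidden[v] = 0
        color_bits[v] = 0
    else:
        color_bits[v] = free & -free  # lowest set bit via two's complement


# Vertex 0 holds color 1; vertex 1 is uncolored and adjacent to vertex 0.
cs = [0, 0]
color_bits = [1, 0]
forbidden = [0, 0]
forbid(forbidden, cs, color_bits, 1, 0)
assign(forbidden, cs, color_bits, 1)
```

The expression `free & -free` keeps only the lowest set bit of `free`, which is why no loop over candidate colors is needed.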

12 EB Optimizations: Convergence
- Coloring decisions ignore the colors assigned in the current iteration, which causes many conflicts regardless of race conditions
- Tentative coloring for the edges that have 2 uncolored vertices: a thread tentatively colors one of them and forbids that color on the other end

13 EB Optimizations: Edge Filtering
- $|E| \gg |V|$, so it is necessary to keep a work list during the algorithm's execution
- An edge can be filtered out if both of its end vertices are colored; see the paper for other cases
- Tried atomics and a parallel prefix sum (PPS) to create the work list
- PPS was better, and it is used for the rest of the experiments
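Building the work list with a prefix sum amounts to a stream compaction. The sketch below shows the idea sequentially, assuming conflicts have already been resolved so that an edge with two colored ends is finished; `filter_edges` is an illustrative name, and `itertools.accumulate` stands in for the parallel scan.

```python
from itertools import accumulate


def filter_edges(edges, color):
    """Compact the edge work list, dropping edges whose two ends are both colored."""
    # Flag = 1 for edges that still need work, 0 for edges that can be filtered.
    keep = [0 if (color[u] != 0 and color[v] != 0) else 1 for u, v in edges]
    inclusive = list(accumulate(keep))
    # Exclusive scan: the slot of each surviving edge in the new work list.
    offsets = [total - flag for total, flag in zip(inclusive, keep)]
    worklist = [None] * (inclusive[-1] if inclusive else 0)
    for i, edge in enumerate(edges):  # each write is independent of the others
        if keep[i]:
            worklist[offsets[i]] = edge
    return worklist


edges = [(0, 1), (1, 2), (2, 3)]
color = [1, 2, 0, 0]  # vertices 2 and 3 are still uncolored
remaining = filter_edges(edges, color)
```

Because every surviving edge knows its output slot in advance, the final writes need no atomics, which is one plausible reason a scan can beat an atomic counter.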

14 Experiments
- Publicly available implementations are in the KokkosKernels (TpetraKernels) package of Trilinos
- Experiments on the compton (Intel Xeon Phi, MIC) and shannon (GPU) clusters of Sandia National Labs
  - KNC MIC: 57 cores, 4 hyperthreads per core at 1.1 GHz, 6 GB memory (always run with 228 threads), icc; compared against IPGC [Saule, 2012]
  - NVIDIA Tesla K20X GPUs: CC 3.5, 6 GB memory, CUDA; compared against cuSPARSE
- The experiments study
  - Performance of graph coloring: coloring execution time
  - Coloring quality: number of colors
  - The effect of coloring on the execution time of a real application

15 Experiment Matrices
[Table: test graphs, including a measure of how irregular each graph is]

16 Coloring on GPU
- As irregularity increases, the VB variants suffer from less coalesced memory access and load imbalance
- EB execution time is lower than cuSPARSE only for the irregular graphs, but its coloring quality is always better: up to 48x fewer colors on circuit
- VBBIT helps w.r.t. VB
- EB on GPU: 1.49 speedup w.r.t. cuSPARSE, 5.36 w.r.t. VBBIT
- Geometric mean of # colors normalized w.r.t. sequential: cuSPARSE 3.57, EB 1.06, VB 1.16, VBBIT 1.16

[Figure: speedup w.r.t. cuSPARSE (log scale, higher is better) and normalized # colors w.r.t. cuSPARSE (lower is better) for each graph; series: EB, VB, VBBIT, cuSPARSE]

EB data labels from the two charts:
Graph         Speedup   Normalized # colors
circuit       63.87     0.02
indochina      1.22     0.97
kron21        11.01     0.19
livejournal    1.75     0.59
hollywood      2.38     0.95
audikw         0.66     0.32
rgg24          0.41     0.39
europe         0.19     0.19
Bump           0.44     0.33
Queen          0.64     0.34

17 Coloring on MIC
- EB is worse than the VB variants: MIC has fewer threads, so it is more forgiving of load imbalance
- VB is better than IPGC: better caching
- VB is worse on graphs with lots of colors
- VBBIT does not help on MIC: caching vs. bit operations
- Speedup of VB: 1.11 w.r.t. IPGC, 4.68 w.r.t. EB
- Geometric mean of # colors normalized w.r.t. sequential: IPGC 1.13, EB 1.05, VB 1.14, VBBIT (value missing)

[Figure: speedup w.r.t. IPGC (higher is better) and normalized # colors w.r.t. IPGC (lower is better) for each graph; series: EB, VB, VBBIT, IPGC]

Speedup data labels, in graph order (circuit, indochina, kron21, livejournal, hollywood, audikw, rgg24, europe, Bump, Queen): 2.09, 0.26, 0.95, 1.08, 1.02, 1.49, 1.28, 0.84, 1.60, 1.62

18 Multi-Threaded Gauss-Seidel
- Implemented conjugate gradient with a multi-threaded Gauss-Seidel preconditioner
- Gauss-Seidel is a very sequential algorithm
- Coloring is used to find independent rows; operations can then be done in parallel for independent rows
- More colors -> more synchronization, less work in parallel regions
- Run on GPUs with the coloring results of EB and cuSPARSE for 7 problems
- Overall: 1.39 and 1.17 speedups w.r.t. cuSPARSE on Gauss-Seidel and overall solve time
- 1.94 and 1.48 speedups on circuit

[Bar chart, axis 0 to 20, units not shown: MT-GS: EB 6.84, cuSPARSE 13.3; Overall: EB 12.17, cuSPARSE 18.05]
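How a coloring exposes parallelism in Gauss-Seidel can be shown with one colored sweep. The sketch below is a small dense Python illustration, not the KokkosKernels implementation: the function name is made up, the matrix is a list of lists, and it assumes the coloring is valid for the matrix graph, so rows of one color never reference each other.

```python
def colored_gauss_seidel_sweep(A, b, x, color):
    """One Gauss-Seidel sweep over x, processing rows color by color."""
    n = len(b)
    for c in range(1, max(color) + 1):
        rows = [i for i in range(n) if color[i] == c]
        # Rows of one color are independent, so this loop could run in parallel.
        updates = {}
        for i in rows:
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            updates[i] = (b[i] - s) / A[i][i]
        # The end of each color acts as a synchronization point.
        for i, value in updates.items():
            x[i] = value
    return x


# Tridiagonal system whose exact solution is [1, 1, 1]; rows 0 and 2 are independent.
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
color = [1, 2, 1]
x = [0.0, 0.0, 0.0]
for _ in range(30):
    colored_gauss_seidel_sweep(A, b, x, color)
```

The outer loop runs once per color, which is why a coloring with fewer colors means fewer synchronization points and more work per parallel region.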

19 Summary & Future Directions
- Proposed a new parallel edge-based graph coloring algorithm
- Proposed several improvements to the traditional speculative vertex-based algorithm
- Showed that the edge-based method is usually faster than the vertex-based ones on GPUs, but slower on the Xeon Phi
- Showed that the edge-based algorithm obtains better quality colorings than the other variants: up to 48x fewer colors than cuSPARSE
- Future: distance-2 coloring


20 For more information
- KokkosKernels: download through Trilinos, http://trilinos.org
- Public git repository: http://github.com/trilinos
- For more information: mndevec@sandia.
- Thanks to: NNSA ASC program, DOE ASCR SciDAC FASTMath Institute, ATDM
- Paper: M. Deveci, E.G. Boman, K.D. Devine, and S. Rajamanickam, Parallel Graph Coloring for Manycore Architectures, IPDPS 2016, to appear

21 References
- D. Zuckerman, Linear degree extractors and the inapproximability of max clique and chromatic number, in Proc 38th ACM Symp Theory of Computing, 2006.
- D. W. Matula, G. Marble, and J. Isaacson, Graph coloring algorithms, in Graph Theory and Computing, R. Read, Ed. Academic Press, 1972.
- A. H. Gebremedhin, D. Nguyen, M. M. A. Patwary, and A. Pothen, ColPack: Software for graph coloring and related problems in scientific computing, ACM Trans Math Softw, vol. 40, no. 1, pp. 1-31.
- M. T. Jones and P. Plassmann, A parallel graph coloring heuristic, SIAM J Sci Comput, vol. 14, no. 3, 1993.
- M. Luby, A simple parallel algorithm for the maximal independent set problem, SIAM J Comput, vol. 15, no. 4, 1986.
- A. H. Gebremedhin and F. Manne, Scalable parallel graph coloring algorithms, Concurrency: Practice & Experience, vol. 12, no. 12, 2000.
- D. Bozdag, A. H. Gebremedhin, F. Manne, E. G. Boman, and U. V. Catalyurek, A framework for scalable greedy coloring on distributed memory parallel computers, J Parallel Distrib Comp, vol. 68, no. 4, 2008.
- U. V. Catalyurek, J. Feo, A. H. Gebremedhin, M. Halappanavar, and A. Pothen, Graph coloring algorithms for multi-core and massively multithreaded architectures, Parallel Computing, vol. 38, no. 10, 2012.
- A. E. Sariyuce, E. Saule, and U. V. Catalyurek, Scalable hybrid implementation of graph coloring using MPI and OpenMP, in Proc IEEE 26th Internat Parallel & Distrib Proc Symp Workshops, 2012.
- E. Saule and U. V. Catalyurek, An early evaluation of the scalability of graph algorithms on the Intel MIC architecture, in Proc 26th IEEE Internat Parallel Distrib Proc Symp Workshops, 2012.
- M. Naumov, P. Castonguay, and J. Cohen, Parallel graph coloring with applications to the incomplete-LU factorization on the GPU, NVIDIA, Tech. Rep., 2015.

22 Overall Performance
[Figure: performance profiles, panels GPU and MIC; y-axis: # of graph instances (0 to 10); x-axis: 1 to 3; series: cuSPARSE, IPGC, VB, VB-BIT, EB-PPS]
- Overall, EB has a 1.49 speedup w.r.t. cuSPARSE and 5.36 w.r.t. VBBIT on GPU
- VB has a 1.11 speedup w.r.t. IPGC and 4.68 w.r.t. EB on MIC

23 Scaling
[Figure: time (ms) per million edges vs. # edges (millions, 0 to 150), panels GPU and MIC; series: cuSPARSE, VB, EB-PPS, VB-BIT-EF, IPGC, VB-BIT, VB-EF]

24 Overall Performance: GEOMEAN

25 Coloring Results: GEOMEAN
Average number of colors over 5 executions of each algorithm

26 Coloring Time
[Figure: speedup w.r.t. cuSPARSE on GPU (log scale; series: EB, VB, VBBIT, cuSPARSE) and speedup w.r.t. IPGC on MIC (series: EB, VB, VBBIT, IPGC) for circuit, indochina, kron21, livejournal, hollywood, audikw, rgg24, europe, Bump, Queen]
- GPU data labels, in graph order: 63.87, 1.22, 11.01, 1.75, 2.38, 0.66, 0.41, 0.19, 0.44, 0.64
- MIC data labels, in graph order: 2.09, 0.26, 0.95, 1.08, 1.02, 1.49, 1.28, 0.84, 1.60, 1.62
- As the graphs become more irregular, the VB algorithms suffer from less coalesced access and load imbalance
- EB on GPU: 1.49 speedup w.r.t. cuSPARSE, 5.36 w.r.t. VBBIT
- VB on MIC: 1.11 speedup w.r.t. IPGC, 4.68 w.r.t. EB
- MIC has fewer threads, so it is more forgiving of thread load imbalance

27 Coloring Quality
[Figure: normalized # colors w.r.t. cuSPARSE on GPU (series: EB, VB, VBBIT, cuSPARSE) and w.r.t. IPGC on MIC (series: EB, VB, VBBIT, IPGC) for circuit, indochina, kron21, livejournal, hollywood, audikw, rgg24, europe, Bump, Queen]
- GPU data labels, in graph order: 0.02, 0.97, 0.19, 0.59, 0.95, 0.32, 0.39, 0.19, 0.33, 0.34

Geometric mean of # colors normalized w.r.t. sequential:
GPU                 MIC
cuSPARSE  3.57      IPGC   1.13
EB        1.06      EB     1.05
VB        1.16      VB     1.14
VBBIT     1.16      VBBIT  (value missing)

28 Multi-Threaded Gauss-Seidel
Speedups on SGS and PCG, one pair per problem:
- 1.84, 1.37
- 1.29, 1.07
- 1.25, 1.09
- 1.94, 1.48
- 1.32, 1.11
- 1.23, 1.07
- 1.10, 1.08
