Combinatorial Mathematics and Algorithms at Exascale: Challenges and Promising Directions

1 Combinatorial Mathematics and Algorithms at Exascale: Challenges and Promising Directions. Assefaw Gebremedhin, Purdue University (starting August 2014: Washington State University, School of Electrical Engineering and Computer Science). Joint work with Alex Pothen.

2 Scientific inquiry: EXPERIMENTAL, THEORETICAL, COMPUTATIONAL (Simulation)

3 Scientific inquiry: EXPERIMENTAL, THEORETICAL, COMPUTATIONAL (Simulation), The 4th PARADIGM (Data)

4 Complex connectedness is everywhere! The social interconnections we have; the information we consume; the technological systems we use; the economic systems we live in; the political systems we operate in; the organizations we work at; the institutions we belong to; the ecological systems around us; ourselves (cell, brain).

5 Combinatorial models and algorithms in computational sciences. Embedded in scientific computing: matrix factorizations, matchings, vertex orderings, parallel computing, independent sets, graph colorings. At the forefront of discovery: data analysis, network science. Exploring the interplay between combinatorial and numerical algorithms is crucial for developing scalable methods on HPC platforms.

6 Challenges on manycore computing (general): programming models; algorithm and data structure design; memory management; energy consumption.

7 Challenges specific to combinatorial (graph) algorithms: low available concurrency; poor data locality; irregular memory access pattern; access pattern determined only at runtime; high ratio of data access to computation.

8 Some promising algorithmic "paradigms" (for parallelizing graph algorithms): 1. Speculation-and-iteration 2. Approximate update 3. Parallelized search tree

9 1. Speculation-and-iteration. Idea: maximize concurrency by tentatively tolerating potential inconsistencies, then detect and resolve the inconsistencies later, iteratively.

10 Speculation-and-iteration example: parallelizing greedy coloring
Independent-set based (prior approaches): find a maximal independent set in parallel (Luby's algorithm). Limited success.
Speculation-and-iteration:
    ITERATIVE(G = (V, E))
    U ← V
    while U is not empty do
        1. Speculatively color vertices in U in parallel
        2. Check consistency of colors in U in parallel, store conflicts in R
        U ← R
Dataflow: fine-grain (edge-level) synchronization; no iteration. Feasible when there is hardware support for fine-grain synchronization (FGS).
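The ITERATIVE loop on slide 10 can be sketched in a few lines of Python. This is a sequential simulation under stated assumptions, not the authors' code: the function name `speculative_coloring` is hypothetical, parallel threads are modeled by letting every vertex in U read one shared snapshot of colors taken at the start of the round, and a conflict is resolved by recoloring the endpoint with the larger vertex id (the slides do not say which endpoint is recolored).

```python
def speculative_coloring(adj):
    """Color a graph by speculation and iteration.

    adj maps each vertex to an iterable of its neighbors (undirected,
    symmetric, no self-loops). Returns (color, rounds).
    """
    color = {v: None for v in adj}
    U = sorted(adj)          # vertices still to be colored
    rounds = 0
    while U:
        rounds += 1
        # Every simulated thread sees the same stale view of the colors.
        snapshot = dict(color)
        # Phase 1: speculative coloring with the smallest permissible color.
        for v in U:
            forbidden = {snapshot[w] for w in adj[v]}
            c = 0
            while c in forbidden:
                c += 1
            color[v] = c
        # Phase 2: conflict detection. Assumed rule: the endpoint with the
        # larger id is recolored, so the smallest vertex in U always keeps
        # its color and U shrinks in every round.
        R = [v for v in U
             if any(w < v and color[w] == color[v] for w in adj[v])]
        for v in R:
            color[v] = None
        U = R
    return color, rounds


if __name__ == "__main__":
    cycle = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
    colors, rounds = speculative_coloring(cycle)
    print(colors, "rounds:", rounds)
```

The snapshot models the worst case, in which no thread sees any color written during the current round. A real multithreaded run would see some of those writes and typically produce fewer conflicts.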

11 Speculation-and-iteration based coloring on distributed-memory architectures. Exploit the initial data distribution. Proceed in rounds, each having two phases: tentative coloring and conflict detection. Organize the coloring phase in supersteps. Use randomization in resolving conflicts.
[Figure: timeline of rounds; within a round, supersteps alternate with communication, and the round ends with conflict detection.]
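The round and superstep structure on slide 11 can be illustrated with a small single-process simulation. Everything below is a sketch under assumptions: the function name, the `owner` map (vertex to process id), the `superstep` chunk size, and the use of one random priority per vertex to pick the loser of a conflict are all hypothetical choices made for illustration, and no real message passing takes place.

```python
import random


def distributed_speculative_coloring(adj, owner, superstep=2, seed=0):
    """Simulate round-based distributed coloring; returns (color, rounds)."""
    rng = random.Random(seed)
    prio = {v: rng.random() for v in adj}      # random conflict tie-breaker
    procs = sorted(set(owner.values()))
    color = {v: None for v in adj}
    todo = {p: sorted(v for v in adj if owner[v] == p) for p in procs}
    rounds = 0
    while any(todo.values()):
        rounds += 1
        visible = dict(color)                  # remote colors last received
        colored = []
        longest = max(len(vs) for vs in todo.values())
        for start in range(0, longest, superstep):
            # Superstep: each process colors a chunk of its own vertices,
            # using exact local colors and possibly stale remote colors.
            for p in procs:
                for v in todo[p][start:start + superstep]:
                    forbidden = {color[w] if owner[w] == p else visible[w]
                                 for w in adj[v]}
                    c = 0
                    while c in forbidden:
                        c += 1
                    color[v] = c
                    colored.append(v)
            visible = dict(color)              # communicate boundary colors
        # Conflict detection: of two equal-colored neighbors, the one with
        # the lower random priority is recolored in the next round.
        losers = [v for v in colored
                  if any(color[w] == color[v] and (prio[w], w) > (prio[v], v)
                         for w in adj[v])]
        for v in losers:
            color[v] = None
        todo = {p: [v for v in losers if owner[v] == p] for p in procs}
    return color, rounds
```

A larger `superstep` means fewer communication phases but more stale reads, so more conflicts are likely. A smaller value trades extra communication for fewer recolorings.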

12 Sample experimental results: distributed memory. Distance-1 coloring of a 5-point grid graph (32K by 32K grid, 2D distributed) on IBM Blue Gene/P.
[Figure: actual versus ideal compute time in seconds. Weak scaling: grid dimensions up to 32,000 x 32,000 on 1,024 to 16,384 processors. Strong scaling (log scale): 1,024 to 16,384 processors.]
Catalyurek, Dobrian, Gebremedhin, Halappanavar, Pothen. IPDPS 2011

13 Study on multithreaded platforms
Intel Nehalem: two quad-core chips; two hyperthreads per core; private L1 and L2 cache, shared L3 cache.
Sun Niagara 2: two 8-core sockets; 8 hardware threads per socket; L1 cache on core, shared L2 cache.
Cray XMT: 128 processors; 128 hardware thread streams per processor; cache-less, globally accessible shared memory; hardware support for fine-grain synchronization.
[Figure: block diagrams of the three platforms. Nehalem: cores with two hyperthreads (HT0, HT1), per-core L2 caches, shared L3 cache, memory controllers, QPI. Niagara 2: 8x9 cache crossbar, shared L2 cache (8 banks of 512 KB), memory controllers. Cray XMT: processors 0 to 127 connected by a 3D torus network, 8 GB memory bank each, shared global virtual memory (8 GB x 128 = 1 TB) with hardware shuffling at 64-byte granularity.]
Catalyurek, Feo, Gebremedhin, Halappanavar, Pothen. Parallel Computing, 2012

14 Experimental results: distance-1 coloring
Cray XMT: small-world graphs with 134M to 1B edges; Iterative and Dataflow algorithms. Niagara 2 and Nehalem: small-world graph with $2^{24}$ = 16M vertices and 134M edges; Iterative algorithm.
[Figure: time (seconds) versus number of processors (Cray XMT) and number of cores (Niagara 2, Nehalem).]

15 2. Approximate update. Idea: minimize synchronization cost by opting for concurrent data structure updates with approximate data instead of serialized data structure updates with exact data.

16 Approximate update example: Smallest Last ordering
Ordering property (Smallest Last): for $i = n$ down to $1$, $v_i$ has the smallest back degree in $V \setminus \{v_n, v_{n-1}, \ldots, v_{i+1}\}$.
[Figure: vertices $v_1, v_2, \ldots, v_i, \ldots, v_{n-1}, v_n$ laid out in order, with the back degree and forward degree of $v_i$ marked.]
$B_\pi(G)$: maximum back degree in the ordering $\pi$.
$B^*(G) = \min_\pi B_\pi(G) = B_{SL}(G)$ (min among $n!$ possibilities).
$\delta^*(G)$: maximum minimum degree in an induced subgraph of $G$ (max among $2^n$ possibilities).
$B_{SL}(G) = \delta^*(G)$
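A serial Smallest Last ordering can be written with a bin array indexed by current degree, and the identity $B_{SL}(G) = \delta^*(G)$ from slide 16 can then be checked on a small graph by brute force. The sketch below is illustrative only: the function names are hypothetical, ties are broken by smallest vertex id (an assumption), and the brute-force check enumerates all vertex subsets, so it is usable only for tiny graphs.

```python
from itertools import combinations


def smallest_last_order(adj):
    """Return (order, max_back_degree) for a Smallest Last ordering.

    adj maps each vertex to a list of neighbors (undirected, symmetric).
    order[0] is v_1 and order[-1] is v_n, the vertex removed first.
    """
    deg = {v: len(adj[v]) for v in adj}
    bins = [set() for _ in range(max(deg.values(), default=0) + 1)]
    for v, d in deg.items():
        bins[d].add(v)
    removed, removal = set(), []
    max_back, d = 0, 0
    for _ in range(len(adj)):
        d = max(d - 1, 0)        # the minimum degree drops by at most one
        while not bins[d]:
            d += 1
        v = min(bins[d])         # tie-break by id (assumption)
        bins[d].remove(v)
        removed.add(v)
        removal.append(v)
        max_back = max(max_back, d)   # degree at removal = back degree
        for w in adj[v]:
            if w not in removed:
                bins[deg[w]].remove(w)
                deg[w] -= 1
                bins[deg[w]].add(w)
    return removal[::-1], max_back


def max_min_degree_bruteforce(adj):
    """delta*(G): maximum over induced subgraphs of the minimum degree."""
    verts, best = list(adj), 0
    for k in range(1, len(verts) + 1):
        for subset in combinations(verts, k):
            s = set(subset)
            best = max(best, min(sum(w in s for w in adj[v]) for v in subset))
    return best


if __name__ == "__main__":
    # K4 on vertices 0..3 plus a pendant path 0-4-5.
    g = {0: [1, 2, 3, 4], 1: [0, 2, 3], 2: [0, 1, 3],
         3: [0, 1, 2], 4: [0, 5], 5: [4]}
    order, b_sl = smallest_last_order(g)
    print(order, b_sl, max_min_degree_bruteforce(g))
```

The bin array is what makes the serial algorithm linear in the number of edges, and it is the same structure the parallel variants on the next slide have to share or replicate.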

18 Parallelizing SL ordering. Considered two approaches.
Approach 1: Regular. Parallelizes the ordering while closely maintaining serial behavior. Maintains a global bin array B and local (per-thread) bin arrays B_k, for k = 1 to p. Needs to deal with three potential problems ("race conditions"):
    a pair of vertices in an extreme bin are adjacent to each other;
    removal of multiple vertices from the same bin;
    addition of multiple vertices to the same bin.
Approach 2: Relaxed. Settle for an approximate solution in favor of increased concurrency. Works with only local bin arrays. In updating the locations of vertices in the bin structure, approximate dynamic degrees are used.
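One way to picture the Relaxed approach is the sequential simulation below. It is a guess at the flavor of the method, not the authors' implementation: the function names, the round-robin assignment of vertices to `p` simulated threads, and the specific rule that makes degrees approximate (a removal only decrements the degrees of neighbors owned by the same thread, so no cross-thread bin update is ever needed) are all assumptions made for illustration.

```python
def relaxed_smallest_last(adj, p):
    """Simulate a relaxed SL ordering with p threads and local bins only."""
    verts = sorted(adj)
    owner = {v: i % p for i, v in enumerate(verts)}   # assumed partition
    approx_deg = {v: len(adj[v]) for v in verts}
    max_deg = max(approx_deg.values(), default=0)
    local_bins = [[set() for _ in range(max_deg + 1)] for _ in range(p)]
    for v in verts:
        local_bins[owner[v]][approx_deg[v]].add(v)
    removal = []
    while len(removal) < len(verts):
        for k in range(p):                 # one removal per simulated thread
            bins = local_bins[k]
            d = next((i for i, b in enumerate(bins) if b), None)
            if d is None:
                continue                   # this thread has no vertices left
            v = min(bins[d])
            bins[d].remove(v)
            approx_deg[v] = -1             # mark as removed
            removal.append(v)
            for w in adj[v]:
                # Approximate update: only neighbors owned by this thread
                # move to a lower bin; remote neighbors keep stale degrees.
                if owner[w] == k and approx_deg[w] >= 0:
                    bins[approx_deg[w]].remove(w)
                    approx_deg[w] -= 1
                    bins[approx_deg[w]].add(w)
    return removal[::-1]


def max_back_degree(adj, order):
    """Largest number of neighbors that precede a vertex in the ordering."""
    pos = {v: i for i, v in enumerate(order)}
    return max((sum(1 for w in adj[v] if pos[w] < pos[v]) for v in order),
               default=0)
```

With `p = 1` the simulation reduces to the exact serial ordering. For larger `p` the maximum back degree of the result can only be equal to or larger than $B_{SL}(G)$, since $B_{SL}(G)$ is the minimum over all orderings.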

19 Experimental results: ordering, scalability, g-graphs
[Figure: time relative to time using 1 thread (%) versus number of threads, for graphs g1 to g5; one panel for SL-Regular, one for SL-Relaxed.]
Patwary, Gebremedhin, Pothen. EuroPar 2011

20 3. Parallelized search tree. Idea: in a branch-and-bound algorithm, exchange bounds among processors immediately so as to realize superlinear speedup.

21 Parallelized search tree example: Clique algorithms and application. Developed a fast branch-and-bound algorithm for finding a maximum clique. Algorithm applied to: analyze large-scale social and information networks; compute strongly connected components in temporal networks. WWW 2014. Collaborators: Gleich and Rossi (Purdue).

22 Parallelized search tree example: Clique algorithms and application (continued)
[Figure: labels log Runtime, ω/ω, and log(|V| + |E|); network classes: bio, collab, inter, retweet, tech, web, facebook, social.]

23 Parallelized search tree example: Clique algorithms and application (continued). Superlinear speedup due to the parallelized search tree.
[Figure: speedup versus number of processors; legend entries: brock400 4 (331), san (1), san (0.2), brock800 4 (3604), brock400 3 (619), p hat (4), san1000 (1).]
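The idea on slides 20 to 23 can be sketched as a branch-and-bound maximum clique search in which every worker prunes against one shared incumbent. This is a minimal illustration, not the PMC implementation: the function name, the split of the search tree by root vertex, and the simple size bound (current clique plus remaining candidates) are assumptions, and Python threads will not show real speedup because of the global interpreter lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def parallel_max_clique(adj, workers=4):
    """Return one maximum clique of an undirected graph as a list."""
    nbrs = {v: set(ws) for v, ws in adj.items()}
    best = []                    # shared incumbent, visible to all workers
    lock = threading.Lock()

    def expand(clique, cand):
        nonlocal best
        if not cand:
            with lock:           # publish an improved bound immediately
                if len(clique) > len(best):
                    best = list(clique)
            return
        while cand:
            # Bound: even taking every candidate cannot beat the incumbent.
            # Reading a slightly stale bound is safe; it only prunes less.
            if len(clique) + len(cand) <= len(best):
                return
            v = cand.pop()
            expand(clique + [v], cand & nbrs[v])

    order = sorted(nbrs)
    rank = {v: i for i, v in enumerate(order)}

    def root(v):
        # Each subtree holds the cliques whose lowest-ranked vertex is v.
        expand([v], {w for w in nbrs[v] if rank[w] > rank[v]})

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(root, order))
    return best


if __name__ == "__main__":
    g = {0: [1, 2, 3, 4], 1: [0, 2, 3], 2: [0, 1, 3],
         3: [0, 1, 2], 4: [0, 5], 5: [4]}
    print(sorted(parallel_max_clique(g)))
```

Superlinear speedup becomes possible because a large clique found early in one subtree tightens the bound for every other subtree, so the parallel run can explore fewer nodes in total than the serial run.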

24 Related libraries
ColPack: a package consisting of implementations of algorithms for a variety of graph coloring, vertex ordering and related problems, in support of sparse Jacobian and Hessian computation via Automatic Differentiation.
MTCOL: multithreaded codes for select graph coloring and vertex ordering problems.
Parallel Maximum Clique Finder (PMC): a fast parallel (shared-memory) implementation for finding maximum cliques in large sparse networks.
For further info visit:

25 Broader themes: local computation algorithms; concurrent data structures; resilience.

26 Funding acknowledgements: DOE Office of Science (current); CSCAPES (SciDAC-2); NSF

27 Thank you!
