Parallel Graph Coloring For Many-core Architectures
1 Parallel Graph Coloring For Many-core Architectures
Mehmet Deveci, Erik Boman, Siva Rajamanickam
Sandia National Laboratories
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
2 Outline
- Performance-portable graph coloring algorithms and implementations that run on multi- and many-core platforms (e.g. Xeon Phi, GPUs), using the Kokkos library
- Study of implementation issues for vertex-based methods
- A new edge-based coloring algorithm
- Empirical comparisons on a range of graphs on both Xeon Phi and GPU
- Demonstrated impact of coloring on real solver applications
M. Deveci, E.G. Boman, K.D. Devine, and S. Rajamanickam, Parallel Graph Coloring for Manycore Architectures, IPDPS 2016, to appear
3 Problem Definition
- Given a graph $G = (V, E)$ with vertices $v \in V$ and edges $(v_1, v_2) \in E$, where $v_1, v_2 \in V$
- Distance-1 graph coloring assigns colors to vertices so that each vertex has a different color from all of its neighbors: $C : V \to \mathbb{N}$ with $C(v_1) \neq C(v_2)$ for all $(v_1, v_2) \in E$
- The number of distinct colors assigned to the vertices is $|C|$
- The graph coloring problem that minimizes $|C|$ is NP-hard [Zuckerman, 2006]
- Applications: parallel computation, Jacobian computation, register allocation
Image courtesy of Sariyuce, Saule, Catalyurek, SIAM PP, 2012
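The distance-1 condition can be checked directly from its definition. The sketch below is only an illustration: the function names and the small example graph are assumptions, not part of the talk.

```python
def is_valid_coloring(edges, color):
    """True if C(v1) != C(v2) holds for every edge (v1, v2)."""
    return all(color[u] != color[v] for u, v in edges)


def num_colors(color):
    """|C|: the number of distinct colors in use."""
    return len(set(color.values()))


# Hypothetical example: a triangle 0-1-2 with a pendant vertex 3.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
coloring = {0: 1, 1: 2, 2: 3, 3: 1}
print(is_valid_coloring(edges, coloring), num_colors(coloring))
```

A triangle needs three colors, so this coloring is also optimal for the example graph.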
4 Graph Coloring Heuristics
- Simple greedy heuristics often obtain near-optimal solutions
  - First-fit [Matula, 1972], running in $O(|V| + |E|)$ time
  - Keeps a Forbidden array to store the colors of neighbors
  - Obtains $|C| \le \delta + 1$, where $\delta$ is the maximum degree in the graph
- Parallel implementations
  - Speculative method [Gebremedhin and Manne, 2000], [Bozdag, 2008]
  - [Jones and Plassmann, 1993], a parallelization of [Luby, 1986]
  - Distributed implementations: [Catalyurek, 2012]
  - Hybrid MPI+OpenMP implementations: [Sariyuce, 2012]
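A minimal sequential sketch of the first-fit heuristic with a Forbidden array follows. It assumes vertices are numbered 0..n-1, the graph is given as adjacency lists, and colors start at 1 with 0 meaning "uncolored"; these conventions are illustrative choices, not taken from the talk.

```python
def greedy_first_fit(adj):
    """Color vertices in index order with the smallest color not used by a neighbor."""
    n = len(adj)
    color = [0] * n                                   # 0 = uncolored
    max_deg = max((len(nbrs) for nbrs in adj), default=0)
    # forbidden[c] == v means color c is taken by some neighbor of v.
    # Stamping with the vertex id avoids clearing the array for each vertex.
    forbidden = [-1] * (max_deg + 2)
    for v in range(n):
        for u in adj[v]:
            if color[u]:
                forbidden[color[u]] = v
        c = 1
        while forbidden[c] == v:                      # first color that is not forbidden
            c += 1
        color[v] = c
    return color


adj = [[1, 2], [0, 2], [0, 1, 3], [2]]                # triangle 0-1-2 plus vertex 3
print(greedy_first_fit(adj))                          # -> [1, 2, 3, 1]
```

Because a vertex has at most $\delta$ colored neighbors, the search always stops at a color no larger than $\delta + 1$, which is where the bound on the slide comes from.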
5 Many-core Coloring Heuristics
- Yet little work has been done on many-core architectures
- Xeon Phi: speculative method (IPGC) [Saule, 2012]
  - Speculatively color vertices in each thread
  - Detect conflicts due to race conditions and recolor them
- GPUs: NVIDIA cuSPARSE [Naumov, 2015]
  - Relaxation of Jones and Plassmann (JP), based on independent sets
  - Highly parallel, runs fast
  - But the number of colors found is usually very high
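The speculate-then-repair loop can be sketched with a sequential simulation. This is not the IPGC code: the snapshot models the worst case in which no thread sees colors written by the others in the same round, and breaking ties in favor of the lower vertex id is an assumption made for the illustration.

```python
def speculative_coloring(adj):
    """Rounds of speculative coloring followed by conflict detection."""
    n = len(adj)
    color = [0] * n                      # 0 = uncolored
    worklist = list(range(n))
    rounds = 0
    while worklist:
        rounds += 1
        snapshot = color[:]              # the colors every simulated thread reads
        for v in worklist:               # phase 1: speculative first-fit
            used = {snapshot[u] for u in adj[v]}
            c = 1
            while c in used:
                c += 1
            color[v] = c
        # phase 2: a vertex loses if a lower-numbered neighbor has the same color
        conflicts = [v for v in worklist
                     if any(u < v and color[u] == color[v] for u in adj[v])]
        for v in conflicts:
            color[v] = 0                 # will be recolored in the next round
        worklist = conflicts
    return color, rounds


print(speculative_coloring([[1, 2], [0, 2], [0, 1]]))   # triangle -> ([1, 2, 3], 3)
```

The lowest-numbered vertex in the worklist always keeps its color, so the worklist shrinks every round and the loop terminates.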
6 Vertex-Based Coloring Algorithms
- The minimum unit of work is a vertex, so each vertex is owned by a single thread: IPGC, cuSPARSE
- Some implementation details are often ignored, e.g. the requirement of a thread-private Forbidden array
  - Its $O(\delta)$ size can be a problem on highly irregular graphs, or when the number of threads is high
- Optimization: limit the size of the Forbidden array, e.g. to a constant size of 32 (called VB)
  - Traverse the adjacency list multiple times: first for the vertices with colors 1-32, then for the following color ranges
  - On GPUs this array can be stored in slow local memory
- Use the bits of a single int (called VBBIT)
  - Conversion back and forth to the bit representation
  - Stored in registers on GPU rather than slow memory
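The bit-based variant can be illustrated for a single vertex as follows. This is a sketch under assumptions: colors start at 1, 0 means uncolored, and bit `i` of the 32-bit mask stands for color `offset + i + 1`; the function name is hypothetical and the code is not taken from KokkosKernels.

```python
FULL = 0xFFFFFFFF        # all 32 colors of the current window are forbidden


def color_vertex_bits(neighbor_colors):
    """Pick the smallest free color, scanning the neighbors once per 32-color window."""
    offset = 0
    while True:
        forbidden = 0
        for c in neighbor_colors:                     # one pass over the adjacency
            if offset < c <= offset + 32:             # only colors in this window
                forbidden |= 1 << (c - offset - 1)
        if forbidden != FULL:
            lowest_zero = ~forbidden & (forbidden + 1)   # isolates the lowest 0 bit
            return offset + lowest_zero.bit_length()
        offset += 32                                  # window full, try the next 32


print(color_vertex_bits([1, 2, 3]))                   # -> 4
```

The mask fits in one machine word, which is what allows it to live in a register, at the cost of re-traversing the neighbors when a window is full.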
7 Edge-Based Coloring (EB)
- Threads traverse edges, with only partial information about each vertex
- 3 phases
  - Assign Colors: vertex-based
  - Detect Conflicts: edge-based
  - Forbid Colors: edge-based
- Per-vertex Forbidden sets
8 Edge-Based Coloring - Round 2
- Threads traverse edges, with only partial information
- 3 phases: Assign Colors, Detect Conflicts, Forbid Colors
9 Edge-Based Coloring - Round 3
- Threads traverse edges, with only partial information
- 3 phases: Assign Colors, Detect Conflicts, Forbid Colors
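The three phases can be traced with a small sequential simulation. It is a sketch of the unoptimized scheme only, assuming an edge list without self-loops, per-vertex Python sets as the Forbidden sets, and the lower vertex id winning a conflict; none of these details are stated on the slides.

```python
def edge_based_coloring(n, edges):
    """Repeat assign / detect / forbid until every vertex is colored."""
    color = [0] * n                                # 0 = uncolored
    forbidden = [set() for _ in range(n)]
    rounds = 0
    while 0 in color:
        rounds += 1
        # Phase 1, assign colors (vertex-based): smallest color not forbidden
        for v in range(n):
            if color[v] == 0:
                c = 1
                while c in forbidden[v]:
                    c += 1
                color[v] = c
        # Phase 2, detect conflicts (edge-based): the higher endpoint loses
        losers = {max(u, v) for u, v in edges if color[u] == color[v]}
        for v in losers:
            color[v] = 0
        # Phase 3, forbid colors (edge-based): colored end forbids its color
        for u, v in edges:
            if color[u] and not color[v]:
                forbidden[v].add(color[u])
            elif color[v] and not color[u]:
                forbidden[u].add(color[v])
    return color, rounds


print(edge_based_coloring(3, [(0, 1), (1, 2), (0, 2)]))   # -> ([1, 2, 3], 3)
```

Even in this race-free simulation the triangle needs three rounds, because every uncolored vertex picks its color without seeing what its neighbors pick in the same round.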
10 Edge-Based Coloring Problems
- Memory: one Forbidden array per vertex, $O(|V| \delta)$
- Convergence issues: conflicts occur even when there are no race conditions
- High number of edge traversals, since $|E| \gg |V|$
11 EB Optimizations - Memory $O(2|V|)$
- Represent colors with 2 integers
  - The index of the set bit denotes the color, e.g. Color-10
  - To represent more than 32 colors, use color sets (CS): CS(0) = [1,32], CS(1) = [33,64]
  - Initially CS(v) = 0, Color(v) = 0
- Each vertex's Forbidden array can be represented with a single integer
  - Each iteration, a thread writes to Forbidden with an atomic OR, only if CS($v_1$) = CS($v_2$)
- Assign Colors becomes $O(1)$ with 2's complement
  - If no color is available in the current CS, increment CS and look for colors in the next CS
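A hedged sketch of this representation: each vertex carries a color-set index and a one-hot color word, and its Forbidden set is one 32-bit mask. The helper names are hypothetical, and mapping bit 0 to the first color of a set is an assumption; the slides do not fix the bit order.

```python
MASK32 = 0xFFFFFFFF


def forbid(forbidden, cs, bits, u, v):
    """Edge (u, v): forbid u's color on v, but only inside the same color set."""
    if bits[u] and cs[u] == cs[v]:
        forbidden[v] |= bits[u]          # would be an atomic OR on real hardware


def assign(forbidden, cs, bits, v):
    """O(1) color choice; returns False if the current color set is exhausted."""
    free = ~forbidden[v] & MASK32        # 1 bits mark available colors
    if free == 0:
        cs[v] += 1                       # move on to the next color set
        forbidden[v] = 0
        return False
    bits[v] = free & -free               # two's complement isolates the lowest 1 bit
    return True


def decode(cs_v, bits_v):
    """Turn (color set, one-hot word) back into a color number starting at 1."""
    return 32 * cs_v + bits_v.bit_length()


forbidden, cs, bits = [0, 0], [0, 0], [0, 0]
assign(forbidden, cs, bits, 0)           # vertex 0 takes the first color
forbid(forbidden, cs, bits, 0, 1)        # edge (0, 1) forbids it on vertex 1
assign(forbidden, cs, bits, 1)
print(decode(cs[0], bits[0]), decode(cs[1], bits[1]))   # -> 1 2
```

The expression `free & -free` is the standard lowest-set-bit idiom, which is one way to read "O(1) with 2's complement" on the slide.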
12 EB Optimizations - Convergence
- Coloring decisions ignore the colors assigned in the current iteration, causing many conflicts regardless of race conditions
- Tentative coloring for edges that have 2 uncolored vertices
  - A thread tentatively colors one of them and forbids that color on the other end
13 EB Optimizations - Edge Filtering
- $|E| \gg |V|$, so it is necessary to keep a worklist of edges during the algorithm's execution
- An edge can be filtered out if both of its end vertices are colored; see the paper for other cases
- Tried atomic operations or a parallel prefix sum (PPS) to create the worklist
  - PPS was better, and is used for the rest of the experiments
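Building the worklist with a prefix sum can be sketched as a stream compaction. The version below is sequential and applies only the "both ends colored" rule from the slide; in a parallel setting the flag, scan, and scatter steps would each be a parallel kernel. Function and variable names are assumptions.

```python
from itertools import accumulate


def filter_edges(edges, color):
    """Keep only edges that still have at least one uncolored endpoint."""
    # Step 1: flag the edges to keep (1) or drop (0)
    keep = [0 if (color[u] and color[v]) else 1 for u, v in edges]
    # Step 2: exclusive prefix sum gives every kept edge its output slot
    scan = list(accumulate(keep, initial=0))
    pos, total = scan[:-1], scan[-1]
    # Step 3: scatter; each write goes to a distinct slot, so no atomics are needed
    out = [None] * total
    for i, edge in enumerate(edges):
        if keep[i]:
            out[pos[i]] = edge
    return out


edges = [(0, 1), (1, 2), (2, 3)]
color = [1, 2, 0, 0]                     # vertices 2 and 3 are still uncolored
print(filter_edges(edges, color))        # -> [(1, 2), (2, 3)]
```

Unlike an atomic counter, the scan keeps the surviving edges in their original order and makes every output position known before any write happens.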
14 Experiments
- Publicly available implementations are under the KokkosKernels (TpetraKernels) package of Trilinos
- Experiments on the compton (Intel Xeon Phi, MIC) and shannon (GPU) clusters of Sandia National Labs
  - KNC-MIC: 57 cores, 4 hyperthreads at 1.1 GHz with 6 GB (always run with 228 threads), icc; compared against IPGC [Saule, 2012]
  - NVIDIA Tesla K20X GPUs: CC 3.5, 6 GB memory, CUDA; compared against cuSPARSE
- Experiments study
  - Performance of graph coloring: execution time
  - Coloring quality: number of colors
  - The effect of coloring on the execution time of a real application
15 Experiment Matrices
[Table: test matrices, with a measure of how irregular each graph is]
16 Coloring on GPU
[Figure: speedup w.r.t. cuSPARSE (log scale, higher is better) for EB, VB, VBBIT, and cuSPARSE on circuit, indochina, kron21, livejournal, hollywood, audikw, rgg24, europe, Bump, Queen; data labels: 63.87, 1.22, 11.01, 1.75, 2.38, 0.66, 0.41, 0.19, 0.44, 0.64]
[Figure: number of colors normalized w.r.t. cuSPARSE (lower is better) for the same algorithms and graphs; data labels: 0.02, 0.97, 0.19, 0.59, 0.95, 0.32, 0.39, 0.19, 0.33, 0.34]
- As irregularity increases, VB variants suffer from less coalesced access and load imbalance
- Execution time: EB < cuSPARSE for irregular graphs
- EB's quality is always better: up to 48x fewer colors on circuit
- VBBIT helps w.r.t. VB
- EB speedup on GPU: 1.49 w.r.t. cuSPARSE, 5.36 w.r.t. VBBIT
- Normalized number of colors w.r.t. sequential: cuSPARSE 3.57, EB 1.06, VB 1.16, VBBIT 1.16
17 Coloring on MIC
[Figure: speedup w.r.t. IPGC (higher is better) for EB, VB, VBBIT, and IPGC on circuit, indochina, kron21, livejournal, hollywood, audikw, rgg24, europe, Bump, Queen; data labels: 2.09, 0.26, 0.95, 1.08, 1.02, 1.49, 1.28, 0.84, 1.60, 1.62]
[Figure: number of colors normalized w.r.t. IPGC (lower is better) for the same algorithms and graphs]
- EB is worse than the VB variants: MIC has fewer threads, so it is more forgiving of load imbalance
- VB is better than IPGC: better caching
- VB is worse on graphs with lots of colors
- VBBIT does not help on MIC: caching vs. bit operations
- Speedup of VB: 1.11 w.r.t. IPGC, 4.68 w.r.t. EB
- Normalized number of colors w.r.t. sequential: IPGC 1.13, EB 1.05, VB 1.14, VBBIT
18 Multi-Threaded Gauss-Seidel
- Implemented Conjugate Gradient with a multi-threaded Gauss-Seidel preconditioner
- Gauss-Seidel is a very sequential algorithm
  - Coloring is used to find independent rows
  - Operations can then be done in parallel for independent rows
- More colors mean more synchronization and less work in parallel regions
- Run on GPUs with the coloring results of EB and cuSPARSE for 7 problems
  - Overall: 1.39 and 1.17 speedups w.r.t. cuSPARSE on Gauss-Seidel and overall solve time
  - 1.94 and 1.48 speedups on circuit
[Figure: bar chart comparing EB and cuSPARSE for MT-GS and Overall; data labels: 6.84%, 13.3%, 12.17%, 18.05%]
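How a coloring exposes parallelism in Gauss-Seidel can be shown on a tiny dense system. This is a plain sequential sketch, not the KokkosKernels preconditioner: the function name, the dense matrix format, and the example system are assumptions. Rows of the same color have no coupling between them, so the inner loop over one color class is the part that could run in parallel.

```python
def multicolor_gauss_seidel(A, b, color, sweeps=50):
    """Gauss-Seidel sweeps that visit the rows one color class at a time."""
    n = len(b)
    x = [0.0] * n
    classes = {}
    for i, c in enumerate(color):
        classes.setdefault(c, []).append(i)
    for _ in range(sweeps):
        for c in sorted(classes):            # one synchronization point per color
            for i in classes[c]:             # independent rows within a color
                s = sum(A[i][j] * x[j] for j in range(n) if j != i)
                x[i] = (b[i] - s) / A[i][i]
    return x


# Tridiagonal example whose exact solution is x = [1, 1, 1].
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
color = [1, 2, 1]                            # rows 0 and 2 are not coupled
print(multicolor_gauss_seidel(A, b, color))
```

The outer loop over colors is sequential, which is why a coloring with fewer colors leaves fewer synchronization points and more work per parallel region.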
19 Summary & Future Directions
- Proposed a new parallel edge-based graph coloring algorithm
- Proposed several improvements to the traditional speculative vertex-based algorithm
- Showed that the edge-based method is usually faster than vertex-based ones on GPUs, but slower on the Xeon Phi
- Showed that the edge-based algorithm obtains better-quality colorings than the other variants: up to 48x fewer colors than cuSPARSE
- Future: distance-2 coloring
20 For more information
- KokkosKernels:
  - Download through Trilinos: http://trilinos.org
  - Public git repository: http://github.com/trilinos
- Thanks to: NNSA ASC program, DOE ASCR SciDAC FASTMath Institute, ATDM
- Paper: M. Deveci, E.G. Boman, K.D. Devine, and S. Rajamanickam, Parallel Graph Coloring for Manycore Architectures, IPDPS 2016, to appear
21 References
- D. Zuckerman, Linear degree extractors and the inapproximability of max clique and chromatic number, in Proc 38th ACM Symp Theory of Computing, 2006.
- D. W. Matula, G. Marble, and J. Isaacson, Graph coloring algorithms, in Graph Theory and Computing, R. Read, Ed. Academic Press, 1972.
- A. H. Gebremedhin, D. Nguyen, M. M. A. Patwary, and A. Pothen, ColPack: Software for graph coloring and related problems in scientific computing, ACM Trans Math Softw, vol. 40, no. 1, pp. 1-31.
- M. T. Jones and P. Plassmann, A parallel graph coloring heuristic, SIAM J Sci Comput, vol. 14, no. 3, 1993.
- M. Luby, A simple parallel algorithm for the maximal independent set problem, SIAM J Comput, vol. 15, no. 4, 1986.
- A. H. Gebremedhin and F. Manne, Scalable parallel graph coloring algorithms, Concurrency: Practice & Experience, vol. 12, no. 12, 2000.
- D. Bozdag, A. H. Gebremedhin, F. Manne, E. G. Boman, and U. V. Catalyurek, A framework for scalable greedy coloring on distributed memory parallel computers, J Parallel Distrib Comp, vol. 68, no. 4, 2008.
- U. V. Catalyurek, J. Feo, A. H. Gebremedhin, M. Halappanavar, and A. Pothen, Graph coloring algorithms for multi-core and massively multithreaded architectures, Parallel Computing, vol. 38, no. 10, 2012.
- A. E. Sariyuce, E. Saule, and U. V. Catalyurek, Scalable hybrid implementation of graph coloring using MPI and OpenMP, in Proc IEEE 26th Internat Parallel & Distrib Proc Symp Workshops, 2012.
- E. Saule and U. V. Catalyurek, An early evaluation of the scalability of graph algorithms on the Intel MIC architecture, in Proc 26th IEEE Internat Parallel Distrib Proc Symp Workshops, 2012.
- M. Naumov, P. Castonguay, and J. Cohen, Parallel graph coloring with applications to the incomplete-LU factorization on the GPU, NVIDIA, Tech. Rep., 2015.
22 Overall Performance
[Figure: two panels, GPU and MIC; y-axis: # of graph instances; series: cuSPARSE, IPGC, VB, VB BIT, EB PPS]
- Overall on GPU: EB has 1.49 speedup w.r.t. cuSPARSE and 5.36 w.r.t. VBBIT
- Overall on MIC: VB has 1.11 speedup w.r.t. IPGC and 4.68 w.r.t. EB
23 Scaling
[Figure: time (ms) per million edges vs. # of edges (millions), for GPU and MIC; series: cuSPARSE, VB, EB PPS, VB BIT EF, IPGC, VB BIT, VB EF]
24 Overall Performance GEOMEAN
25 Coloring Results - GEOMEAN
- Average number of colors over 5 executions of each algorithm
26 Coloring Time
[Figure, GPU: speedup w.r.t. cuSPARSE (log scale) for EB, VB, VBBIT, and cuSPARSE on circuit, indochina, kron21, livejournal, hollywood, audikw, rgg24, europe, Bump, Queen; data labels: 63.87, 1.22, 11.01, 1.75, 2.38, 0.66, 0.41, 0.19, 0.44, 0.64]
[Figure, MIC: speedup w.r.t. IPGC for EB, VB, VBBIT, and IPGC on the same graphs; data labels: 2.09, 0.26, 0.95, 1.08, 1.02, 1.49, 1.28, 0.84, 1.60, 1.62]
- As the graphs become more irregular, VB algorithms suffer from less coalesced access and load imbalance
- EB on GPU: 1.49 w.r.t. cuSPARSE, 5.36 w.r.t. VBBIT
- VB on MIC: 1.11 w.r.t. IPGC, 4.68 w.r.t. EB
- MIC has fewer threads, so it is more forgiving of thread load imbalance
27 Coloring Quality
[Figure, GPU: number of colors normalized w.r.t. cuSPARSE for EB, VB, VBBIT, and cuSPARSE on circuit, indochina, kron21, livejournal, hollywood, audikw, rgg24, europe, Bump, Queen; data labels: 0.02, 0.97, 0.19, 0.59, 0.95, 0.32, 0.39, 0.19, 0.33, 0.34]
[Figure, MIC: number of colors normalized w.r.t. IPGC for EB, VB, VBBIT, and IPGC on the same graphs]
Geometric mean of # colors normalized w.r.t. sequential:
- GPU: cuSPARSE 3.57, EB 1.06, VB 1.16, VBBIT 1.16
- MIC: IPGC 1.13, EB 1.05, VB 1.14, VBBIT
28 Multi-Threaded Gauss-Seidel
- 1.84, 1.37 speedups on SGS, PCG
- 1.29, 1.07 speedups
- 1.25, 1.09 speedups
- 1.94, 1.48 speedups
- 1.32, 1.11 speedups
- 1.23, 1.07 speedups
- 1.10, 1.08 speedups
Distributed NVAMG Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs Istvan Reguly (istvan.reguly at oerc.ox.ac.uk) Oxford e-research Centre NVIDIA Summer Internship
More informationOpenFOAM + GPGPU. İbrahim Özküçük
OpenFOAM + GPGPU İbrahim Özküçük Outline GPGPU vs CPU GPGPU plugins for OpenFOAM Overview of Discretization CUDA for FOAM Link (cufflink) Cusp & Thrust Libraries How Cufflink Works Performance data of
More informationA Script- Based Autotuning Compiler System to Generate High- Performance CUDA code
A Script- Based Autotuning Compiler System to Generate High- Performance CUDA code Malik Khan, Protonu Basu, Gabe Rudy, Mary Hall, Chun Chen, Jacqueline Chame Mo:va:on Challenges to programming the GPU
More informationc 2010 Society for Industrial and Applied Mathematics
SIAM J. SCI. COMPUT. Vol. 32, No. 4, pp. 2418 2446 c 2010 Society for Industrial and Applied Mathematics DISTRIBUTED-MEMORY PARALLEL ALGORITHMS FOR DISTANCE-2 COLORING AND RELATED PROBLEMS IN DERIVATIVE
More informationAmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015
AmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015 Agenda Introduction to AmgX Current Capabilities Scaling V2.0 Roadmap for the future 2 AmgX Fast, scalable linear solvers, emphasis on iterative
More informationParallel Implementation of Task Scheduling using Ant Colony Optimization
Parallel Implementaon of Task Scheduling using Ant Colony Opmizaon T. Vetri Selvan 1, Mrs. P. Chitra 2, Dr. P. Venkatesh 3 1 Thiagaraar College of Engineering /Department of Computer Science, Madurai,
More informationCompiler Optimization Intermediate Representation
Compiler Optimization Intermediate Representation Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology
More informationA Parallel Solver for Laplacian Matrices. Tristan Konolige (me) and Jed Brown
A Parallel Solver for Laplacian Matrices Tristan Konolige (me) and Jed Brown Graph Laplacian Matrices Covered by other speakers (hopefully) Useful in a variety of areas Graphs are getting very big Facebook
More informationSoft GPGPUs for Embedded FPGAS: An Architectural Evaluation
Soft GPGPUs for Embedded FPGAS: An Architectural Evaluation 2nd International Workshop on Overlay Architectures for FPGAs (OLAF) 2016 Kevin Andryc, Tedy Thomas and Russell Tessier University of Massachusetts
More informationGASPP: A GPU- Accelerated Stateful Packet Processing Framework
GASPP: A GPU- Accelerated Stateful Packet Processing Framework Giorgos Vasiliadis, FORTH- ICS, Greece Lazaros Koromilas, FORTH- ICS, Greece Michalis Polychronakis, Columbia University, USA So5ris Ioannidis,
More informationWhat is Search For? CS 188: Ar)ficial Intelligence. Constraint Sa)sfac)on Problems Sep 14, 2015
CS 188: Ar)ficial Intelligence Constraint Sa)sfac)on Problems Sep 14, 2015 What is Search For? Assump)ons about the world: a single agent, determinis)c ac)ons, fully observed state, discrete state space
More informationGPUML: Graphical processors for speeding up kernel machines
GPUML: Graphical processors for speeding up kernel machines http://www.umiacs.umd.edu/~balajiv/gpuml.htm Balaji Vasan Srinivasan, Qi Hu, Ramani Duraiswami Department of Computer Science, University of
More informationA performance portable implementation of HOMME via the Kokkos programming model
E x c e p t i o n a l s e r v i c e i n t h e n a t i o n a l i n t e re s t A performance portable implementation of HOMME via the Kokkos programming model L.Bertagna, M.Deakin, O.Guba, D.Sunderland,
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationStudy and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou
Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm
More informationEfficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI
Efficient AMG on Hybrid GPU Clusters ScicomP 2012 Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann Fraunhofer SCAI Illustration: Darin McInnis Motivation Sparse iterative solvers benefit from
More informationUnstructured Finite Volume Code on a Cluster with Mul6ple GPUs per Node
Unstructured Finite Volume Code on a Cluster with Mul6ple GPUs per Node Keith Obenschain & Andrew Corrigan Laboratory for Computa;onal Physics and Fluid Dynamics Naval Research Laboratory Washington DC,
More informationParallel Distance-k Coloring Algorithms for Numerical Optimization
Parallel Distance-k Coloring Algorithms for Numerical Optimization Assefaw Hadish Gebremedhin 1, Fredrik Manne 1, and Alex Pothen 2 1 Department of Informatics, University of Bergen, N-5020 Bergen, Norway
More informationAccelerating Implicit LS-DYNA with GPU
Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,
More informationExploiting Locality in Sparse Matrix-Matrix Multiplication on the Many Integrated Core Architecture
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Exploiting Locality in Sparse Matrix-Matrix Multiplication on the Many Integrated Core Architecture K. Akbudak a, C.Aykanat
More informationA Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT
A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT Daniel Schlifske ab and Henry Medeiros a a Marquette University, 1250 W Wisconsin Ave, Milwaukee,
More informationParallel Applications on Distributed Memory Systems. Le Yan HPC User LSU
Parallel Applications on Distributed Memory Systems Le Yan HPC User Services @ LSU Outline Distributed memory systems Message Passing Interface (MPI) Parallel applications 6/3/2015 LONI Parallel Programming
More informationParallel Distance-k Coloring Algorithms for Numerical Optimization
Parallel Distance-k Coloring Algorithms for Numerical Optimization Assefaw Hadish Gebremedhin Fredrik Manne Alex Pothen Abstract Matrix partitioning problems that arise in the efficient estimation of sparse
More informationUPCRC. Illiac. Gigascale System Research Center. Petascale computing. Cloud Computing Testbed (CCT) 2
Illiac UPCRC Petascale computing Gigascale System Research Center Cloud Computing Testbed (CCT) 2 www.parallel.illinois.edu Mul2 Core: All Computers Are Now Parallel We con'nue to have more transistors
More informationIntro to Parallel Computing
Outline Intro to Parallel Computing Remi Lehe Lawrence Berkeley National Laboratory Modern parallel architectures Parallelization between nodes: MPI Parallelization within one node: OpenMP Why use parallel
More informationLecture 2 Data Cube Basics
CompSci 590.6 Understanding Data: Theory and Applica>ons Lecture 2 Data Cube Basics Instructor: Sudeepa Roy Email: sudeepa@cs.duke.edu 1 Today s Papers 1. Gray- Chaudhuri- Bosworth- Layman- Reichart- Venkatrao-
More informationShared-memory Parallel Programming with Cilk Plus
Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 30 August 2018 Outline for Today Threaded programming
More informationIntroduc)on to High Performance Compu)ng Advanced Research Computing
Introduc)on to High Performance Compu)ng Advanced Research Computing Outline What cons)tutes high performance compu)ng (HPC)? When to consider HPC resources What kind of problems are typically solved?
More information