Parallel Graph Coloring for Many-core Architectures


1 Parallel Graph Coloring for Many-core Architectures
Mehmet Deveci, Erik Boman, Siva Rajamanickam
Sandia National Laboratories
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

2 Outline
- Performance-portable graph coloring algorithms and implementations that run on multi- and many-core platforms (e.g. Xeon Phi, GPUs), using the Kokkos library
- Study of implementation issues for vertex-based methods
- A new edge-based coloring algorithm
- Empirical comparisons on a range of graphs on both Xeon Phi and GPU
- Demonstrated impact of coloring on real solver applications
M. Deveci, E.G. Boman, K.D. Devine, and S. Rajamanickam, Parallel Graph Coloring for Manycore Architectures, IPDPS 2016, to appear

3 Problem Definition
- Given a graph $G = (V, E)$ with vertices $v \in V$ and edges $(v_1, v_2) \in E$, where $v_1, v_2 \in V$
- Distance-1 graph coloring assigns colors to vertices so that each vertex has a different color from all of its neighbors: $C : V \to \mathbb{N}$ with $C(v_1) \neq C(v_2)$ for all $(v_1, v_2) \in E$
- The number of distinct colors assigned to vertices is $|C|$
- The graph coloring problem that minimizes $|C|$ is NP-hard [Zuckerman, 2006]
- Applications: parallel computation, Jacobian computation, register allocation
Image courtesy of Sariyuce, Saule, Catalyurek, SIAM PP, 2012
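As a small illustration of the distance-1 definition, the sketch below checks whether a coloring is valid and counts the distinct colors it uses. The function names and the example graph are hypothetical and do not come from the slides.

```python
def is_valid_coloring(edges, color):
    """Return True if no edge joins two vertices that share a color."""
    return all(color[u] != color[v] for u, v in edges)


def num_colors(color):
    """Number of distinct colors used, i.e. |C| in the slide's notation."""
    return len(set(color.values()))


# Hypothetical example: a triangle (0, 1, 2) with a pendant vertex 3.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
coloring = {0: 1, 1: 2, 2: 3, 3: 1}
```

The triangle forces three colors, and the pendant vertex can reuse color 1 because it is only adjacent to vertex 2.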

4 Graph Coloring Heuristics
- Simple greedy heuristics often obtain near-optimal solutions
- First-fit [Matula, 1972], with $O(|V| + |E|)$ running time
  - Keeps a Forbidden array to store the colors of neighbors
  - Obtains $|C| \le \delta + 1$, where $\delta$ is the maximum degree in the graph
- Parallel implementations
  - Speculative method [Gebremedhin and Manne, 2000], [Bozdag, 2008]
  - [Jones and Plassmann, 1993], a parallelization of [Luby, 1986]
  - Distributed implementations: [Catalyurek, 2012]
  - Hybrid MPI+OpenMP implementations: [Sariyuce, 2012]
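The first-fit heuristic with a Forbidden array can be sketched as follows. This is a sequential illustration under stated assumptions: the graph is an adjacency list, color 0 means uncolored, and the array is stamped with the current vertex id so it never needs to be reset. The name `first_fit` is hypothetical.

```python
def first_fit(adj):
    """Greedy first-fit coloring; adj[v] lists the neighbors of vertex v."""
    n = len(adj)
    color = [0] * n  # 0 means "not colored yet"; real colors start at 1
    max_deg = max((len(nbrs) for nbrs in adj), default=0)
    # forbidden[c] == v means color c is taken by some neighbor of v.
    forbidden = [-1] * (max_deg + 2)
    for v in range(n):
        for u in adj[v]:
            forbidden[color[u]] = v  # slot 0 absorbs uncolored neighbors
        c = 1
        while forbidden[c] == v:  # smallest color not used by a neighbor
            c += 1
        color[v] = c
    return color
```

Each vertex scans its adjacency list once and tries at most deg(v) + 1 colors, which gives the linear running time and the bound of at most δ + 1 colors.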

5 Many-core Coloring Heuristics
- Yet little work has been done on many-core architectures
- Xeon Phi: speculative method (IPGC) [Saule, 2012]
  - Speculatively color vertices in each thread
  - Detect conflicts due to race conditions and recolor them
- GPUs: NVIDIA cuSPARSE [Naumov, 2015]
  - Relaxation of Jones and Plassmann (JP), based on independent sets
  - Highly parallel, runs fast
  - But the number of colors found is usually very high
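The speculate, detect, recolor loop can be imitated sequentially. The sketch below is not the IPGC code: it assumes the worst case in which every vertex in the worklist reads a stale snapshot of the colors, and it breaks ties by letting the smaller vertex id keep its color. Both choices are assumptions made for illustration.

```python
def speculative_coloring(adj):
    """Round-based speculative coloring; returns (colors, rounds)."""
    n = len(adj)
    color = [0] * n  # 0 means uncolored
    worklist = list(range(n))
    rounds = 0
    while worklist:
        rounds += 1
        snapshot = color[:]  # what every "thread" sees at the start of the round
        for v in worklist:
            used = {snapshot[u] for u in adj[v]}
            c = 1
            while c in used:
                c += 1
            color[v] = c
        # Conflict detection: the endpoint with the larger id gives up its color.
        conflicts = [
            v for v in worklist
            if any(u < v and color[u] == color[v] for u in adj[v])
        ]
        for v in conflicts:
            color[v] = 0
        worklist = conflicts
    return color, rounds
```

The smallest vertex id in each round always keeps its color, so the worklist shrinks every round and the loop terminates.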

6 Vertex-Based Coloring Algorithms
- The minimum unit of work is a vertex, so each vertex is owned by a single thread: IPGC, cuSPARSE
- Some implementation details are often ignored, e.g. the requirement of a thread-private Forbidden array
  - $O(\delta)$ storage can be a problem on highly irregular graphs, or when the number of threads is high
- Optimization: limit the size of the Forbidden array, e.g. to a constant size of 32 (called VB)
  - Traverse the adjacency list multiple times: first for the vertices with colors 1-32, then ...
  - On GPUs this array can be stored in slow local memory
- Use the bits of a single int (called VBBIT)
  - Requires converting back and forth to the bit representation
  - Stored in registers on the GPU rather than in slow memory
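A minimal sketch of the bit-based idea, assuming a 32-color window and a sequential driver: the Forbidden set for the current window is one integer, and the adjacency list is rescanned with the window shifted by 32 whenever all 32 colors are taken. The names `pick_color_bitwise` and `vb_bit` are hypothetical, and this is not the KokkosKernels implementation.

```python
WINDOW = 32
FULL = (1 << WINDOW) - 1  # all 32 colors of the window are forbidden


def pick_color_bitwise(v, adj, color):
    """Smallest color not used by v's neighbors, scanning 32 colors at a time."""
    offset = 0
    while True:
        forbidden = 0
        for u in adj[v]:
            c = color[u] - offset
            if 1 <= c <= WINDOW:  # neighbor's color falls in this window
                forbidden |= 1 << (c - 1)
        if forbidden != FULL:
            free = ~forbidden & FULL
            lowest = free & -free  # isolate the lowest free bit
            return offset + lowest.bit_length()
        offset += WINDOW  # window exhausted: rescan for the next 32 colors


def vb_bit(adj):
    """Sequential driver that colors vertices in index order."""
    color = [0] * len(adj)
    for v in range(len(adj)):
        color[v] = pick_color_bitwise(v, adj, color)
    return color
```

A vertex with few colored neighbors needs a single pass; only vertices whose neighbors exhaust a full window pay for extra traversals.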

7 Edge-Based Coloring (EB)
- Threads traverse edges, with only partial information about each vertex
- 3 phases:
  - Assign Colors: vertex-based
  - Detect Conflicts: edge-based
  - Forbid Colors: edge-based
- Per-vertex Forbidden sets

8 Edge-Based Coloring - Round 2
- Threads traverse edges, with only partial information
- 3 phases: Assign Colors, Detect Conflicts, Forbid Colors

9 Edge-Based Coloring - Round 3
- Threads traverse edges, with only partial information
- 3 phases: Assign Colors, Detect Conflicts, Forbid Colors
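The three phases can be sketched as a sequential round loop. This is an illustration under stated assumptions, not the authors' implementation: Forbidden sets are plain Python sets, the graph has no self-loops, and the endpoint with the larger id loses a conflict. The name `edge_based_coloring` is hypothetical.

```python
def edge_based_coloring(num_vertices, edges):
    """Round-based edge-centric coloring; returns (colors, rounds)."""
    color = [0] * num_vertices  # 0 means uncolored
    forbidden = [set() for _ in range(num_vertices)]
    rounds = 0
    while 0 in color:
        rounds += 1
        # Phase 1 (vertex-based): smallest color outside the Forbidden set.
        for v in range(num_vertices):
            if color[v] == 0:
                c = 1
                while c in forbidden[v]:
                    c += 1
                color[v] = c
        # Phase 2 (edge-based): on a conflict, the larger id is uncolored.
        losers = {max(u, v) for u, v in edges if color[u] == color[v]}
        for v in losers:
            color[v] = 0
        # Phase 3 (edge-based): a colored endpoint forbids its color
        # on an uncolored neighbor.
        for u, v in edges:
            if color[u] and not color[v]:
                forbidden[v].add(color[u])
            elif color[v] and not color[u]:
                forbidden[u].add(color[v])
    return color, rounds
```

In the first round every vertex picks color 1, so nearly every edge conflicts. This happens even without any race condition.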

10 Edge-Based Coloring Problems
- Memory: 1 Forbidden array per vertex, $O(|V| \delta)$
- Convergence issues: conflicts occur even when there are no race conditions
- High number of edge traversals: $|E| \gg |V|$

11 EB Optimizations - Memory $O(2|V|)$
- Represent colors with 2 integers
  - The index of the set bit denotes the color, e.g. Color-10 = ...
  - To represent more than 32 colors, use color sets (CS): CS(0) = [1,32], CS(1) = [33,64]
  - Initially CS(v) = 0, Color(v) = 0
- Each vertex's Forbidden array can be represented with a single integer
  - In each iteration, a thread writes to Forbidden with an atomic OR only if CS($v_1$) = CS($v_2$)
- Assign Colors becomes $O(1)$ with 2's complement
- If no color is available in the current CS, increment CS and look for colors in the next CS
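The two-integer representation can be sketched as below. It assumes that color 1 maps to bit 0 of color set 0; the slides do not show the exact bit indexing, so that mapping and all function names are hypothetical. The atomic OR is shown as a plain OR because the sketch is sequential.

```python
BITS = 32
MASK = (1 << BITS) - 1


def encode(color):
    """Map a 1-based color to (color_set, bit); assumes color 1 is bit 0."""
    return (color - 1) // BITS, 1 << ((color - 1) % BITS)


def decode(color_set, bit):
    """Inverse of encode for a mask with exactly one bit set."""
    return color_set * BITS + bit.bit_length()


def forbid(cs_v, forbidden_v, cs_u, bit_u):
    """OR a neighbor's color bit into v's mask only if the color sets match."""
    return forbidden_v | bit_u if cs_u == cs_v else forbidden_v


def assign(cs_v, forbidden_v):
    """Constant-time pick of the lowest free bit via two's complement.

    Returns (color_set, bit, forbidden); bit == 0 means the current set was
    full and the vertex moved on to the next color set, still uncolored.
    """
    if forbidden_v == MASK:
        return cs_v + 1, 0, 0
    free = ~forbidden_v & MASK
    return cs_v, free & -free, forbidden_v
```

For example, with colors 1 to 3 forbidden in set 0, `assign(0, 0b0111)` selects the bit that `decode` maps back to color 4.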

12 EB Optimizations - Convergence
- Coloring decisions ignore the colors assigned in the current iteration, causing many conflicts regardless of race conditions
- Tentative coloring for the edges that have 2 uncolored vertices: threads tentatively color one of them and forbid that color on the other end

13 EB Optimizations - Edge Filtering
- $|E| \gg |V|$, so it is necessary to keep a worklist during the algorithm's execution
- An edge can be filtered out if both of its endpoints are colored; see the paper for other cases
- Tried atomics or a parallel prefix sum (PPS) to create the worklist
- PPS was better and is used for the rest of the experiments
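Building the worklist with a prefix sum is a stream compaction: flag the edges to keep, take an exclusive scan of the flags to get each survivor's output slot, then scatter. The sketch below runs the scan sequentially with `itertools.accumulate`; the predicate only covers the "both endpoints colored" case named above, and the function name is hypothetical.

```python
from itertools import accumulate


def compact_with_prefix_sum(edges, keep):
    """Return the edges for which keep(edge) is true, preserving order."""
    flags = [1 if keep(e) else 0 for e in edges]
    inclusive = list(accumulate(flags))
    # Exclusive scan: number of kept edges strictly before position i.
    offsets = [s - f for s, f in zip(inclusive, flags)]
    total = inclusive[-1] if inclusive else 0
    out = [None] * total
    for i, e in enumerate(edges):  # every i is independent, so this parallelizes
        if flags[i]:
            out[offsets[i]] = e
    return out


# Hypothetical state: vertices 0 and 1 are colored, 2 and 3 are not.
color = [1, 2, 0, 0]
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
worklist = compact_with_prefix_sum(
    edges, lambda e: not (color[e[0]] and color[e[1]])
)
```

Unlike an atomic counter, the scan gives every surviving edge a fixed slot, so the output order is deterministic.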

14 Experiments
- Publicly available implementations are under the KokkosKernels (TpetraKernels) package of Trilinos
- Experiments on the compton (Intel Xeon Phi, MIC) and shannon (GPU) clusters of Sandia National Labs
- KNC-MIC: 57 cores, 4 hyperthreads at 1.1 GHz with 6 GB (always run with 228 threads), icc; compared against IPGC [Saule, 2012]
- NVIDIA Tesla K20X GPUs: CC 3.5, 6 GB memory, CUDA; compared against cuSPARSE
- Experiments study:
  - Performance of graph coloring: execution time
  - Coloring quality: number of colors
  - The effect of coloring on the execution time of a real application

15 Experiment Matrices As a measure for how irregular the graph is 15

16 Coloring on GPU
[Figure: speedup w.r.t. cuSPARSE (higher is better) and number of colors normalized w.r.t. cuSPARSE (lower is better) for EB, VB, VBBIT, and cuSPARSE on circuit, indochina, kron21, livejournal, hollywood, audikw, rgg24, europe, Bump, Queen. Data labels as extracted, speedup: 63.87, 1.22, 11.01, 1.75, 2.38, 0.66, 0.41, 0.19, 0.44, 0.64; normalized # colors: 0.02, 0.97, 0.19, 0.59, 0.95, 0.32, 0.39, 0.19, 0.33, 0.34]
- As irregularity increases, the VB variants suffer from less coalesced access and load imbalance
- Execution time: EB < cuSPARSE for irregular graphs, but EB's quality is always better
- Up to 48x fewer colors on circuit
- VBBIT helps w.r.t. VB
- EB on GPU: 1.49 speedup w.r.t. cuSPARSE, 5.36 w.r.t. VBBIT
- Normalized # colors w.r.t. sequential: cuSPARSE 3.57, EB 1.06, VB 1.16, VBBIT 1.16

17 Coloring on MIC
[Figure: speedup w.r.t. IPGC (higher is better) and number of colors normalized w.r.t. IPGC (lower is better) for EB, VB, VBBIT, and IPGC on circuit, indochina, kron21, livejournal, hollywood, audikw, rgg24, europe, Bump, Queen. Data labels as extracted, speedup: 2.09, 0.26, 0.95, 1.08, 1.02, 1.49, 1.28, 0.84, 1.60, 1.62; normalized # colors: 0.87, 1.14, 0.79, 1.03, 1.03, 0.99, 0.86, 1.01, 1.02, 1.04, 0.96, 0.96, 0.94, 1.01, 0.96, 0.90, 0.94, 0.94]
- EB is worse than the VB variants
- MIC has fewer threads, so it is more forgiving of load imbalance
- VB is better than IPGC: better caching
- VB is worse on graphs with lots of colors
- VBBIT does not help on MIC: caching vs. bit operations
- Speedup of VB: 1.11 w.r.t. IPGC, 4.68 w.r.t. EB
- Normalized # colors w.r.t. sequential: IPGC 1.13, EB 1.05, VB 1.14, VBBIT ...

18 Multi-Threaded Gauss-Seidel
- Implemented Conjugate Gradient; preconditioner: multi-threaded Gauss-Seidel
- Gauss-Seidel is a very sequential algorithm
- Coloring is used to find independent rows; operations can then be done in parallel for independent rows
- More colors mean more synchronization and less work in parallel regions
- Run on GPUs with the coloring results of EB and cuSPARSE for 7 problems
- Overall: 1.39 and 1.17 speedups w.r.t. cuSPARSE on Gauss-Seidel and overall solve time
- 1.94 and 1.48 speedups on circuit
[Figure: bar chart comparing EB and cuSPARSE; data labels as extracted: 6.84 and 13.3 for MT.GS, 12.17 and 18.05 for Overall]
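How coloring exposes parallelism in Gauss-Seidel can be sketched on a small dense system. This illustration assumes a coloring of the matrix graph in which rows of the same color share no off-diagonal nonzero, so all rows of one color can be updated together from the same state of `x`. The function names are hypothetical, and the loop over one color is sequential here although it is the part that would run in parallel.

```python
def rows_by_color(colors):
    """Group row indices by color, in increasing color order."""
    groups = {}
    for row, c in enumerate(colors):
        groups.setdefault(c, []).append(row)
    return [groups[c] for c in sorted(groups)]


def multicolor_gauss_seidel(A, b, colors, sweeps=50):
    """Gauss-Seidel sweeps where each color class is one parallel step."""
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for rows in rows_by_color(colors):  # one synchronization per color
            # Rows of one color do not reference each other, so these
            # updates are independent of the order within the class.
            updates = {
                i: (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                for i in rows
            }
            for i, value in updates.items():
                x[i] = value
    return x


# Hypothetical 4x4 tridiagonal system whose exact solution is [1, 2, 3, 4].
A = [[4, -1, 0, 0], [-1, 4, -1, 0], [0, -1, 4, -1], [0, 0, -1, 4]]
b = [2, 4, 6, 13]
x = multicolor_gauss_seidel(A, b, colors=[1, 2, 1, 2])
```

Each sweep needs one synchronization point per color, which is why a coloring with fewer colors leaves more work inside each parallel region.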

19 Summary & Future Directions
- Proposed a new parallel edge-based graph coloring algorithm
- Proposed several improvements to the traditional speculative vertex-based algorithm
- Showed that the edge-based method is usually faster than the vertex-based ones on GPUs, but slower on the Xeon Phi
- Showed that the edge-based algorithm obtains better-quality colorings than the other variants: up to 48x fewer colors than cuSPARSE
- Future: distance-2 coloring

20 For more information
- KokkosKernels: download through Trilinos: http://trilinos.org
- Public git repository: http://github.com/trilinos
- For more information:
- Thanks to: NNSA ASC program, DOE ASCR SciDAC FASTMath Institute, ATDM
- Paper: M. Deveci, E.G. Boman, K.D. Devine, and S. Rajamanickam, Parallel Graph Coloring for Manycore Architectures, IPDPS 2016, to appear

21 References
- D. Zuckerman, Linear degree extractors and the inapproximability of max clique and chromatic number, in Proc 38th ACM Symp Theory of Computing, 2006, pp.
- D. W. Matula, G. Marble, and J. Isaacson, Graph coloring algorithms, in Graph Theory and Computing, R. Read, Ed. Academic Press, 1972, pp.
- A. H. Gebremedhin, D. Nguyen, M. M. A. Patwary, and A. Pothen, Colpack: Software for graph coloring and related problems in scientific computing, ACM Trans Math Softw, vol. 40, no. 1, pp. 1-31,
- M. T. Jones and P. Plassmann, A parallel graph coloring heuristic, SIAM J Sci Comput, vol. 14, no. 3, pp. , 1993
- M. Luby, A simple parallel algorithm for the maximal independent set problem, SIAM J Comput, vol. 15, no. 4, pp. , 1986
- A. H. Gebremedhin and F. Manne, Scalable parallel graph coloring algorithms, Concurrency: Practice & Experience, vol. 12, no. 12, pp. , 2000
- D. Bozdag, A. H. Gebremedhin, F. Manne, E. G. Boman, and U. V. Catalyurek, A framework for scalable greedy coloring on distributed memory parallel computers, J Parallel Distrib Comp, vol. 68, no. 4, pp. , 2008
- U. V. Catalyurek, J. Feo, A. H. Gebremedhin, M. Halappanavar, and A. Pothen, Graph coloring algorithms for multi-core and massively multithreaded architectures, Parallel Computing, vol. 38, no. 10, pp. , 2012
- A. E. Sariyuce, E. Saule, and U. V. Catalyurek, Scalable hybrid implementation of graph coloring using MPI and OpenMP, in Proc IEEE 26th Internat Parallel & Distrib Proc Symp Workshops, 2012, pp.
- E. Saule and U. V. Catalyurek, An early evaluation of the scalability of graph algorithms on the Intel MIC architecture, in Proc 26th IEEE Internat Parallel Distrib Proc Symp Workshops, 2012, pp.
- M. Naumov, P. Castonguay, and J. Cohen, Parallel graph coloring with applications to the incomplete-LU factorization on the GPU, NVIDIA, Tech. Rep., 2015

22 Overall Performance
[Figure: performance profiles on GPU and MIC; y-axis: # of graph instances; series: cuSPARSE, IPGC, VB, VB BIT, EB PPS]
- Overall: EB has a 1.49 speedup w.r.t. cuSPARSE and 5.36 w.r.t. VBBIT on GPU
- VB has a 1.11 speedup w.r.t. IPGC and 4.68 w.r.t. EB on MIC

23 Scaling
[Figure: time (ms) per million edges vs. # edges (millions), with GPU and MIC panels; series: cuSPARSE, VB, EB PPS, VB BIT EF, IPGC, VB BIT, VB EF]

24 Overall Performance GEOMEAN

25 Coloring Results - GEOMEAN
Average number of colors over 5 executions of each algorithm

26 Coloring Time
[Figure: speedup w.r.t. cuSPARSE on GPU (EB, VB, VBBIT, cuSPARSE) and speedup w.r.t. IPGC on MIC (EB, VB, VBBIT, IPGC) for circuit, indochina, kron21, livejournal, hollywood, audikw, rgg24, europe, Bump, Queen. Data labels as extracted, GPU: 63.87, 1.22, 11.01, 1.75, 2.38, 0.66, 0.41, 0.19, 0.44, 0.64; MIC: 2.09, 0.26, 0.95, 1.08, 1.02, 1.49, 1.28, 0.84, 1.60, 1.62]
- As the graphs become more irregular, the VB algorithms suffer from less coalesced access and load imbalance
- EB on GPU: 1.49 speedup w.r.t. cuSPARSE, 5.36 w.r.t. VBBIT
- VB on MIC: 1.11 speedup w.r.t. IPGC, 4.68 w.r.t. EB
- MIC has fewer threads and is more forgiving of thread load imbalance

27 Coloring Quality
[Figure: number of colors normalized w.r.t. cuSPARSE on GPU (EB, VB, VBBIT, cuSPARSE) and w.r.t. IPGC on MIC (EB, VB, VBBIT, IPGC) for circuit, indochina, kron21, livejournal, hollywood, audikw, rgg24, europe, Bump, Queen. Data labels as extracted, GPU: 0.02, 0.97, 0.19, 0.59, 0.95, 0.32, 0.39, 0.19, 0.33, 0.34; MIC: 0.87, 1.14, 0.79, 1.03, 1.03, 0.99, 0.86, 1.01, 1.02, 1.04, 0.96, 0.96, 0.94, 1.01, 0.96, 0.90, 0.94, 0.94]
Geometric mean of # colors normalized w.r.t. sequential:
- GPU: cuSPARSE 3.57, EB 1.06, VB 1.16, VBBIT 1.16
- MIC: IPGC 1.13, EB 1.05, VB 1.14, VBBIT ...

28 Multi-Threaded Gauss-Seidel
Speedups on SGS and PCG, respectively:
- 1.84, 1.37
- 1.29, 1.07
- 1.25, 1.09
- 1.94, 1.48
- 1.32, 1.11
- 1.23, 1.07
- 1.10, 1.08


Decision making for autonomous naviga2on. Anoop Aroor Advisor: Susan Epstein CUNY Graduate Center, Computer science Decision making for autonomous naviga2on Anoop Aroor Advisor: Susan Epstein CUNY Graduate Center, Computer science Overview Naviga2on and Mobile robots Decision- making techniques for naviga2on Building

More information
