Tree-based methods on GPUs


Tree-based Methods on GPUs
Felipe Cruz (Department of Mathematics, University of Bristol)
Matthew Knepley (Computation Institute, University of Chicago; Department of Molecular Biology and Physiology, Rush University Medical Center)
SIAM CS&E, Miami, FL, March 2, 2009

Outline
1. Introduction
2. Short Introduction to FMM
3. Serial Implementation
4. Complexity Analysis
5. Multicore Computing
6. An Interface for Multicore Programs

Introduction: The Scientific Computing Challenge
How do we create reusable implementations which are also efficient?

Introduction: A Scientific Computing Insight
Structures are conserved, but tradeoffs change.

Introduction: Structure vs. Tradeoffs
This is how PETSc works:
- The sparse matrix-vector product has a common structure
- Different storage formats are chosen based upon the architecture and the PDE
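
As an illustration of "common structure, architecture-specific storage", here is a minimal compressed sparse row (CSR) matrix-vector product. This is a generic sketch, not PETSc's actual data structure or API; a GPU-friendly format such as ELLPACK would keep the same outer loop shape while changing only the storage walked by the inner loop.

#include <vector>
#include <cstddef>

// Compressed Sparse Row storage: a common choice on cache-based CPUs.
struct CSRMatrix {
  std::vector<int>    rowStart;   // length nrows + 1
  std::vector<int>    col;        // column index of each nonzero
  std::vector<double> val;        // nonzero values
};

// y = A * x; y must already have nrows entries.
void spmv(const CSRMatrix& A, const std::vector<double>& x,
          std::vector<double>& y) {
  for (std::size_t i = 0; i + 1 < A.rowStart.size(); ++i) {
    double sum = 0.0;
    for (int k = A.rowStart[i]; k < A.rowStart[i + 1]; ++k)
      sum += A.val[k] * x[A.col[k]];
    y[i] = sum;
  }
}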

Introduction: Structure vs. Tradeoffs
Ax = b, using the Krylov sequence { b, Ab, A(Ab), A(A(Ab)), ... }
This is how PETSc works:
- Krylov solvers have a common structure
- Different solvers are chosen based upon the problem characteristics and the architecture
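
The same point for Krylov methods, as a sketch with a hypothetical interface: the solver sees the operator only through an abstract matvec, so the storage format and architecture underneath can change freely.

#include <functional>
#include <vector>

using Vec    = std::vector<double>;
using MatVec = std::function<void(const Vec&, Vec&)>;  // y = A * x

// Build the (unorthogonalized) Krylov basis {b, Ab, A^2 b, ...}.
std::vector<Vec> krylovBasis(const MatVec& applyA, const Vec& b, int m) {
  std::vector<Vec> K;
  K.push_back(b);
  for (int j = 1; j < m; ++j) {
    Vec next(b.size());
    applyA(K.back(), next);  // the only contact with the matrix format
    K.push_back(next);
  }
  return K;
}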

Introduction: Structure vs. Tradeoffs
This is how treecodes work:
- Hierarchical algorithms have a common structure
- Different analytical and geometric decisions depend upon the problem configuration and the accuracy requirements

Introduction: Structure vs. Tradeoffs
This is how biology works:
- For ion channels, Nature uses the same protein building blocks and energetic balances
- Different energy terms predominate for different uses

Introduction: Representation Hierarchy
Divide the work into levels:
- Model
- Algorithm
- Implementation

Representation Hierarchy: Spiral Project
- Model: Discrete Fourier Transform (DSP)
- Algorithm: Fast Fourier Transform (SPL)
- Implementation: C Implementation (SPL Compiler)

Representation Hierarchy: FLAME Project
- Model: Abstract LA (PME/Invariants)
- Algorithm: Basic LA (FLAME/FLASH)
- Implementation: Scheduling (SuperMatrix)

Representation Hierarchy: FEniCS Project
- Model: Navier-Stokes (FFC)
- Algorithm: Finite Element (FIAT)
- Implementation: Integration/Assembly (FErari)

Representation Hierarchy: Treecodes
- Model: Kernels with decay (Coulomb)
- Algorithm: Treecodes (PetFMM)
- Implementation: Scheduling (PetFMM-GPU)
Each level demands a strong abstraction layer.

Spiral
Spiral Team:
- Uses an intermediate language, SPL, and then generates C
- Works by circumscribing the algorithmic domain

FLAME & FLASH
[Figure: performance of the matrix-matrix product (C = C + A*B), GFLOPS vs. matrix size; algorithm-by-blocks on four T10 processors vs. CUBLAS sgemm on a single T10 processor vs. MKL sgemm on an Intel Xeon QuadCore (4 cores). Robert van de Geijn]
- FLAME is an Algorithm-By-Blocks interface
- FLASH/SuperMatrix is a runtime system

Outline: Short Introduction to FMM
- Spatial Decomposition
- Data Decomposition

FMM in Sieve (Spatial Decomposition)
- The Quadtree is a Sieve with optimized operations
- Multipoles are stored in Sections
- Two Overlaps are defined: Neighbors and the Interaction List
- Completion moves data for Neighbors and for the Interaction List

Tree Interface
- locateBlob(blob): locate a point in the tree
- fillNeighbors(): compute the neighbor section
- findInteractionList(): compute the interaction-list cell section and allocate the value section
- fillInteractionList(level): compute the interaction-list value section
- fill(blobs): compute the blob section
- dump(): produce a verifiable representation of the tree
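
A C++ rendering of this interface as a sketch; the casing, types, and the Blob struct are assumptions mirroring the slide, not PetFMM's exact declarations.

#include <string>
#include <vector>

struct Blob { double x, y; double charge; };  // assumed 2D particle type

class Quadtree {
public:
  // Locate the leaf cell containing a point.
  virtual int  locateBlob(const Blob& blob) const = 0;
  // Compute the neighbor section for every cell.
  virtual void fillNeighbors() = 0;
  // Compute the interaction-list cell section and allocate values.
  virtual void findInteractionList() = 0;
  // Fill the interaction-list value section for one level.
  virtual void fillInteractionList(int level) = 0;
  // Bin the blobs into the blob section.
  virtual void fill(const std::vector<Blob>& blobs) = 0;
  // Produce a verifiable representation of the tree (for testing).
  virtual std::string dump() const = 0;
  virtual ~Quadtree() = default;
};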

Outline: Data Decomposition

FMM Sections (Data Decomposition)
FMM requires data over the Quadtree, distributed by:
- box: box centers, neighbors
- box + neighbors: blobs
- box + interaction list: interaction-list cells and values
- multipole and local coefficients
Notice this is multiscale, since data is divided at each level.
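
A minimal sketch of the Section idea (assumed layout, not the actual Sieve implementation): a Section maps each box to a variable-length run of values in one contiguous array, which is exactly the per-level, per-box layout needed for centers, blobs, and coefficients.

#include <vector>

template <typename Value>
class Section {
  std::vector<int>   offset_;  // values for cell c live in
  std::vector<Value> values_;  // values_[offset_[c] .. offset_[c+1])
public:
  explicit Section(const std::vector<int>& fiberDims) {
    offset_.assign(1, 0);
    for (int d : fiberDims) offset_.push_back(offset_.back() + d);
    values_.resize(offset_.back());
  }
  int getFiberDimension(int cell) const {
    return offset_[cell + 1] - offset_[cell];
  }
  // restrict(): read-only view of a cell's values.
  const Value* restrictCell(int cell) const {
    return &values_[offset_[cell]];
  }
  // update(): overwrite a cell's values.
  void update(int cell, const Value* v) {
    for (int i = 0; i < getFiberDimension(cell); ++i)
      values_[offset_[cell] + i] = v[i];
  }
};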

Outline: Serial Implementation

Serial Implementation: Evaluator Interface
- initializeExpansions(tree, blobInfo): generate multipole expansions on the lowest level. Requires a loop over cells, O(p).
- upwardSweep(tree): translate multipole expansions to intermediate levels. Requires a loop over cells and children (support), O(p^2).
- downwardSweep(tree): convert multipole to local expansions and translate local expansions on intermediate levels. Requires a loop over cells and parent (cone), O(p^2).

Serial Implementation: Evaluator Interface (continued)
- evaluateBlobs(tree, blobInfo): evaluate direct and local field interactions on the lowest level. Requires a loop over cells and neighbors (in section), O(p^2).
- evaluate(tree, blobs, blobInfo): calculate the complete interaction (multipole + direct); a sketch of this composition follows.
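
A minimal sketch of how evaluate() sequences the phases above. The signatures are assumptions; the slides name the phases but not the exact types.

template <typename Tree, typename BlobInfo>
void initializeExpansions(Tree&, BlobInfo&);   // P2M on the finest level
template <typename Tree> void upwardSweep(Tree&);    // M2M over children
template <typename Tree> void downwardSweep(Tree&);  // M2L, then L2L from parents
template <typename Tree, typename BlobInfo>
void evaluateBlobs(Tree&, BlobInfo&);          // L2P plus direct neighbor sums

// evaluate() is nothing more than the canonical FMM sweep order.
template <typename Tree, typename Blobs, typename BlobInfo>
void evaluate(Tree& tree, Blobs& /*blobs*/, BlobInfo& blobInfo) {
  initializeExpansions(tree, blobInfo);
  upwardSweep(tree);
  downwardSweep(tree);
  evaluateBlobs(tree, blobInfo);
}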

Serial Implementation: Kernel Interface
- P2M(t): multipole expansion coefficients
- L2P(t): local expansion coefficients
- M2M(t): multipole-to-multipole translation
- M2L(t): multipole-to-local translation
- L2L(t): local-to-local translation
- evaluate(blobs): direct interaction
The Evaluator is templated over the Kernel. There are alternative kernel-independent methods (kifmm3d).
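
The "Evaluator templated over Kernel" split can be pictured as below. The Tree methods and coefficient handling are assumptions; a kernel-independent variant (cf. kifmm3d) would simply supply a different Kernel type.

template <typename Kernel>
class Evaluator {
  Kernel k_;
public:
  // The traversal is kernel-agnostic: children are visited before
  // parents, and only the Kernel interprets the coefficients.
  template <typename Tree>
  void upwardSweep(Tree& tree) {
    for (auto cell : tree.postorder())
      for (auto child : tree.children(cell))
        k_.M2M(tree.center(child) - tree.center(cell),
               tree.multipole(child), tree.multipole(cell));
  }
  // downwardSweep() pairs M2L over the interaction list with L2L from
  // the parent; evaluateBlobs() pairs L2P with the direct evaluate().
};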

Outline: Complexity Analysis

Complexity Analysis: Greengard & Gropp
For a shared-memory machine,

  T = a N/P + b log_4 P + c N/(B P) + d N B/P + e(N, P)   (1)

where N is the number of particles, P the number of processors, and B the number of particles per box. The terms account for:
1. Initializing multipole expansions, the finest local expansions, and the final sum
2. The reduction bottleneck
3. Translation and multipole-to-local
4. Direct interaction
5. Low-order terms

A Parallel Version of the Fast Multipole Method, L. Greengard and W.D. Gropp, Comp. Math. Appl., 20(7), 1990.

Complexity Analysis: Distributed FMM
Additions for distributed computing:
- Partitioning: an explicit optimization problem to minimize communication volume and load imbalance
- Uses PETSc Sieve for parallelism

Outline: Multicore Computing

Question: What is the optimal number of particles per cell?
- Greengard & Gropp: minimize time and maximize parallel efficiency; B_opt = sqrt(c/d) = 30
- Gumerov & Duraiswami: follow GG, but also try to account for memory access; B_opt = 91, but instead they choose 320, which heavily weights the N^2 part of the computation
- We propose to cover up the bottleneck with direct evaluations
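
The GG optimum follows from (1) by treating B as the free parameter; only the multipole-to-local and direct terms depend on B:

  dT/dB = -c N/(B^2 P) + d N/P = 0  =>  B_opt = sqrt(c/d)

which with their measured constants gives B_opt = 30.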

Problem: Missing Concurrency
We can balance time in the direct evaluation against idle time for small grids.
- The direct evaluation takes time d N B/p
- Assume a single thread group works on the first L tree levels; serializing those levels costs time proportional to b 4^(L+1), since the number of cells down to level L is O(4^(L+1))
- Thus we need

  B >= b 4^(L+1) p / (d N)   (2)

in order to cover the bottleneck. In an upcoming publication, we show that this bound holds for all modern processors.

Problem: Missing Bandwidth
We can restructure the M2L to conserve bandwidth:
- Matrix-free application of M2L
- Reorganize the traversal to minimize bandwidth
- Old: pull in the 27 interaction MEs, transform to an LE, reduce
- New: pull in the cell ME, transform to 27 interaction LEs, partially reduce

Matrix-Free M2L
The M2L transformation applies the operator

  M_ij = (-1)^i t^{-(i+j+1)} C(i+j, j)   (3)

Notice that the exponent of t is constant along each antidiagonal (constant i + j). Thus we:
- divide by t once at each antidiagonal
- calculate the binomial coefficients C_ij by the recurrence along each antidiagonal
- carefully formulate complex division (the STL fails here)
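
A host-side sketch of the matrix-free application of (3); the coefficient convention is an assumption, and the real PetFMM operator may differ in signs and offsets. It walks antidiagonals s = i + j, divides the power of t once per antidiagonal, generates the binomials by the recurrence C(s, j) = C(s, j-1) (s - j + 1)/j, and uses Smith's algorithm for the complex division that naive (and some STL) implementations handle poorly.

#include <cmath>
#include <complex>
#include <vector>

using Cplx = std::complex<double>;

// Smith's algorithm: avoids the overflow/underflow of naive division.
static Cplx smithDiv(Cplx a, Cplx b) {
  double br = b.real(), bi = b.imag();
  if (std::fabs(br) >= std::fabs(bi)) {
    double r = bi / br, d = br + bi * r;
    return {(a.real() + a.imag() * r) / d, (a.imag() - a.real() * r) / d};
  }
  double r = br / bi, d = bi + br * r;
  return {(a.real() * r + a.imag()) / d, (a.imag() * r - a.real()) / d};
}

// L_j = sum_i M_ij a_i with M_ij = (-1)^i t^{-(i+j+1)} C(i+j, j),
// applied without ever forming the matrix M.
std::vector<Cplx> applyM2L(const std::vector<Cplx>& a, Cplx t) {
  int p = static_cast<int>(a.size());
  std::vector<Cplx> L(p, Cplx(0, 0));
  Cplx tPow = smithDiv(Cplx(1, 0), t);           // t^{-(s+1)} at s = 0
  for (int s = 0; s <= 2 * (p - 1); ++s) {
    double binom = 1.0;                          // C(s, 0)
    for (int j = 0; j <= s; ++j) {
      if (j > 0) binom *= double(s - j + 1) / j; // C(s, j) by recurrence
      int i = s - j;
      if (i < p && j < p) {
        double sign = (i % 2 == 0) ? 1.0 : -1.0; // (-1)^i
        L[j] += sign * binom * tPow * a[i];
      }
    }
    tPow = smithDiv(tPow, t);                    // next antidiagonal
  }
  return L;
}

Here t would be the complex offset from the source cell center to the target cell center.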

Outline: An Interface for Multicore Programs
- FLASH
- PetFMM


FLASH Design
- FLASH enables multicore computing through FLAME
- The LA interface is identical to FLAME
- FLAME executes operations immediately
- FLASH queues operations, and executes the queue on a user call (which does nothing in FLAME)

Cholesky Factorization (FLASH)

FLA_Part_2x2(A, &ATL, &ATR,
                &ABL, &ABR, 0, 0, FLA_TL);
while (FLA_Object_length(ATL) < FLA_Object_length(A)) {
  FLA_Repart_2x2_to_3x3(ATL, ATR, &A00, &A01, &A02,
                                  &A10, &A11, &A12,
                        ABL, ABR, &A20, &A21, &A22, 1, 1, FLA_BR);
  FLASH_Chol(FLA_UPPER_TRIANGULAR, A11);
  FLASH_Trsm(FLA_LEFT, FLA_UPPER_TRIANGULAR, FLA_TRANSPOSE,
             FLA_NONUNIT_DIAG, FLA_ONE, A11, A12);
  FLASH_Syrk(FLA_UPPER_TRIANGULAR, FLA_TRANSPOSE,
             FLA_MINUS_ONE, A12, FLA_ONE, A22);
  FLA_Cont_with_3x3_to_2x2(&ATL, &ATR, A00, A01, A02,
                                       A10, A11, A12,
                           &ABL, &ABR, A20, A21, A22, FLA_TL);
}
FLA_Queue_exec();


PetFMM-GPU
We break down sweep operations into Tasks:
- Cell loops are now tiled
- Tasks are queued
- We can form a DAG, since we know the dependence structure, so scheduling is possible
This asynchronous interface can enable:
- Overlapping direct and multipole calculations
- Reorganizing the downward sweep
- Adaptive expansions

GPU Classes
Section:
- size(): returns the number of values
- getFiberDimension(cell): returns the number of cell values
- restrict()/update(): retrieves and changes cell values
- clone()/extract(): converts between CPU and GPU objects
Evaluator:
- initializeExpansions()
- upwardSweep()
- downwardSweepTransform()
- downwardSweepTranslate()
- evaluateBlobs()
- evaluate()

GPU Classes (continued)
Task:
- input data size
- output data size
- dependencies (future)
TaskQueue:
- manages storage and offsets
- evaluate()
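
A sketch of the Task/TaskQueue pair; the fields mirror the slide, everything else is an assumption. Tasks record their data footprint so the queue can lay out contiguous GPU buffers, and the recorded dependencies are what would later induce the DAG.

#include <cstddef>
#include <vector>

struct Task {
  std::size_t inputSize;           // bytes read by the task
  std::size_t outputSize;          // bytes written by the task
  std::vector<int> dependencies;   // prerequisite task ids (future: DAG)
};

class TaskQueue {
  std::vector<Task>        tasks_;
  std::vector<std::size_t> inOffset_, outOffset_;  // into the GPU buffers
  std::size_t inBytes_ = 0, outBytes_ = 0;
public:
  // Enqueue a task and assign its slice of the contiguous buffers.
  int enqueue(const Task& t) {
    inOffset_.push_back(inBytes_);    inBytes_  += t.inputSize;
    outOffset_.push_back(outBytes_);  outBytes_ += t.outputSize;
    tasks_.push_back(t);
    return static_cast<int>(tasks_.size()) - 1;
  }
  // evaluate(): copy inputs to the device, launch one kernel per task
  // class (transform, reduce, ...), and copy the results back.
  void evaluate() { /* kernel launches elided in this sketch */ }
};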

Tasks
- Upward Sweep Task (cell block): in: cell and child centers, child multipole coefficients; out: cell multipole coefficients
- Downward Sweep Transform Task (cell block): in: cell and interaction-list centers, cell multipole coefficients; out: interaction-list temporary local coefficients
- Downward Sweep Reduce Task (cell block): in: interaction-list temporary local coefficients; out: cell temporary local coefficients
- Downward Sweep Expansion Task (cell block): in: cell and parent centers, cell temporary local coefficients, parent local coefficients; out: cell local coefficients
An earlier variant of the Transform Task read the interaction-list multipole coefficients and wrote the cell's temporary local coefficients; the version shown is the bandwidth-conserving "New" traversal from the Missing Bandwidth slide.

Transform Task
Shifts an interaction cell's multipole expansion to the cell's local expansion.
- Add a task for each interaction cell
- All tasks with the same origin are merged
Local memory (float values): 2(p+1) blocksize (Pascal) + 2p blocksize (LE) + 2p (ME)
- 8 terms: 4416 bytes
- 17 terms: 9096 bytes
Execution:
- 1 block per ME
- Each thread reads a section of the ME and the ME center
- Each thread computes an LE separately
- Each thread writes its LE to a separate global location
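
A CUDA sketch of this execution plan, simplified so that one thread computes the full LE for one interaction cell directly from operator (3); the names, float2 coefficients, and data layout are assumptions. It would be launched as transformTask<<<numME, threads, p * sizeof(float2)>>> with threads >= max(p, nIlist).

#include <cuda_runtime.h>

__device__ float2 cmul(float2 a, float2 b) {
  return make_float2(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x);
}
__device__ float2 cdiv(float2 a, float2 b) {  // naive; cf. Smith's algorithm
  float d = b.x * b.x + b.y * b.y;
  return make_float2((a.x * b.x + a.y * b.y) / d,
                     (a.y * b.x - a.x * b.y) / d);
}

// me: [numME][p] multipole expansions, one per block
// t:  [numME][nIlist] complex offsets to each interaction cell
// le: [numME][nIlist][p] output local expansions (reduced later)
__global__ void transformTask(const float2* me, const float2* t,
                              float2* le, int p, int nIlist) {
  extern __shared__ float2 sme[];                // this block's ME
  const float2* myME = me + blockIdx.x * p;
  for (int i = threadIdx.x; i < p; i += blockDim.x)
    sme[i] = myME[i];                            // stage the ME in shared
  __syncthreads();
  if (threadIdx.x >= nIlist) return;
  float2 tc = t[blockIdx.x * nIlist + threadIdx.x];
  float2* myLE = le + (blockIdx.x * nIlist + threadIdx.x) * p;
  // L_j = sum_i (-1)^i C(i+j, j) t^{-(i+j+1)} a_i  (direct, O(p^2))
  for (int j = 0; j < p; ++j) {
    float2 sum  = make_float2(0.f, 0.f);
    float  binom = 1.f;                          // C(j, j) at i = 0
    float2 tpow = make_float2(1.f, 0.f);
    for (int k = 0; k <= j; ++k) tpow = cdiv(tpow, tc);  // t^{-(j+1)}
    for (int i = 0; i < p; ++i) {
      if (i > 0) { binom *= float(i + j) / i; tpow = cdiv(tpow, tc); }
      float  s    = (i & 1) ? -1.f : 1.f;        // (-1)^i
      float2 term = cmul(tpow, sme[i]);
      sum.x += s * binom * term.x;
      sum.y += s * binom * term.y;
    }
    myLE[j] = sum;                               // separate global slot
  }
}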

Reduce Task
Adds up the local expansion contributions from each interaction cell.
- Add a task for each cell
Local memory (float values): 2 * terms (LE)
- 8 terms: 64 bytes
- 17 terms: 136 bytes
Execution:
- 1 block per output LE
- Each thread reads a section of an input LE
- Each thread adds to the shared output LE
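
A CUDA sketch of the reduction, assuming the temporary LEs for each output cell have been gathered contiguously; here each thread owns one coefficient index, a slight simplification of the shared-memory accumulation described above. Launched as reduceTask<<<numCells, p>>>.

// tempLE: [numCells][nIlist][p] temporary local expansions
// cellLE: [numCells][p] reduced local expansion per cell
__global__ void reduceTask(const float2* tempLE, float2* cellLE,
                           int p, int nIlist) {
  int j = threadIdx.x;                 // coefficient index
  if (j >= p) return;
  float2 sum = make_float2(0.f, 0.f);
  const float2* base = tempLE + blockIdx.x * nIlist * p;
  for (int c = 0; c < nIlist; ++c) {   // up to 27 cells for a 2D quadtree
    sum.x += base[c * p + j].x;
    sum.y += base[c * p + j].y;
  }
  cellLE[blockIdx.x * p + j] = sum;
}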

What's Important? (Conclusions)
Interface improvements bring concrete benefits.
- Facilitated code reuse: the serial code was largely reused, and the test infrastructure was completely reused
- Opportunities for performance improvement: overlapping computations, better task scheduling
- Expansion of capabilities: we could now combine the distributed and multicore implementations, and could replace local expansions with cheaper alternatives

Distributed FMM: Parallel Tree Implementation
- Divide the tree into a root tree and local trees
- Distribute the local trees among processes
- Provide a communication pattern for local sections (overlap): both neighbor and interaction-list overlaps
- Sieve generates MPI from a high-level description

Distributed FMM: Parallel Tree Implementation
How should we distribute trees?
- Multiple local trees per process allows good load balance
- Partition a weighted graph to minimize load imbalance and communication
Computation estimate (p expansion terms, N_i particles in cell i, n_I interaction-list cells, n_c children, dimension d):
- Leaf: N_i p (P2M) + n_I p^2 (M2L) + N_i p (L2P) + 3^d N_i^2 (P2P)
- Interior: n_c p^2 (M2M) + n_I p^2 (M2L) + n_c p^2 (L2L)
Communication estimate (for a cell at level k of an L-level tree):
- Diagonal: n_c (L - k - 1)
- Lateral: 2^d (2^(m(L-k-1)) - 1) / (2^m - 1) for incidence dimension m
Leverage existing work on graph partitioning: ParMetis
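
As an illustration, the computation estimates above translate directly into per-cell vertex weights for the partitioner; a sketch with assumed variable names:

#include <cmath>

// Ni: particles in leaf i, nI: interaction-list size, nc: children,
// p: expansion terms, d: spatial dimension (2 for a quadtree).
double leafWeight(double Ni, double nI, double p, int d) {
  return Ni * p                        // P2M
       + nI * p * p                    // M2L
       + Ni * p                        // L2P
       + std::pow(3.0, d) * Ni * Ni;   // P2P over the 3^d neighborhood
}

double interiorWeight(double nc, double nI, double p) {
  return nc * p * p                    // M2M
       + nI * p * p                    // M2L
       + nc * p * p;                   // L2L
}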

Distributed FMM: Why should a good partition exist?
Shang-Hua Teng, Provably good partitioning and load balancing algorithms for parallel adaptive N-body simulation, SIAM J. Sci. Comput., 19(2), 1998.
Good partitions exist for non-uniform distributions:
- 2D: O(sqrt(n) (log n)^(3/2)) edge cut
- 3D: O(n^(2/3) (log n)^(4/3)) edge cut
- As scalable as regular grids
- As efficient as uniform distributions
ParMetis will find a nearly optimal partition.

Distributed FMM: Will ParMetis find it?
George Karypis and Vipin Kumar, Analysis of Multilevel Graph Partitioning, Supercomputing, 1995.
Good partitions exist for non-uniform distributions:
- 2D: C_i = 1.24^i C_0 for random matching
- 3D: C_i = 1.21^i C_0 ?? for random matching; the 3D proof needs assurance that the average degree does not increase
- Efficient in practice

Distributed FMM: Advantages
- Simplicity
- Complete serial code reuse
- Provably good performance and scalability

Distributed FMM: Parallel Tree Interface
- fillNeighbors(): compute the neighbor overlap, and send neighbors
- findInteractionList(): compute the interaction-list overlap
- fillInteractionList(level): complete and copy into local interaction sections
- fill(blobs): now must scatter blobs to local trees; uses scatterBlobs() and gatherBlobs()

Distributed FMM: Parallel Data Movement
1. Complete the neighbor section
2. Upward sweep: upward sweep on the local trees, gather to the root tree, upward sweep on the root tree
3. Complete the interaction-list section
4. Downward sweep: downward sweep on the root tree, scatter to the local trees, downward sweep on the local trees

Distributed FMM: Parallel Evaluator Interface
- initializeExpansions(local trees, blobInfo): evaluate each local tree
- upwardSweep(local trees, partition, root tree): evaluate each local tree, then gather to the root tree
- downwardSweep(local trees, partition, root tree): scatter from the root tree, then evaluate each local tree
- evaluateBlobs(local trees, blobInfo): evaluate on all local trees
- evaluate(tree, blobs, blobInfo): identical to the serial version
