Tree-based methods on GPUs


Tree-based Methods on GPUs
Felipe Cruz (Department of Mathematics, University of Bristol)
Matthew Knepley (Computation Institute, University of Chicago; Department of Molecular Biology and Physiology, Rush University Medical Center)
SIAM CS&E, Miami, FL, March 2, 2009

Outline
1. Introduction
2. Short Introduction to FMM
3. Serial Implementation
4. Complexity Analysis
5. Multicore Computing
6. An Interface for Multicore Programs

Introduction: The Scientific Computing Challenge
How do we create reusable implementations which are also efficient?

Introduction: A Scientific Computing Insight
Structures are conserved, but tradeoffs change.

Introduction: Structure vs. Tradeoffs
This is how PETSc works:
- The sparse matrix-vector product has a common structure
- Different storage formats are chosen based upon the architecture and the PDE
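
As an illustration of "common structure, architecture-specific storage", here is a minimal compressed sparse row (CSR) matrix-vector product. This is a generic sketch, not PETSc's actual data structure or API; a GPU-friendly format such as ELLPACK would keep the same outer loop shape while changing only the storage walked by the inner loop.

#include <vector>
#include <cstddef>

// Compressed Sparse Row storage: a common choice on cache-based CPUs.
struct CSRMatrix {
  std::vector<int>    rowStart;   // length nrows + 1
  std::vector<int>    col;        // column index of each nonzero
  std::vector<double> val;        // nonzero values
};

// y = A * x; y must already have nrows entries.
void spmv(const CSRMatrix& A, const std::vector<double>& x,
          std::vector<double>& y) {
  for (std::size_t i = 0; i + 1 < A.rowStart.size(); ++i) {
    double sum = 0.0;
    for (int k = A.rowStart[i]; k < A.rowStart[i + 1]; ++k)
      sum += A.val[k] * x[A.col[k]];
    y[i] = sum;
  }
}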

Introduction: Structure vs. Tradeoffs
Ax = b, using the Krylov sequence { b, Ab, A(Ab), A(A(Ab)), ... }
This is how PETSc works:
- Krylov solvers have a common structure
- Different solvers are chosen based upon the problem characteristics and the architecture
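
The same point for Krylov methods, as a sketch with a hypothetical interface: the solver sees the operator only through an abstract matvec, so the storage format and architecture underneath can change freely.

#include <functional>
#include <vector>

using Vec    = std::vector<double>;
using MatVec = std::function<void(const Vec&, Vec&)>;  // y = A * x

// Build the (unorthogonalized) Krylov basis {b, Ab, A^2 b, ...}.
std::vector<Vec> krylovBasis(const MatVec& applyA, const Vec& b, int m) {
  std::vector<Vec> K;
  K.push_back(b);
  for (int j = 1; j < m; ++j) {
    Vec next(b.size());
    applyA(K.back(), next);  // the only contact with the matrix format
    K.push_back(next);
  }
  return K;
}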

Introduction: Structure vs. Tradeoffs
This is how treecodes work:
- Hierarchical algorithms have a common structure
- Different analytical and geometric decisions depend upon the problem configuration and the accuracy requirements

Introduction: Structure vs. Tradeoffs
This is how biology works:
- For ion channels, Nature uses the same protein building blocks and energetic balances
- Different energy terms predominate for different uses

Introduction: Representation Hierarchy
Divide the work into levels:
- Model
- Algorithm
- Implementation

Representation Hierarchy: Spiral Project
- Model: Discrete Fourier Transform (DSP)
- Algorithm: Fast Fourier Transform (SPL)
- Implementation: C Implementation (SPL Compiler)

Representation Hierarchy: FLAME Project
- Model: Abstract LA (PME/Invariants)
- Algorithm: Basic LA (FLAME/FLASH)
- Implementation: Scheduling (SuperMatrix)

Representation Hierarchy: FEniCS Project
- Model: Navier-Stokes (FFC)
- Algorithm: Finite Element (FIAT)
- Implementation: Integration/Assembly (FErari)

Representation Hierarchy: Treecodes
- Model: Kernels with decay (Coulomb)
- Algorithm: Treecodes (PetFMM)
- Implementation: Scheduling (PetFMM-GPU)
Each level demands a strong abstraction layer.

Spiral
Spiral Team:
- Uses an intermediate language, SPL, and then generates C
- Works by circumscribing the algorithmic domain

FLAME & FLASH
[Figure: performance of the matrix-matrix product (C = C + A*B), GFLOPS vs. matrix size; algorithm-by-blocks on four T10 processors vs. CUBLAS sgemm on a single T10 processor vs. MKL sgemm on an Intel Xeon QuadCore (4 cores). Robert van de Geijn]
- FLAME is an Algorithm-By-Blocks interface
- FLASH/SuperMatrix is a runtime system

Outline: Short Introduction to FMM
- Spatial Decomposition
- Data Decomposition

FMM in Sieve (Spatial Decomposition)
- The Quadtree is a Sieve with optimized operations
- Multipoles are stored in Sections
- Two Overlaps are defined: Neighbors and the Interaction List
- Completion moves data for Neighbors and for the Interaction List

Tree Interface
- locateBlob(blob): locate a point in the tree
- fillNeighbors(): compute the neighbor section
- findInteractionList(): compute the interaction-list cell section and allocate the value section
- fillInteractionList(level): compute the interaction-list value section
- fill(blobs): compute the blob section
- dump(): produce a verifiable representation of the tree
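
A C++ rendering of this interface as a sketch; the casing, types, and the Blob struct are assumptions mirroring the slide, not PetFMM's exact declarations.

#include <string>
#include <vector>

struct Blob { double x, y; double charge; };  // assumed 2D particle type

class Quadtree {
public:
  // Locate the leaf cell containing a point.
  virtual int  locateBlob(const Blob& blob) const = 0;
  // Compute the neighbor section for every cell.
  virtual void fillNeighbors() = 0;
  // Compute the interaction-list cell section and allocate values.
  virtual void findInteractionList() = 0;
  // Fill the interaction-list value section for one level.
  virtual void fillInteractionList(int level) = 0;
  // Bin the blobs into the blob section.
  virtual void fill(const std::vector<Blob>& blobs) = 0;
  // Produce a verifiable representation of the tree (for testing).
  virtual std::string dump() const = 0;
  virtual ~Quadtree() = default;
};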

Outline: Data Decomposition

FMM Sections (Data Decomposition)
FMM requires data over the Quadtree, distributed by:
- box: box centers, neighbors
- box + neighbors: blobs
- box + interaction list: interaction-list cells and values
- multipole and local coefficients
Notice this is multiscale, since data is divided at each level.
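
A minimal sketch of the Section idea (assumed layout, not the actual Sieve implementation): a Section maps each box to a variable-length run of values in one contiguous array, which is exactly the per-level, per-box layout needed for centers, blobs, and coefficients.

#include <vector>

template <typename Value>
class Section {
  std::vector<int>   offset_;  // values for cell c live in
  std::vector<Value> values_;  // values_[offset_[c] .. offset_[c+1])
public:
  explicit Section(const std::vector<int>& fiberDims) {
    offset_.assign(1, 0);
    for (int d : fiberDims) offset_.push_back(offset_.back() + d);
    values_.resize(offset_.back());
  }
  int getFiberDimension(int cell) const {
    return offset_[cell + 1] - offset_[cell];
  }
  // restrict(): read-only view of a cell's values.
  const Value* restrictCell(int cell) const {
    return &values_[offset_[cell]];
  }
  // update(): overwrite a cell's values.
  void update(int cell, const Value* v) {
    for (int i = 0; i < getFiberDimension(cell); ++i)
      values_[offset_[cell] + i] = v[i];
  }
};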

Outline: Serial Implementation

Serial Implementation: Evaluator Interface
- initializeExpansions(tree, blobInfo): generate multipole expansions on the lowest level. Requires a loop over cells, O(p).
- upwardSweep(tree): translate multipole expansions to intermediate levels. Requires a loop over cells and children (support), O(p^2).
- downwardSweep(tree): convert multipole to local expansions and translate local expansions on intermediate levels. Requires a loop over cells and parent (cone), O(p^2).

Serial Implementation: Evaluator Interface (continued)
- evaluateBlobs(tree, blobInfo): evaluate direct and local field interactions on the lowest level. Requires a loop over cells and neighbors (in section), O(p^2).
- evaluate(tree, blobs, blobInfo): calculate the complete interaction (multipole + direct); a sketch of this composition follows.
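
A minimal sketch of how evaluate() sequences the phases above. The signatures are assumptions; the slides name the phases but not the exact types.

template <typename Tree, typename BlobInfo>
void initializeExpansions(Tree&, BlobInfo&);   // P2M on the finest level
template <typename Tree> void upwardSweep(Tree&);    // M2M over children
template <typename Tree> void downwardSweep(Tree&);  // M2L, then L2L from parents
template <typename Tree, typename BlobInfo>
void evaluateBlobs(Tree&, BlobInfo&);          // L2P plus direct neighbor sums

// evaluate() is nothing more than the canonical FMM sweep order.
template <typename Tree, typename Blobs, typename BlobInfo>
void evaluate(Tree& tree, Blobs& /*blobs*/, BlobInfo& blobInfo) {
  initializeExpansions(tree, blobInfo);
  upwardSweep(tree);
  downwardSweep(tree);
  evaluateBlobs(tree, blobInfo);
}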

Serial Implementation: Kernel Interface
- P2M(t): multipole expansion coefficients
- L2P(t): local expansion coefficients
- M2M(t): multipole-to-multipole translation
- M2L(t): multipole-to-local translation
- L2L(t): local-to-local translation
- evaluate(blobs): direct interaction
The Evaluator is templated over the Kernel. There are alternative kernel-independent methods (kifmm3d).
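
The "Evaluator templated over Kernel" split can be pictured as below. The Tree methods and coefficient handling are assumptions; a kernel-independent variant (cf. kifmm3d) would simply supply a different Kernel type.

template <typename Kernel>
class Evaluator {
  Kernel k_;
public:
  // The traversal is kernel-agnostic: children are visited before
  // parents, and only the Kernel interprets the coefficients.
  template <typename Tree>
  void upwardSweep(Tree& tree) {
    for (auto cell : tree.postorder())
      for (auto child : tree.children(cell))
        k_.M2M(tree.center(child) - tree.center(cell),
               tree.multipole(child), tree.multipole(cell));
  }
  // downwardSweep() pairs M2L over the interaction list with L2L from
  // the parent; evaluateBlobs() pairs L2P with the direct evaluate().
};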

Outline: Complexity Analysis

Complexity Analysis: Greengard & Gropp
For a shared-memory machine,

  T = a N/P + b log_4 P + c N/(B P) + d N B/P + e(N, P)   (1)

where N is the number of particles, P the number of processors, and B the number of particles per box. The terms account for:
1. Initializing multipole expansions, the finest local expansions, and the final sum
2. The reduction bottleneck
3. Translation and multipole-to-local
4. Direct interaction
5. Low-order terms

A Parallel Version of the Fast Multipole Method, L. Greengard and W.D. Gropp, Comp. Math. Appl., 20(7), 1990.

Complexity Analysis: Distributed FMM
Additions for distributed computing:
- Partitioning: an explicit optimization problem to minimize communication volume and load imbalance
- Uses PETSc Sieve for parallelism

Outline: Multicore Computing

Question: What is the optimal number of particles per cell?
- Greengard & Gropp: minimize time and maximize parallel efficiency; B_opt = sqrt(c/d) = 30
- Gumerov & Duraiswami: follow GG, but also try to account for memory access; B_opt = 91, but instead they choose 320, which heavily weights the N^2 part of the computation
- We propose to cover up the bottleneck with direct evaluations
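
The GG optimum follows from (1) by treating B as the free parameter; only the multipole-to-local and direct terms depend on B:

  dT/dB = -c N/(B^2 P) + d N/P = 0  =>  B_opt = sqrt(c/d)

which with their measured constants gives B_opt = 30.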

Problem: Missing Concurrency
We can balance time in the direct evaluation against idle time for small grids.
- The direct evaluation takes time d N B/p
- Assume a single thread group works on the first L tree levels; serializing those levels costs time proportional to b 4^(L+1), since the number of cells down to level L is O(4^(L+1))
- Thus we need

  B >= b 4^(L+1) p / (d N)   (2)

in order to cover the bottleneck. In an upcoming publication, we show that this bound holds for all modern processors.

Problem: Missing Bandwidth
We can restructure the M2L to conserve bandwidth:
- Matrix-free application of M2L
- Reorganize the traversal to minimize bandwidth
- Old: pull in the 27 interaction MEs, transform to an LE, reduce
- New: pull in the cell ME, transform to 27 interaction LEs, partially reduce

Matrix-Free M2L
The M2L transformation applies the operator

  M_ij = (-1)^i t^{-(i+j+1)} C(i+j, j)   (3)

Notice that the exponent of t is constant along each antidiagonal (constant i + j). Thus we:
- divide by t once at each antidiagonal
- calculate the binomial coefficients C_ij by the recurrence along each antidiagonal
- carefully formulate complex division (the STL fails here)
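
A host-side sketch of the matrix-free application of (3); the coefficient convention is an assumption, and the real PetFMM operator may differ in signs and offsets. It walks antidiagonals s = i + j, divides the power of t once per antidiagonal, generates the binomials by the recurrence C(s, j) = C(s, j-1) (s - j + 1)/j, and uses Smith's algorithm for the complex division that naive (and some STL) implementations handle poorly.

#include <cmath>
#include <complex>
#include <vector>

using Cplx = std::complex<double>;

// Smith's algorithm: avoids the overflow/underflow of naive division.
static Cplx smithDiv(Cplx a, Cplx b) {
  double br = b.real(), bi = b.imag();
  if (std::fabs(br) >= std::fabs(bi)) {
    double r = bi / br, d = br + bi * r;
    return {(a.real() + a.imag() * r) / d, (a.imag() - a.real() * r) / d};
  }
  double r = br / bi, d = bi + br * r;
  return {(a.real() * r + a.imag()) / d, (a.imag() * r - a.real()) / d};
}

// L_j = sum_i M_ij a_i with M_ij = (-1)^i t^{-(i+j+1)} C(i+j, j),
// applied without ever forming the matrix M.
std::vector<Cplx> applyM2L(const std::vector<Cplx>& a, Cplx t) {
  int p = static_cast<int>(a.size());
  std::vector<Cplx> L(p, Cplx(0, 0));
  Cplx tPow = smithDiv(Cplx(1, 0), t);           // t^{-(s+1)} at s = 0
  for (int s = 0; s <= 2 * (p - 1); ++s) {
    double binom = 1.0;                          // C(s, 0)
    for (int j = 0; j <= s; ++j) {
      if (j > 0) binom *= double(s - j + 1) / j; // C(s, j) by recurrence
      int i = s - j;
      if (i < p && j < p) {
        double sign = (i % 2 == 0) ? 1.0 : -1.0; // (-1)^i
        L[j] += sign * binom * tPow * a[i];
      }
    }
    tPow = smithDiv(tPow, t);                    // next antidiagonal
  }
  return L;
}

Here t would be the complex offset from the source cell center to the target cell center.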

Outline: An Interface for Multicore Programs
- FLASH
- PetFMM


FLASH Design
- FLASH enables multicore computing through FLAME
- The LA interface is identical to FLAME
- FLAME executes operations immediately
- FLASH queues operations, and executes the queue on a user call (which does nothing in FLAME)

Cholesky Factorization (FLASH)

FLA_Part_2x2(A, &ATL, &ATR,
                &ABL, &ABR, 0, 0, FLA_TL);
while (FLA_Object_length(ATL) < FLA_Object_length(A)) {
  FLA_Repart_2x2_to_3x3(ATL, ATR, &A00, &A01, &A02,
                                  &A10, &A11, &A12,
                        ABL, ABR, &A20, &A21, &A22, 1, 1, FLA_BR);
  FLASH_Chol(FLA_UPPER_TRIANGULAR, A11);
  FLASH_Trsm(FLA_LEFT, FLA_UPPER_TRIANGULAR, FLA_TRANSPOSE,
             FLA_NONUNIT_DIAG, FLA_ONE, A11, A12);
  FLASH_Syrk(FLA_UPPER_TRIANGULAR, FLA_TRANSPOSE,
             FLA_MINUS_ONE, A12, FLA_ONE, A22);
  FLA_Cont_with_3x3_to_2x2(&ATL, &ATR, A00, A01, A02,
                                       A10, A11, A12,
                           &ABL, &ABR, A20, A21, A22, FLA_TL);
}
FLA_Queue_exec();


PetFMM-GPU
We break down sweep operations into Tasks:
- Cell loops are now tiled
- Tasks are queued
- We can form a DAG, since we know the dependence structure, so scheduling is possible
This asynchronous interface can enable:
- Overlapping direct and multipole calculations
- Reorganizing the downward sweep
- Adaptive expansions

GPU Classes
Section:
- size(): returns the number of values
- getFiberDimension(cell): returns the number of cell values
- restrict()/update(): retrieves and changes cell values
- clone()/extract(): converts between CPU and GPU objects
Evaluator:
- initializeExpansions()
- upwardSweep()
- downwardSweepTransform()
- downwardSweepTranslate()
- evaluateBlobs()
- evaluate()

GPU Classes (continued)
Task:
- input data size
- output data size
- dependencies (future)
TaskQueue:
- manages storage and offsets
- evaluate()
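
A sketch of the Task/TaskQueue pair; the fields mirror the slide, everything else is an assumption. Tasks record their data footprint so the queue can lay out contiguous GPU buffers, and the recorded dependencies are what would later induce the DAG.

#include <cstddef>
#include <vector>

struct Task {
  std::size_t inputSize;           // bytes read by the task
  std::size_t outputSize;          // bytes written by the task
  std::vector<int> dependencies;   // prerequisite task ids (future: DAG)
};

class TaskQueue {
  std::vector<Task>        tasks_;
  std::vector<std::size_t> inOffset_, outOffset_;  // into the GPU buffers
  std::size_t inBytes_ = 0, outBytes_ = 0;
public:
  // Enqueue a task and assign its slice of the contiguous buffers.
  int enqueue(const Task& t) {
    inOffset_.push_back(inBytes_);    inBytes_  += t.inputSize;
    outOffset_.push_back(outBytes_);  outBytes_ += t.outputSize;
    tasks_.push_back(t);
    return static_cast<int>(tasks_.size()) - 1;
  }
  // evaluate(): copy inputs to the device, launch one kernel per task
  // class (transform, reduce, ...), and copy the results back.
  void evaluate() { /* kernel launches elided in this sketch */ }
};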

Tasks
- Upward Sweep Task (cell block): in: cell and child centers, child multipole coefficients; out: cell multipole coefficients
- Downward Sweep Transform Task (cell block): in: cell and interaction-list centers, cell multipole coefficients; out: interaction-list temporary local coefficients
- Downward Sweep Reduce Task (cell block): in: interaction-list temporary local coefficients; out: cell temporary local coefficients
- Downward Sweep Expansion Task (cell block): in: cell and parent centers, cell temporary local coefficients, parent local coefficients; out: cell local coefficients
An earlier variant of the Transform Task read the interaction-list multipole coefficients and wrote the cell's temporary local coefficients; the version shown is the bandwidth-conserving "New" traversal from the Missing Bandwidth slide.

Transform Task
Shifts an interaction cell's multipole expansion to the cell's local expansion.
- Add a task for each interaction cell
- All tasks with the same origin are merged
Local memory (float values): 2(p+1) blocksize (Pascal) + 2p blocksize (LE) + 2p (ME)
- 8 terms: 4416 bytes
- 17 terms: 9096 bytes
Execution:
- 1 block per ME
- Each thread reads a section of the ME and the ME center
- Each thread computes an LE separately
- Each thread writes its LE to a separate global location
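
A CUDA sketch of this execution plan, simplified so that one thread computes the full LE for one interaction cell directly from operator (3); the names, float2 coefficients, and data layout are assumptions. It would be launched as transformTask<<<numME, threads, p * sizeof(float2)>>> with threads >= max(p, nIlist).

#include <cuda_runtime.h>

__device__ float2 cmul(float2 a, float2 b) {
  return make_float2(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x);
}
__device__ float2 cdiv(float2 a, float2 b) {  // naive; cf. Smith's algorithm
  float d = b.x * b.x + b.y * b.y;
  return make_float2((a.x * b.x + a.y * b.y) / d,
                     (a.y * b.x - a.x * b.y) / d);
}

// me: [numME][p] multipole expansions, one per block
// t:  [numME][nIlist] complex offsets to each interaction cell
// le: [numME][nIlist][p] output local expansions (reduced later)
__global__ void transformTask(const float2* me, const float2* t,
                              float2* le, int p, int nIlist) {
  extern __shared__ float2 sme[];                // this block's ME
  const float2* myME = me + blockIdx.x * p;
  for (int i = threadIdx.x; i < p; i += blockDim.x)
    sme[i] = myME[i];                            // stage the ME in shared
  __syncthreads();
  if (threadIdx.x >= nIlist) return;
  float2 tc = t[blockIdx.x * nIlist + threadIdx.x];
  float2* myLE = le + (blockIdx.x * nIlist + threadIdx.x) * p;
  // L_j = sum_i (-1)^i C(i+j, j) t^{-(i+j+1)} a_i  (direct, O(p^2))
  for (int j = 0; j < p; ++j) {
    float2 sum  = make_float2(0.f, 0.f);
    float  binom = 1.f;                          // C(j, j) at i = 0
    float2 tpow = make_float2(1.f, 0.f);
    for (int k = 0; k <= j; ++k) tpow = cdiv(tpow, tc);  // t^{-(j+1)}
    for (int i = 0; i < p; ++i) {
      if (i > 0) { binom *= float(i + j) / i; tpow = cdiv(tpow, tc); }
      float  s    = (i & 1) ? -1.f : 1.f;        // (-1)^i
      float2 term = cmul(tpow, sme[i]);
      sum.x += s * binom * term.x;
      sum.y += s * binom * term.y;
    }
    myLE[j] = sum;                               // separate global slot
  }
}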

Reduce Task
Adds up the local expansion contributions from each interaction cell.
- Add a task for each cell
Local memory (float values): 2 * terms (LE)
- 8 terms: 64 bytes
- 17 terms: 136 bytes
Execution:
- 1 block per output LE
- Each thread reads a section of an input LE
- Each thread adds to the shared output LE
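
A CUDA sketch of the reduction, assuming the temporary LEs for each output cell have been gathered contiguously; here each thread owns one coefficient index, a slight simplification of the shared-memory accumulation described above. Launched as reduceTask<<<numCells, p>>>.

// tempLE: [numCells][nIlist][p] temporary local expansions
// cellLE: [numCells][p] reduced local expansion per cell
__global__ void reduceTask(const float2* tempLE, float2* cellLE,
                           int p, int nIlist) {
  int j = threadIdx.x;                 // coefficient index
  if (j >= p) return;
  float2 sum = make_float2(0.f, 0.f);
  const float2* base = tempLE + blockIdx.x * nIlist * p;
  for (int c = 0; c < nIlist; ++c) {   // up to 27 cells for a 2D quadtree
    sum.x += base[c * p + j].x;
    sum.y += base[c * p + j].y;
  }
  cellLE[blockIdx.x * p + j] = sum;
}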

What's Important? (Conclusions)
Interface improvements bring concrete benefits.
- Facilitated code reuse: the serial code was largely reused, and the test infrastructure was completely reused
- Opportunities for performance improvement: overlapping computations, better task scheduling
- Expansion of capabilities: we could now combine the distributed and multicore implementations, and could replace local expansions with cheaper alternatives

Distributed FMM: Parallel Tree Implementation
- Divide the tree into a root tree and local trees
- Distribute the local trees among processes
- Provide a communication pattern for local sections (overlap): both neighbor and interaction-list overlaps
- Sieve generates MPI from a high-level description

Distributed FMM: Parallel Tree Implementation
How should we distribute trees?
- Multiple local trees per process allows good load balance
- Partition a weighted graph to minimize load imbalance and communication
Computation estimate (p expansion terms, N_i particles in cell i, n_I interaction-list cells, n_c children, dimension d):
- Leaf: N_i p (P2M) + n_I p^2 (M2L) + N_i p (L2P) + 3^d N_i^2 (P2P)
- Interior: n_c p^2 (M2M) + n_I p^2 (M2L) + n_c p^2 (L2L)
Communication estimate (for a cell at level k of an L-level tree):
- Diagonal: n_c (L - k - 1)
- Lateral: 2^d (2^(m(L-k-1)) - 1) / (2^m - 1) for incidence dimension m
Leverage existing work on graph partitioning: ParMetis
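
As an illustration, the computation estimates above translate directly into per-cell vertex weights for the partitioner; a sketch with assumed variable names:

#include <cmath>

// Ni: particles in leaf i, nI: interaction-list size, nc: children,
// p: expansion terms, d: spatial dimension (2 for a quadtree).
double leafWeight(double Ni, double nI, double p, int d) {
  return Ni * p                        // P2M
       + nI * p * p                    // M2L
       + Ni * p                        // L2P
       + std::pow(3.0, d) * Ni * Ni;   // P2P over the 3^d neighborhood
}

double interiorWeight(double nc, double nI, double p) {
  return nc * p * p                    // M2M
       + nI * p * p                    // M2L
       + nc * p * p;                   // L2L
}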

Distributed FMM: Why should a good partition exist?
Shang-Hua Teng, Provably good partitioning and load balancing algorithms for parallel adaptive N-body simulation, SIAM J. Sci. Comput., 19(2), 1998.
Good partitions exist for non-uniform distributions:
- 2D: O(sqrt(n) (log n)^(3/2)) edge cut
- 3D: O(n^(2/3) (log n)^(4/3)) edge cut
- As scalable as regular grids
- As efficient as uniform distributions
ParMetis will find a nearly optimal partition.

Distributed FMM: Will ParMetis find it?
George Karypis and Vipin Kumar, Analysis of Multilevel Graph Partitioning, Supercomputing, 1995.
Good partitions exist for non-uniform distributions:
- 2D: C_i = 1.24^i C_0 for random matching
- 3D: C_i = 1.21^i C_0 ?? for random matching; the 3D proof needs assurance that the average degree does not increase
- Efficient in practice

Distributed FMM: Advantages
- Simplicity
- Complete serial code reuse
- Provably good performance and scalability

Distributed FMM: Parallel Tree Interface
- fillNeighbors(): compute the neighbor overlap, and send neighbors
- findInteractionList(): compute the interaction-list overlap
- fillInteractionList(level): complete and copy into local interaction sections
- fill(blobs): now must scatter blobs to local trees; uses scatterBlobs() and gatherBlobs()

Distributed FMM: Parallel Data Movement
1. Complete the neighbor section
2. Upward sweep: upward sweep on the local trees, gather to the root tree, upward sweep on the root tree
3. Complete the interaction-list section
4. Downward sweep: downward sweep on the root tree, scatter to the local trees, downward sweep on the local trees

Distributed FMM: Parallel Evaluator Interface
- initializeExpansions(local trees, blobInfo): evaluate each local tree
- upwardSweep(local trees, partition, root tree): evaluate each local tree, then gather to the root tree
- downwardSweep(local trees, partition, root tree): scatter from the root tree, then evaluate each local tree
- evaluateBlobs(local trees, blobInfo): evaluate on all local trees
- evaluate(tree, blobs, blobInfo): identical to the serial version
