Flexible Hardware Mapping for Finite Element Simulations on Hybrid CPU/GPU Clusters

1 Flexible Hardware Mapping for Finite Element Simulations on Hybrid CPU/GPU Clusters
Aaron Becker, Isaac Dooley, Laxmikant Kale
SAAHPC, July, Champaign-Urbana, IL

2 Target Application
- Inhomogeneous material simulation
- 3D finite elements (tetrahedra), explicit structural dynamics
- Simple kernels compute forces on each element at every time step
- Stiffness matrix varies with location in the material (very memory intensive)
- Existing Charm++/ParFUM application works on traditional clusters

3 Target Hardware: NCSA Lincoln
- 2x Intel Harpertown E5410 CPUs
- 1/2 of a Tesla S1070 (2 GPUs)
- Infiniband interconnect
- 192 nodes
- Runs spanning Abe and Lincoln nodes may be possible in the future
Multiple powerful CPUs and GPUs on each node. How do we take advantage of all of it?

4 Approach
- Over-decompose the mesh into many partitions per node
- Write GPU and CPU implementations of computational kernels
- Each partition can be handled by either CPU or GPU
- Choose a mapping of partitions to hardware that maximizes utilization
- Partitioning, ghost management, and synchronization are handled by ParFUM on the CPU
Goal: flexibility in number/size of partitions and assignment of partitions to hardware
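One way to picture "choose a mapping of partitions to hardware that maximizes utilization" is a greedy assignment that always hands the next partition to whichever resource would finish it earliest. This is only an illustrative sketch, not the mapping strategy used in the talk: the `Resource` struct, the `throughput` cost model, and `map_partitions` are hypothetical names introduced here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one compute resource (a CPU core or a GPU).
struct Resource {
    bool is_gpu;        // true for a GPU, false for a CPU core
    double throughput;  // assumed relative speed (work units per second)
    double busy_time;   // time already committed to assigned partitions
};

// Greedy mapping (an assumption, not the authors' method): each partition
// goes to the resource that would finish it earliest.
// Requires at least one resource. Returns the resource index per partition.
std::vector<std::size_t> map_partitions(const std::vector<double>& work,
                                        std::vector<Resource>& resources) {
    std::vector<std::size_t> owner(work.size(), 0);
    for (std::size_t p = 0; p < work.size(); ++p) {
        std::size_t best = 0;
        double best_finish =
            resources[0].busy_time + work[p] / resources[0].throughput;
        for (std::size_t r = 1; r < resources.size(); ++r) {
            double finish =
                resources[r].busy_time + work[p] / resources[r].throughput;
            if (finish < best_finish) {
                best = r;
                best_finish = finish;
            }
        }
        resources[best].busy_time = best_finish;
        owner[p] = best;
    }
    return owner;
}
```

With one CPU core (throughput 1) and one GPU (throughput 4), partitions of size 8, 2, 2 would go to the GPU, the CPU core, and the GPU respectively. Because the mesh is over-decomposed, the partitions are small enough that such a scheme can balance the two kinds of hardware fairly finely.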

5 [Flowchart of one time step]
Physics:
- Compute nodal forces from displacements using the stiffness matrix
- Update velocity & acceleration
- Impose boundary conditions
- Update nodal displacements
Synchronization:
- Copy force data from GPU
- Sum forces on shared nodes
- Copy force data to GPU
Loop: repeat if target time not reached
Legend: routines that can run on either CPU or GPU; GPU-specific routines; CPU-specific routines
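The time-step structure can be sketched on a toy problem. The example below replaces the tetrahedral mesh with a 1D chain of equal masses and springs, runs on a single partition (so the synchronization stage is only a comment), and assumes the ordering forces, synchronization, velocity/acceleration, boundary conditions, displacements. The `Chain` struct and all function names are hypothetical; only the velocity update formula is taken from the kernels shown in this talk.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for a mesh partition: a chain of nodes joined by springs.
struct Chain {
    std::vector<double> d, v, a, F;  // displacement, velocity, acceleration, force
    double k;                        // stiffness of every spring
    double mass;                     // mass of every node
};

// Element kernel: spring e joins nodes e and e+1 and loads both of them.
void compute_forces(Chain& c) {
    for (double& f : c.F) f = 0.0;
    for (std::size_t e = 0; e + 1 < c.d.size(); ++e) {
        double stretch = c.d[e + 1] - c.d[e];
        c.F[e] += c.k * stretch;
        c.F[e + 1] -= c.k * stretch;
    }
}

// Node kernel: same acceleration/velocity update as the slide code.
void update_kinematics(Chain& c, double dt) {
    for (std::size_t n = 0; n < c.d.size(); ++n) {
        double a_old = c.a[n];
        c.a[n] = c.F[n] / c.mass;
        c.v[n] += 0.5 * dt * (c.a[n] + a_old);
    }
}

// Boundary condition (assumed for this example): node 0 is clamped.
void impose_boundary_conditions(Chain& c) {
    c.d[0] = 0.0;
    c.v[0] = 0.0;
    c.a[0] = 0.0;
}

void update_displacements(Chain& c, double dt) {
    for (std::size_t n = 0; n < c.d.size(); ++n)
        c.d[n] += dt * c.v[n] + 0.5 * dt * dt * c.a[n];
}

// Main loop: repeat until the target time is reached. Returns the step count.
int run(Chain& c, double dt, double t_end) {
    int steps = 0;
    for (double t = 0.0; t < t_end; t += dt) {
        compute_forces(c);
        // Synchronization would go here: copy forces from the GPU, sum
        // forces on nodes shared between partitions, copy them back.
        update_kinematics(c, dt);
        impose_boundary_conditions(c);
        update_displacements(c, dt);
        ++steps;
    }
    return steps;
}
```

Starting from a chain whose last node is displaced, the clamped node stays at zero while the stretched spring pulls its neighbours toward each other over the first few steps.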

6 ParFUM Hybrid API
- Management of host/device node and element data
- Compatible API for writing CPU and GPU kernels
- On CPU: loop over nodes or elements using iterators
- On GPU: each thread is responsible for one node or element
- Functions for inter-partition synchronization

7 ParFUM Hybrid API

    nodeiterator itr;
    /* CPU version: visit every node through an iterator */
    for (nodeitr_begin(itr); nodeitr_isvalid(itr); nodeitr_next(itr)) {
        n_data = node_getdata(itr);
        for (int i=0; i<dof; ++i) {
            float a_old = n_data->a[i];
            /* new acceleration from force and mass */
            n_data->a[i] = n_data->F[i] / n_data->mass;
            /* velocity update using the average of old and new acceleration */
            n_data->v[i] += 0.5 * dt * (n_data->a[i] + a_old);
        }
    }

9 ParFUM Hybrid API

    /* GPU version: each thread handles one node, my_node */
    n_data = node_gpu_getdata(my_node);
    for (int i=0; i<dof; ++i) {
        float a_old = n_data->a[i];
        n_data->a[i] = n_data->F[i] / n_data->mass;
        n_data->v[i] += 0.5 * dt * (n_data->a[i] + a_old);
    }
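The two kernel styles differ only in how a node is selected, so the per-node arithmetic can be written once and reused. The sketch below shows that idea in plain C++ under stated assumptions: `NodeData`, `kDof`, `update_node`, `update_all_cpu`, and `update_one_gpu_style` are hypothetical names, not part of the ParFUM API, and the "GPU style" function only emulates what a single thread would do for its own node index.

```cpp
#include <cstddef>
#include <vector>

constexpr int kDof = 3;  // assumed degrees of freedom per node

// Hypothetical per-node record with the fields used by the slide code.
struct NodeData {
    float a[kDof];  // acceleration
    float v[kDof];  // velocity
    float F[kDof];  // force
    float mass;
};

// Shared body: the arithmetic is identical for CPU and GPU kernels.
inline void update_node(NodeData& n, float dt) {
    for (int i = 0; i < kDof; ++i) {
        float a_old = n.a[i];
        n.a[i] = n.F[i] / n.mass;
        n.v[i] += 0.5f * dt * (n.a[i] + a_old);
    }
}

// CPU style: one loop visits every node.
void update_all_cpu(std::vector<NodeData>& nodes, float dt) {
    for (NodeData& n : nodes) update_node(n, dt);
}

// GPU style: the work of a single thread responsible for node my_node.
// On a real device the index would come from the thread and block IDs.
void update_one_gpu_style(NodeData* nodes, std::size_t node_count,
                          std::size_t my_node, float dt) {
    if (my_node < node_count) update_node(nodes[my_node], dt);
}
```

Calling `update_one_gpu_style` once per node index produces the same result as one call to `update_all_cpu`, which is the property a compatible CPU/GPU kernel API relies on.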

11 Managing Data Races
- Independent GPU threads introduce races when they update common data structures (e.g. nodes updating shared element quantities)
- Perform each write into a separate slot in the element data structure, then accumulate the values in the next element kernel
- Possible alternative: graph coloring
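The separate-slot idea can be sketched as follows: every element keeps one slot per local node, each node writes only into its own slot, and the following element pass sums the slots. No two writers ever touch the same memory location, so the result does not depend on execution order. The `Element` layout and the function names are hypothetical, and the linear scan over all elements stands in for the node-to-element adjacency list a real code would use.

```cpp
#include <vector>

constexpr int kNodesPerElem = 4;  // tetrahedra have four nodes

// Hypothetical element record with one private slot per local node.
struct Element {
    int node[kNodesPerElem];    // global IDs of the element's nodes
    float slot[kNodesPerElem];  // slot[j] is written only by node[j]
    float total;                // accumulated value, filled by the element pass
};

// Node pass: the work of the thread that owns node n. It writes its
// contribution into its own slot of every element that contains it.
void node_kernel(std::vector<Element>& elems, int n, float contribution) {
    for (Element& e : elems)
        for (int j = 0; j < kNodesPerElem; ++j)
            if (e.node[j] == n) e.slot[j] = contribution;
}

// Element pass: each element sums its own slots, so no races occur here either.
void element_kernel(Element& e) {
    float sum = 0.0f;
    for (int j = 0; j < kNodesPerElem; ++j) sum += e.slot[j];
    e.total = sum;
}
```

The cost is extra memory (one slot per element-node pair) in exchange for avoiding atomics or a coloring step.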

12 Managing GPU Partitions
- Each GPU partition needs to synchronize with the host at each step
- We expect a large number of GPU partitions, so how will they be managed?
[Figure: CPU partitions and GPU partitions on one node; legend entries: Partitions, Normal, GPU Manager]

13 Mapping and Load Balance
- CPU partitions (2 per core)
- GPU partitions (34 per node)
[Figure: mapping of partitions to hardware; visible label: 16]

16 Optimization
- Pack: identify the data needed for synchronization and copy only that data between host and device
- Async: run all memory transfers and kernels asynchronously (enables overlap)
- Overlap: let the GPU manager cores run CPU partitions while waiting for GPU partitions
[Figure: normalized performance for Baseline, Pack, Async, and Overlap]
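The Pack step can be illustrated with a gather/scatter pair: only the values on shared nodes are copied into a contiguous buffer, that buffer is what would cross the host/device boundary, and the summed values are scattered back afterwards. This is a minimal host-only sketch with one value per node. `pack_shared`, `unpack_shared`, and the `shared` index list are hypothetical names, and the actual device transfer is left out.

```cpp
#include <cstddef>
#include <vector>

// Gather the values of the shared nodes into a small contiguous buffer.
// Only this buffer would need to be copied between device and host.
std::vector<float> pack_shared(const std::vector<float>& field,
                               const std::vector<std::size_t>& shared) {
    std::vector<float> buffer(shared.size());
    for (std::size_t i = 0; i < shared.size(); ++i)
        buffer[i] = field[shared[i]];
    return buffer;
}

// Scatter the (already summed) shared-node values back into the full field.
void unpack_shared(const std::vector<float>& buffer,
                   const std::vector<std::size_t>& shared,
                   std::vector<float>& field) {
    for (std::size_t i = 0; i < shared.size(); ++i)
        field[shared[i]] = buffer[i];
}
```

For a partition whose shared nodes are a small fraction of all nodes, the packed buffer is correspondingly smaller than the full force array, which is where the saving in transfer volume comes from.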

17 Characterizing Performance
[Figure: pie charts. Labels: Read Data, Compute Forces, Writing Data, Other, GPU. Values: 12%, 24%, 24%, 39%, 37%, 54%, 9%]

18 Scaling
[Figure: speedup over 1 node vs. number of nodes, single precision and double precision; 97% and 99% parallel efficiency]
