Data partitioning and MPI adjoints. Pavanakumar Mohanamuraly Jens D. Mueller

1 Data partitioning and MPI adjoints Pavanakumar Mohanamuraly Jens D. Mueller

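2 SCHEMA
Motivation, Problem and solution, Results
PDE constraint optimisation: $\min_a J(u, a)$ s.t. $R(u, a) = 0$
Primal and adjoint fixed point iteration:
$u^{k+1} = u^k - M^{-1} R(u^k, a)$
$\bar{u}^{k+1} = \bar{u}^k - M^{-T} \left[ \left(\frac{\partial R}{\partial u}\right)^T \bar{u}^k - \left(\frac{\partial J}{\partial u}\right)^T \right]$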

5 GRADIENT VIA FINITE DIFFERENCE

9 PARTITIONING STRATEGY

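17 Primal
f_a = f_b - f_a
f_b = f_c - f_a
f_c = f_c - f_b
$\begin{bmatrix} f_a \\ f_b \\ f_c \end{bmatrix} = \begin{bmatrix} -1 & 1 & 0 \\ -1 & 0 & 1 \\ 0 & -1 & 1 \end{bmatrix} \begin{bmatrix} f_a \\ f_b \\ f_c \end{bmatrix}$
Adjoint
f̄_c += f̄_b + f̄_c
f̄_b += f̄_a - f̄_c
f̄_a += -f̄_a - f̄_b
f̄_a = f̄_b = f̄_c = 0
$\begin{bmatrix} \bar{f}_a \\ \bar{f}_b \\ \bar{f}_c \end{bmatrix} \mathrel{+}= \begin{bmatrix} -1 & -1 & 0 \\ 1 & 0 & -1 \\ 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} \bar{f}_a \\ \bar{f}_b \\ \bar{f}_c \end{bmatrix}$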
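The primal/adjoint pair on this slide can be checked with a dot-product test. The sketch below assumes the three-node stencil f_a ← f_b - f_a, f_b ← f_c - f_a, f_c ← f_c - f_b (the signs are a reading of the slide, not something it states explicitly) and treats the adjoint as multiplication by the transposed matrix. The helper names `matvec`, `transpose`, `primal`, `adjoint` and `dot` are illustrative only.

```python
# Assumed three-node stencil written as a matrix acting on (f_a, f_b, f_c).
A = [
    [-1, 1, 0],   # f_a <- f_b - f_a
    [-1, 0, 1],   # f_b <- f_c - f_a
    [0, -1, 1],   # f_c <- f_c - f_b
]

def matvec(M, x):
    """Dense matrix-vector product."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def transpose(M):
    """Transpose of a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def primal(f):
    """Primal update: f <- A f."""
    return matvec(A, f)

def adjoint(fbar):
    """Adjoint update: fbar <- A^T fbar."""
    return matvec(transpose(A), fbar)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

f = [1.0, 2.0, 4.0]    # arbitrary primal input
w = [3.0, -1.0, 2.0]   # arbitrary adjoint seed

lhs = dot(w, primal(f))    # <w, A f>
rhs = dot(adjoint(w), f)   # <A^T w, f>
```

If the adjoint is consistent with the primal, `lhs` and `rhs` agree for any choice of `f` and `w`.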

18 HALO APPROACH
Partition {a} (halo: b)
  Primal: send(f_a); recv(f_b); f_a = f_b - f_a
  Adjoint: f̄_b += f̄_a; f̄_a += -f̄_a; recv(t); f̄_a += t; send(f̄_b)
Partition {b, c} (halo: a)
  Primal: send(f_b); recv(f_a); f_b = f_c - f_a; f_c = f_c - f_b
  Adjoint: f̄_c += f̄_b + f̄_c; f̄_b += -f̄_c; f̄_a += -f̄_b; recv(t); f̄_b += t; send(f̄_a)
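A minimal sketch of the halo exchange and its reversal, simulated in a single process rather than with real MPI calls. It assumes one partition owns node a (with a halo copy of b) and the other owns b and c (with a halo copy of a), and the stencil f_a ← f_b - f_a, f_b ← f_c - f_a, f_c ← f_c - f_b. The function names are hypothetical. The point it illustrates is that each primal send/recv of a halo value becomes, in the adjoint, a message in the opposite direction whose payload is added to the owner's adjoint.

```python
def halo_primal(f_a, f_b, f_c):
    # Halo exchange (simulated): each partition receives a copy of the
    # neighbour's boundary value before computing.
    halo_b = f_b               # first partition: recv(f_b)
    halo_a = f_a               # second partition: recv(f_a)
    new_a = halo_b - f_a       # first partition
    new_b = f_c - halo_a       # second partition
    new_c = f_c - f_b          # second partition
    return new_a, new_b, new_c

def halo_adjoint(fbar_a, fbar_b, fbar_c):
    # First partition: adjoint of f_a = halo_b - f_a.
    own_a = -fbar_a            # contribution to the owned node a
    halo_b_bar = fbar_a        # adjoint collected on the halo copy of b
    # Second partition: adjoint of f_b = f_c - halo_a and f_c = f_c - f_b.
    own_c = fbar_b + fbar_c
    own_b = -fbar_c
    halo_a_bar = -fbar_b       # adjoint collected on the halo copy of a
    # Reverse communication (simulated): halo adjoints travel back to the
    # owning partition and are added there (recv(t); fbar += t).
    out_a = own_a + halo_a_bar
    out_b = own_b + halo_b_bar
    return out_a, out_b, own_c
```

For the seed (3, -1, 2) the simulated adjoint gives (-2, 1, 1), the same values obtained by applying the transpose of the assumed global stencil matrix.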

19 ZERO-HALO APPROACH
Partition {a, b}
  Primal: f_a = f_b - f_a; f_b = -f_a; accumulate(f_b)
  Adjoint: ?
Partition {b, c}
  Primal: f_b = f_c; f_c = f_c - f_b; accumulate(f_b)
  Adjoint: ?

20 WHAT IS ACCUMULATE?
Partition {a, b}: accumulate(f_b) is send(f_b); recv(t); f_b += t
Partition {b, c}: accumulate(f_b) is send(f_b); recv(t); f_b += t

21 EXPECTATION...
Partition {a, b}
  Primal: f_a = f_b - f_a; f_b = -f_a; accumulate(f_b)
  Adjoint: accumulate_b(f̄_b); f̄_b += f̄_a; f̄_a += -f̄_a - f̄_b
Partition {b, c}
  Primal: f_b = f_c; f_c = f_c - f_b; accumulate(f_b)
  Adjoint: accumulate_b(f̄_b); f̄_b += -f̄_c; f̄_c += f̄_b + f̄_c

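22 IN REALITY...
Partition {a, b}
  Primal: $\begin{bmatrix} f_a \\ f_b \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} f_a \\ f_b \end{bmatrix}$
  Adjoint: $\begin{bmatrix} \bar{f}_a \\ \bar{f}_b \end{bmatrix} \mathrel{+}= \begin{bmatrix} -1 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \bar{f}_a \\ \bar{f}_b \end{bmatrix}$
  f̄_a += -f̄_a - f̄_b; f̄_b += f̄_a; accumulate(f̄_b)!
Partition {b, c}
  Primal: $\begin{bmatrix} f_b \\ f_c \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ -1 & 1 \end{bmatrix} \begin{bmatrix} f_b \\ f_c \end{bmatrix}$
  Adjoint: $\begin{bmatrix} \bar{f}_b \\ \bar{f}_c \end{bmatrix} \mathrel{+}= \begin{bmatrix} 0 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} \bar{f}_b \\ \bar{f}_c \end{bmatrix}$
  f̄_c += f̄_b + f̄_c; f̄_b += -f̄_c; accumulate(f̄_b)!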

26 ZERO-HALO APPROACH
Partition {a, b}
  Primal: f_a = f_b - f_a; f_b = -f_a; accumulate(f_b)
  Adjoint: f̄_a += -f̄_a - f̄_b; f̄_b += f̄_a; accumulate(f̄_b)
Partition {b, c}
  Primal: f_b = f_c; f_c = f_c - f_b; accumulate(f_b)
  Adjoint: f̄_c += f̄_b + f̄_c; f̄_b += -f̄_c; accumulate(f̄_b)
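The zero-halo scheme can be sketched by simulating two partitions in one process: one holds nodes (a, b), the other (b, c), and node b is stored on both. The stencil signs (f_a ← f_b - f_a, f_b ← f_c - f_a, f_c ← f_c - f_b) and all function names below are assumptions made for illustration. Each partition computes only its own edge contribution to b, and `accumulate` sums the two partial values; the adjoint applies the transposed local operators and then calls the same `accumulate`.

```python
def accumulate(b_first, b_second):
    """Simulated accumulate: both copies of the shared node get the sum."""
    total = b_first + b_second
    return total, total

def zero_halo_primal(part_ab, part_bc):
    f_a, f_b0 = part_ab
    f_b1, f_c = part_bc
    # Partition {a, b}: edge a-b contributes to f_a and partially to f_b.
    new_a = f_b0 - f_a
    part_b0 = -f_a
    # Partition {b, c}: edge b-c contributes partially to f_b and to f_c.
    part_b1 = f_c
    new_c = f_c - f_b1
    b0, b1 = accumulate(part_b0, part_b1)
    return (new_a, b0), (b1, new_c)

def zero_halo_adjoint(part_ab, part_bc):
    fbar_a, fbar_b0 = part_ab
    fbar_b1, fbar_c = part_bc
    # Transposed local operator on partition {a, b}.
    out_a = -fbar_a - fbar_b0
    part_b0 = fbar_a
    # Transposed local operator on partition {b, c}.
    part_b1 = -fbar_c
    out_c = fbar_b1 + fbar_c
    # Same accumulate as in the primal, applied after the local work.
    b0, b1 = accumulate(part_b0, part_b1)
    return (out_a, b0), (b1, out_c)
```

With the seed (3, -1, 2), stored as (3, -1) and (-1, 2), the result is (-2, 1) and (1, 1), which matches the transpose of the assumed global stencil applied to the same seed.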

27 MISSING LINK...
Magical appearance of shared node values?
Hidden or implicit MPI calls?
Need to know the complete call structure
Tough to find... is there an alternative (Hack!)?
Well, I just showed you one

32 IDEA
MPI-AD constructed from sparse graph/matrix
Most scientific problems are of the form Ax = b
Paradigm translates to 2D and 3D
Also to non-linear operators R[U] = 0

33 FIXED POINT ITERATION
Primal: $u^{k+1} = u^k - M^{-1} R(u^k, a)$
Cost function: $J = J(u^{k+1})$
Adjoint: $\bar{u}^{k+1} = \bar{u}^k - M^{-T} \left[ \left(\frac{\partial R}{\partial u}\right)^T \bar{u}^k - \left(\frac{\partial J}{\partial u}\right)^T \right]$
Primal FPI...
  do iter = 1, n
    call residue(u, R)
    call update(u, R)
  end do
  call cost_fun(u, J)
Hand assembled adjoint FPI
  J̄ = 1
  call cost_fun_b(u, (∂J/∂u)^T, J, J̄)
  do iter = 1, n
    call residue_b(u, (∂R/∂u)^T v, R, v)
    R̄ = (∂R/∂u)^T v - (∂J/∂u)^T
    call update_b(v, R̄)
  end do
Primal + Adjoint FPI quite expensive, need to run in parallel
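A serial sketch of the primal and adjoint fixed point iterations on a made-up 2x2 linear model problem, to show how the two loops relate. Everything specific here is an assumption for illustration and not taken from the slides: the residual R(u, a) = K u - a, the cost J = c · u, the Jacobi-type preconditioner M = diag(K), and the names `primal_fpi`, `adjoint_fpi` and `cost_fun`. For this model ∂R/∂a = -I, so the converged adjoint v equals the gradient dJ/da, which the last lines compare against one-sided finite differences.

```python
# Hypothetical model problem: R(u, a) = K u - a, J(u) = c . u, M = diag(K).
K = [[4.0, 1.0], [2.0, 3.0]]
c = [1.0, 1.0]

def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def primal_fpi(a, n=200):
    """u <- u - M^{-1} R(u, a), repeated n times."""
    u = [0.0, 0.0]
    for _ in range(n):
        R = [ku - ai for ku, ai in zip(matvec(K, u), a)]   # residue
        u = [u[i] - R[i] / K[i][i] for i in range(2)]      # update
    return u

def cost_fun(u):
    return sum(ci * ui for ci, ui in zip(c, u))

def adjoint_fpi(n=200):
    """v <- v - M^{-T} [ (dR/du)^T v - (dJ/du)^T ], repeated n times."""
    KT = transpose(K)   # (dR/du)^T for the linear residual
    dJdu = c            # (dJ/du)^T for the linear cost
    v = [0.0, 0.0]
    for _ in range(n):
        Rbar = [ktv - g for ktv, g in zip(matvec(KT, v), dJdu)]
        # M is diagonal here, so M^{-T} = M^{-1}.
        v = [v[i] - Rbar[i] / K[i][i] for i in range(2)]
    return v

a = [1.0, 2.0]
v = adjoint_fpi()

# Finite-difference gradient of J with respect to a, for comparison with v.
eps = 1e-6
J0 = cost_fun(primal_fpi(a))
fd = []
for i in range(2):
    a_pert = list(a)
    a_pert[i] += eps
    fd.append((cost_fun(primal_fpi(a_pert)) - J0) / eps)
```

The adjoint loop needs one run regardless of the number of design variables, whereas the finite-difference gradient needs one extra primal run per component of `a`.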

34 ZERO-HALO PARTITIONING
No extra storage of halos
Implemented in our in-house code
Fluxes calculated for every edge and summed up at the vertices
Need accumulation operation at shared nodes

35 ∂R/∂u AND (∂R/∂u)^T
Primal FPI
  do iter = 1, n
    call residue(u, R)
    call accumulate(R)
    call update(u, R)
  end do
  call cost_fun(u, J)
Hand assembled adjoint FPI
  J̄ = 1
  call cost_fun_b(u, (∂J/∂u)^T, J, J̄)
  do iter = 1, n
    call residue_b(u, (∂R/∂u)^T v, R, v)
    call accumulate((∂R/∂u)^T v)
    R̄ = (∂R/∂u)^T v - (∂J/∂u)^T
    call update_b(v, R̄)
  end do
Self-adjoint MPI
Reflected in the FPI code
What about cost function evaluation?
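One way to see why the accumulate call is described as self-adjoint: viewed as a linear map on the copies of a shared node, it replaces every copy by the sum of all copies, which is a symmetric operator. The sketch below is a single-process stand-in for the MPI operation; the function names and the sample values are hypothetical.

```python
def accumulate(copies):
    """Simulated accumulate: every copy of the shared node gets the sum."""
    total = sum(copies)
    return [total] * len(copies)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

x = [1.0, 2.0]    # values of a shared node on two partitions
y = [3.0, -1.0]   # an arbitrary test vector over the same two copies

lhs = dot(accumulate(x), y)   # <S x, y>
rhs = dot(x, accumulate(y))   # <x, S y>
```

Because `lhs` equals `rhs` for any `x` and `y`, the adjoint of this accumulate is the same operation, so the adjoint loop can reuse the primal communication routine.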

36 HAND ASSEMBLED COST FUNCTION (MPI)
Cost function primal
  cost_fun(u, J):
    J_i = J(u)
    J = Σ_i J_i
Cost function adjoint
  cost_fun_b(u, ū, J, J̄):
    ū = ū + (∂J/∂u)^T J̄
    call accumulate(ū)
Accumulate operation required for adjoint: non-intuitive!

37 SOME IMPROVEMENTS
Adjoint FPI (two accumulates)
  J̄ = 1
  call cost_fun_b(u, (∂J/∂u)^T, J, J̄)
  call accumulate((∂J/∂u)^T)
  do iter = 1, n
    call residue_b(u, (∂R/∂u)^T v, R, v)
    call accumulate((∂R/∂u)^T v)
    R̄ = (∂R/∂u)^T v - (∂J/∂u)^T
    call update(v, R̄)
  end do
Adjoint FPI (single accumulate)
  J̄ = 1
  call cost_fun_b(u, (∂J/∂u)^T, J, J̄)
  do iter = 1, n
    call residue_b(u, (∂R/∂u)^T v, R, v)
    R̄ = (∂R/∂u)^T v - (∂J/∂u)^T
    call accumulate(R̄)
    call update(v, R̄)
  end do
Number of MPI calls reduced to just one by aggregation!
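The aggregation relies on accumulate being linear: summing partial values of (∂R/∂u)^T v and of (∂J/∂u)^T separately and then subtracting gives the same result as subtracting the partial values first and accumulating once. A minimal single-process sketch follows, with made-up partial values at one shared node on two partitions; `accumulate` is a hypothetical stand-in for the MPI routine.

```python
def accumulate(partials):
    """Simulated accumulate: every partition ends up with the summed value."""
    total = sum(partials)
    return [total] * len(partials)

# Made-up per-partition partial values at a single shared node.
dRduT_v = [0.75, -0.25]   # partial (dR/du)^T v on each partition
dJduT = [0.5, 1.5]        # partial (dJ/du)^T on each partition

# Two accumulates: assemble each term, then form the adjoint residual.
two_calls = [r - j for r, j in zip(accumulate(dRduT_v), accumulate(dJduT))]

# Single accumulate: form the partial adjoint residual, then assemble it once.
one_call = accumulate([r - j for r, j in zip(dRduT_v, dJduT)])
```

Both variants leave the same assembled adjoint residual on every partition, while the second needs one communication step per iteration and none for the cost function term.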

38 RESULTS
Strong scaling for a 2D case on an Intel i7 processor (four cores)
[Figure: speed-up versus N for primal and adjoint. (a) Pure MPI and OpenMP, compared with ideal speed-up. (b) Hybrid MPI and OpenMP: 2 ranks + 2 threads, 4 ranks (MPI only), 4 threads (OpenMP only).]

39 SUMMARY
Call structure of MPI codes can be deceptive
Especially for zero-halo partitioning
View the MPI-AD problem as parallel sparse matrix multiplication (possible in our case)
In our experience this works for most partitioning strategies

40 THANK YOU
We thank the European Commission for funding this work under the H2020 framework's IODA project
