Implementation and evaluation of 3D FFT parallel algorithms based on software component model

Size: px

Start display at page:

Download "Implementation and evaluation of 3D FFT parallel algorithms based on software component model"

Hugh Watkins
5 years ago
Views:

1 Master 2 - Visualisation Image Performance University of Orléans ( ) Implementation and evaluation of 3D FFT parallel algorithms based on software component model Jérôme RICHARD October 7th 2014 under the supervision of Christian REZ and Vincent LANORE University tutor: Sophie ROBERT

2 High Performance Scientific applications require a lot of computing time Illustris Project Human Brain Project Parallelism used to reduce the computation time 2

3 High Performance Parallel programming paradigms Memory sharing One central memory readable/writable by multiple s Synchronization using locks, semaphores, transactional memory, etc. Examples : POSIX Threads, OpenMP, UPC Message passing Messages sent between s Synchronization using messages Collective communications Examples : MPI, PVM 3

) Accelerator (GPU, FPGA, etc.) Network topology (fat tree, toric 3D grid, hypercube, etc.

4 High Performance Parallel architectures Clusters Supercomputers Large variety of hardware parallel architectures used Processing nodes (processor, memory, hard drive, etc.) Accelerator (GPU, FPGA, etc.) Network topology (fat tree, toric 3D grid, hypercube, etc.) Maximize Maximize performances performances Adapt Adapt the the software software to to the the Hardware Hardware 4

5 High Performance Observations Hardware frequently renewed Optimizing for a specific hardware is a time-consuming and costly process Adapting Adapting applications applications on on hardware hardware takes takes to to much much time time Internship 3D FFT computing easy to handle/optimize variants Uses of software components to handle variability Adaptation Adaptation of of 3D 3D FFT FFT algorithms algorithms using using software software component component 5

6 Plan Context 3D FFT Component Models Designing 3D FFT Assemblies with L²C Evaluating Re-usability and Performance Conclusion and Discussion 6

7 Plan Context 3D FFT Component Models Designing 3D FFT Assemblies with L²C Evaluating Re-usability and Performance Conclusion and Discussion 7

8 Sequential FFTs Fast Fourier Transform : fast algorithm which converts a discretized signal (matrix) from time domain to frequency domain Used in molecular dynamics, astrophysics, meteorology, 3D tomography, seismology, etc. (e.g. GROMACS) Reference algorithm : Cooley-Tukey Time complexity : O(n log(n)) FFT that use >1 dimensions can be decomposed in multiple 1D FFT : FFT sur les lignes Expensive Expensive on on larges larges matrices matrices FFT sur les colonnes 8

9 Parallel 3D FFTs X Z Y Z X Y 0 1 X Y Z D/Slab decomposition 2 transposition phases 2 calculation phases Scaling limit : Count N where N is the size of the input matrix Scaling Scaling limit limit is is an an issue issue on on current current supercomputers supercomputers 9

10 Parallel 3D FFTs X Y Z Y X X Z Z Y Y X 1 2D/Pencil decomposition 4 transposition phases 3 calculation phases Scaling limit : Nombre N² X Y Z Z Better Better scaling scaling limit limit but but more more expensive expensive in in term term of of network network operation operation

11 How to Optimize 3D FFTs Adapt the transposition algorithm (total exchange) Bruck1994, Prisacari2013 Temps d'exécution Computation-communication overlapping Kandalla2011 Calculs Communications Temps d'exécution Calculs Communications Gain Adapt the number of transpositions Pekurovsky2012 X Z Y X Z X Y Z Y 11 Optimizations Optimizations depend depend on on hardware hardware and and software software parameters parameters

12 Existing Implementations Sequential libraries Examples : ESSL, MKL, FFTW, etc. Parallel libraries Use sequential libraries Conditional compilation Examples : P3DFFT, 2DECOMP Issues : overlapping of highly optimized small pieces of code Compilation of a high level mathematical description Examples : FFTW, SPIRAL Issues : optimizations integrated into generators/compilers 12 Adding Adding new new optimizations optimizations is is difficult difficult and and can can be be aa tricky tricky process process

13 Plan Context 3D FFT Component Models Designing 3D FFT Assemblies with L²C Evaluating Re-usability and Performance Conclusion and Discussion 13

14 Component Models Build an application by assembling component instances Component Black box with interfaces Dissociates the interface and its implementation Interacts through its interfaces Assembly Component instances Interface connexion Benefits Clavier usb usb Ordinateur Separation of concerns Reuse 14

15 Distributed Component Models Examples of distributed component models CCM (Corba Component Model) GCM (Grid Component Model) Overhead Overhead not not acceptable acceptable for for HPC HPC applications applications Examples of HPC dedicated component models CCA (Common Component Architecture) No No MPI MPI in in the the model model et et granularity granularity limited limited to to process process L²C (Low Level Component) 15

16 Low Level Component (L²C) Component model that does not hide system issues Developed inside the Avalon team C++ or FORTRAN components that have attributes and ports Intra-process interactions with use/provide ports C++ FORTRAN Inter-process interactions Message-passing via MPI Remote procedure call via CORBA Assemblage describe by a file (processes, attributes, instances...) 16 Efficient Efficient assembly, assembly, but but exhaustive exhaustive description description Generation Generation

17 Models features Ordinateur portable Clavier Hierarchy : component instance can be an assembly (composite components) out out Souris in in s c Carte mère out e in Écran Connectors : element allowing to interconnect multiple instance ports without define specifically how Ordinateur client LAN hôte Ordinateur client Ordinateur Genericity : abstraction of components allowed by using generic component then specialized C = Ordinateur X=3 Squelette<C,X> C 17 Features Features provide provide by by HLCM HLCM

18 HLCM High Level Component Model Developed inside the Avalon team Features Hierarchy Generic Based on connectors Transform a high level assembly into a low level assembly Targeted low level model can be L²C, Gluon++, etc. 18

19 Plan Context 3D FFT Component Models Designing 3D FFT Assemblies with L²C Evaluating Re-usability and Performance Conclusion and Discussion 19

20 FFTs 3D Assembly Sequential Assembly Distributed Assembly Design assembly variants using Local transformations Global transformations 20

21 Basic FFT 3D Assembly Port provide Port use LocalDataInitializer (2) Instances de composants LocalDataFinalizer (6) LocalMaster (8) go Allocator (1) allocate Planifier1D_X (3) Sequential assembly init 8 tasks identified and mapped to components init Planifier2D_XY (3) ProcessingUnit (7) FFTW (4) 2 control tasks plan LocalTranspose_XZ (5) 6 computing tasks ProcessingUnit (7) FFTW (4) plan LocalTranspose_XZ (5) 3D 3D FFT FFT sequential sequential algorithms algorithms can can be be implemented implemented in in L²C L²C 21

22 Distributed FFTs 3D Assembly MPI MPI SlabDataInitializer (2) go SlabDataInitializer (2) MPI SlabDataFinalizer (6) SlabMaster (8) SlabDataFinalizer (6) Allocator (1) Allocator (1) allocate Planifier1D_X (3) Planifier1D_X (3) init init Planifier2D_XY (3) Planifier2D_XY (3) init init ProcessingUnit (7) FFTW (4) FFTW (4) plan MpiTransposeSync_XZ (5) plan MPI ProcessingUnit (7) MpiTransposeSync_XZ (5) ProcessingUnit (7) SlabMaster (8) allocate go FFTW (4) FFTW (4) plan MpiTransposeSync_XZ (5) plan MPI ProcessingUnit (7) MpiTransposeSync_XZ (5) Assembly duplicated on each MPI process 4 components replaced (providing a MPI interface) 22 1D 1D decomposition decomposition can can be be implemented implemented in in L²C L²C

23 Local Transformations Component replacement used to Adapt transpositions Switch the sequential implementation Attributes configuration used to Load balance work between instances Allowing overlapping LocalTranspose_XZ (5) MpiTranspose_XZ (5) LAD (XML) [ ] <property id="sizex"> <value type="uint64"> 256 </value> </property> [ ] 23

24 Global Transformations 2D decomposition MPI MPI PencilDataInitializer go PencilDataInitializer MPI PencilDataFinalizer PencilMaster PencilDataFinalizer Allocator Allocator allocate Planifier1D_X Planifier1D_X init init Planifier1D_X Planifier1D_X init init Planifier1D_X Planifier1D_X init init ProcessingUnit FFTW FFTW plan plan MPI MpiTransposeSync_XY ProcessingUnit ProcessingUnit FFTW FFTW plan plan MPI MpiTransposeSync_XZ ProcessingUnit ProcessingUnit FFTW MpiTransposeSync_XZ FFTW plan plan MPI ProcessingUnit MpiTransposeSync_XZ MpiTransposeSync_XY MpiTransposeSync_XZ MpiTransposeSync_XY PenciMaster allocate go MPI MpiTransposeSync_XY 24 2D 2D decomposition decomposition can can be be implemented implemented in in L²C L²C

25 Global Transformations Adaptation of the number of transpositions 25 Easy Easy to to adjust adjust the the number number of of transpositions transpositions

26 Variants 1D decomposition 2D decomposition L2C_1D_[...] L2C_2D_[...] Without load balancing With load balancing L2C_XD_[...] L2C_XDH_[...] Without overlapping With overlapping L2C_XD_[...] L2C_XDR_[...] Transposition count 1D Decomposition 2D Decomposition L2C_1D_1t_xz L2C_1D_1t_yz L2C_1D_2t_xz L2C_1D_2t_yz L2C_2D_2t_yzx L2C_2D_2t_zxy L2C_2D_3t_xzy L2C_2D_3t_yxz L2C_2D_3t_zyx L2C_2D_4t_xyz All assembly implemented (except those using both overlapping et 2D decomposition) 26 Some Some optimization optimization implemented implemented Many Many variants variants

27 Plan Context 3D FFT Component Models Designing 3D FFT Assemblies with L²C Evaluating Re-usability and Performance Conclusion and Discussion 27

28 Reuse Analysis Version Line count (C++ code) Proportion of reused code L2C_1D_2t_xz L2C_1D_1t_yz % L2C_1D_2t_yz % L2C_1D_2t_yz_blk % L2C_1DH_1t_yz % L2C_1DH_2t_yz_blk % L2C_2D_4t_xyz % L2C_2D_2t_zxy % L2C_2DH_2t_zxy % Total 30 components 3743 lines of C++ code 28 Assembly Assembly components components are are reusable reusable

29 Performance Measurement Experimental platform Grid'5000 and Curie Reference implementation FFTW and 2DECOMP Input matrix size 2p x 2p x 2p with p ℕ Launch count 100 Result computing method Average Error bars 1er et 3ème quartiles Compilers GCC on Grid'5000 and ICC on Curie MPI implementation OpenMPI 29

1D decomposition 2D decomposition 30 L²C L²C assemblies

30 Performance Measurement and Scaling Large scale experiments on the Curie supercomputer Matrix size : 1024 x 1024 x D decomposition 2D decomposition 30 L²C L²C assemblies assemblies scale scale well well up up to to cores cores

31 Variants Evaluation and Analysis Heterogeneous experiments (on Grid'5000) Matrix bloc size experimentally adapted 1 slow cluster and 1 fast cluster used Matrix size : 256 x 256 x 256 1D decomposition 2D decomposition 31 Assemblies Assemblies implement implement load load balancing balancing

32 Plan Context 3D FFT Component Models Designing 3D FFT Assemblies with L²C Evaluating Re-usability and Performance Conclusion and Discussion 32

33 Conclusion and Perspectives Examine whether component models can handle variability of 3D FFT codes Contribution Design and implement 3D FFT algorithms with L²C Experiment on Grid'5000 et Curie Results Easy to handle variants (code highly reused between assembly) High performance (competitive with reference implementations) Perspectives Design higher level assembly using HLCM Implement selection algorithms to automatically specialize assemblies 33

Towards a Reconfigurable HPC Component Model

C2S@EXA Meeting July 10, 2014 Towards a Reconfigurable HPC Component Model Vincent Lanore1, Christian Pérez2 1 ENS de Lyon, LIP 2 Inria, LIP Avalon team 1 Context 1/4 Adaptive Mesh Refinement 2 Context