General overview and first results.


1 French Technological Watch Group: General overview and first results Fusion workshop at «Maison de la Simulation» 29/11/2016

2 FRENCH TECHNOLOGICAL WATCH GROUP Led by GENCI and its partners SC'16 l 01/12/2016 l 2

3 FRENCH TECHNOLOGICAL WATCH GROUP Led by GENCI and its partners
- Share and pool the partners' expertise at the national level
- Anticipate upcoming Exascale architectures and provide orientations for future procurements
  - Manycore/heterogeneous processors
  - Deeper and more complex memory hierarchies
  - Fault tolerance
  - Energy optimisation
- Organise code modernisation for scientific communities
  - Ease migration to these new platforms
  - Preserve code legacy by using standards (OpenMP)
  - Enable potential specific optimisations with low-level languages


4 TECHNOLOGICAL WATCH GROUP Led by GENCI and its partners
- Link to developers of representative applications
  - Climate: DYNAMICO, MesoNH
  - Astro & Geophysics: RAMSES, EMMA, SPECFEM3D
  - Combustion: YALES2, AVBP, TRUST
  - Fusion: GYSELA5D
  - Materials: METALWALLS
  - Particle Physics: SMILEI, CMS-MEM
  - Kernels: PATMOS, HYDRO, QR_MUMPS, QMC=Chem
  - Deep/Machine Learning
- Link to developers of tools and libraries
  - System (runtime, checkpoint/restart)
  - Profiling (performance, energy)
  - Numerical solvers
  - Data management/analysis
- Core of the group: HPC expertise
  - National centers: CINES, IDRIS and TGCC
  - Inria
  - Maison de la Simulation
  - «Groupe Calcul»
  - GENCI
  - Vendors support


5 FRENCH TECHNOLOGICAL WATCH GROUP Led by GENCI and its partners
- Deployment of 2 early technology platforms
  - A Bull sequana platform at CINES based on Intel KNL
    - 48 KNL nodes => 146 Tflop/s peak performance
  - An IBM OpenPOWER platform at IDRIS based on P8+ Nvidia GPU
    - 4 P100 GPUs (latest generation) per node
    - NVLink between P8 and GPU
    - 12 nodes => 254 Tflop/s peak performance (GPU only)
    - More than 400 GB/s bandwidth per node
    - Software stack is not at its highest maturity
- Tight collaboration between vendors, integrators, developers and HPC centres
- Evaluation of real-time performance/energy profiling tools
  - Allinea Performance Report / MAQAO / others
  - Intel VTune Amplifier and MPI Performance Snapshot (MPS)
  - Nvprof for the OpenPOWER solution
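The peak figures quoted on this slide can be broken down per node and per GPU with simple arithmetic. The following is a minimal sketch that uses only the numbers given above; the function and variable names are illustrative and not part of the original presentation.

```python
def per_unit_peak(total_tflops: float, units: int) -> float:
    """Return the peak performance per unit, in Tflop/s."""
    if units <= 0:
        raise ValueError("units must be a positive integer")
    return total_tflops / units


# Figures quoted on the slide.
KNL_TOTAL_TFLOPS = 146.0     # peak of the 48-node KNL platform
KNL_NODES = 48
POWER_TOTAL_TFLOPS = 254.0   # peak of the 12-node OpenPOWER platform (GPU only)
POWER_NODES = 12
GPUS_PER_NODE = 4

knl_per_node = per_unit_peak(KNL_TOTAL_TFLOPS, KNL_NODES)
power_per_node = per_unit_peak(POWER_TOTAL_TFLOPS, POWER_NODES)
per_gpu = per_unit_peak(power_per_node, GPUS_PER_NODE)

print(f"KNL node:       {knl_per_node:.2f} Tflop/s")    # about 3.04
print(f"OpenPOWER node: {power_per_node:.2f} Tflop/s")  # about 21.17 (GPU only)
print(f"Single P100:    {per_gpu:.2f} Tflop/s")         # about 5.29
```

Under these figures, one OpenPOWER node offers roughly seven times the peak of one KNL node, counting GPUs only. Peak numbers say nothing about sustained application performance, which is what the following slides report.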


6 SOFTWARE ENVIRONMENT Programming Options for OpenPOWER
- Programming on Intel Knights Landing
  - Straightforward: Intel Compiler, Intel MPI, OpenMP
- Programming options for OpenPOWER
  - Not as straightforward as Intel KNL
  - General availability for OpenMP is scheduled for the end of 2016
  - Full support is scheduled for 2017

7 KNL PORTING AND OPTIMISATION First results
- Porting & optimisation workshops with Atos & Intel support
  - Workshop on Colfax Ninja developer platforms (Intel Xeon Phi 7210 processor)
  - Workshop using Frioul, the machine with 48 Intel Xeon Phi 7250 nodes
- Porting efficiency:
  [Figure: effective mean speedup obtained after 2 workshops (Haswell vs KNL). Axes: speedup against working time (hours). Series: performance vs Haswell 24c, energy efficiency vs Haswell 24c.]

8 KNL PORTING AND OPTIMISATION First results
[Figure: GENCI codes, KNL speedup (scale from 0 to 4,5). Codes shown by domain: CFD (3), Climate (2), Geophysics (1), Kernel (3), Materials (1). Series: speedup (Haswell node), speedup (Broadwell node), base.]

9 OPENPOWER PORTING AND OPTIMISATION First results
- MesoNH: weather forecasting code
  - 60% of the code has been ported using OpenACC
  - Results with 16 MPI processes and MPS (multi-process service)
PPOINT PROJET CELLULE DE VEILLE TECHNOLOGIQUE l 01/12/2016 l 9


10 OPENPOWER PORTING AND OPTIMISATION First results
- VASP (an IBM-ported application)
  - These are first results; the whole application still has to be ported
  [Figure: bar chart of speedup vs full-node, CPU-only execution (scale from 0,00 to 2,00). Configurations shown: Firestone, 20 MPI tasks, 0 GPUs; Firestone, 1 MPI task, 1 GPU; Firestone, 4 MPI tasks, 4 GPUs; Firestone, 20 MPI tasks, 4 GPUs; Minsky, 1 MPI task, 1 GPU; Minsky, 4 MPI tasks, 4 GPUs. Annotated values: 1,00 (the reference), 0,46 (only 50% slower), 0,75 (only 25% slower), 1,35 (35% faster), 1,80 (80% faster), 1,86 (86% faster).]
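The "faster" and "slower" annotations on this chart follow directly from the speedup values, taking the full-node, CPU-only run as 1,00. The sketch below shows one way to do that conversion; `describe_speedup` is a hypothetical helper written for this illustration, and "slower" is read here as a relative drop in throughput, which is an assumption about how the slide labels were derived.

```python
def describe_speedup(speedup: float) -> str:
    """Turn a speedup relative to a reference run into a short label.

    A speedup of 1.0 is the reference; values above 1.0 are reported as
    "N% faster" and values below 1.0 as "N% slower".
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    change = (speedup - 1.0) * 100.0
    if abs(change) < 0.5:
        return "reference"
    direction = "faster" if change > 0 else "slower"
    return f"{abs(change):.0f}% {direction}"


# Values annotated on the chart (decimal commas written as points).
chart_values = [1.00, 0.46, 0.75, 1.35, 1.80, 1.86]
labels = [describe_speedup(v) for v in chart_values]

for value, label in zip(chart_values, labels):
    print(f"{value:.2f} -> {label}")
```

With this reading, 0,46 comes out as 54% slower, which the slide rounds to "only 50% slower"; the other annotations match the computed labels exactly.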


11 FRENCH TECHNOLOGICAL WATCH GROUP Focus on Gysela
- Work has just begun on GPU with Altimesh
  - Very first results
- First lessons (= general good practices)
  - Avoid buffers
    - that store an addition/multiplication
    - that are reused once
    => prefer recomputing to precomputing
  - Use stack vs heap
    - Malloc is very expensive (especially with many cores)
  - Use zero-copy for data that you access/write once


12 TECHNOLOGICAL WATCH GROUP OPENING
Phase 3 Q2
- Opening to a wider panel of applications in May 2017
  - Applications will be reviewed through the «DARI» preparatory access
- Additional applications

13 FRENCH TECHNOLOGICAL WATCH GROUP Questions? Thank you for your attention
