NVIDIA Application Lab at Jülich

1 NVIDIA Application Lab at Jülich
Dirk Pleiter, Jülich Supercomputing Centre (JSC)
Member of the Helmholtz Association

2 Forschungszentrum Jülich at a Glance (status 2010)
Budget: EUR 450 million
Staff: 4,800 (of whom 1,630 are scientists)
Visiting scientists: 900 per year
Trainees: 90
Publications: 1,800
Protective rights and licences: 14,800
Research fields: health, energy and environment, and information technology; key technologies for tomorrow

3 Jülich Supercomputing Centre
Supercomputer operation for: Centre: FZJ; Regional: JARA; Helmholtz & National: NIC, GCS; Europe: PRACE, EU projects
Application support: user support; coordination with SimLabs; scientific visualization; peer review support and coordination
R&D work: algorithms, performance analysis and tools; community data management service; computer architectures, Exascale Laboratories (EIC, ECL, NVIDIA)
Education and training

4 Supercomputer Systems: Dual Track Approach
Timeline: 2004, 2006-8, 2009, 2012, 2014
General-purpose track: IBM Power 4+ JUMP, 9 TFlop/s; IBM Power 6 JUMP, 9 TFlop/s; JUROPA, 200 TFlop/s, and HPC-FF, 100 TFlop/s; JUDGE, 240 TFlop/s; JUROPA++ Cluster, 1-2 PFlop/s, + Booster
Highly scalable track: IBM Blue Gene/L JUBL, 45 TFlop/s; IBM Blue Gene/P JUGENE, 1 PFlop/s; IBM Blue Gene/Q JUQUEEN, 5.7 PFlop/s (target)
File server: GPFS, Lustre

5 JUDGE Cluster
System: 206 IBM iDataPlex nodes; 2 Tesla M2050 or M2070 GPUs per node; InfiniBand QDR network; peak performance: 239 TFlop/s
Users: Institute for Advanced Simulation (molecular dynamics and mechanics, micro-magnetism simulations, medical image reconstruction); JuBrain partition; Milky Way partition

6 NVIDIA Application Lab at Jülich
Collaboration between JSC and NVIDIA since July 2012
Enable scientific applications for GPU-based architectures
Provide support for their optimization
Investigate performance and scaling
Work focus: application requirements analysis; Kepler and CUDA feature analysis; parallelization on many GPUs; collaboration with performance tools developers; training

7 Pilot Application: JuBrain
Application developed at the Institute of Neuroscience and Medicine (INM-1) at Forschungszentrum Jülich: Katrin Amunts, Markus Axer, Marcel Huysegoms
Research goal: accurate, highly detailed computer model of the human brain

8 Brain Section Images
Blockface pictures: created while cutting the brain into sections
Histological images; polarized light images
Low resolution vs. high resolution: pixel size 100 μm vs. 3 μm; data 30 MBytes vs. 40 GBytes
40 GBytes of data exceeds GPU memory capacity
Challenge: 3D reconstruction

9 3D Reconstruction
Registration pipeline (diagram): moving image, fixed image, metric, optimizer, interpolator, transformation
Registration algorithms: rigid registration, 3 parameters; affine registration, 6 parameters; elastic registration, O(100) parameters
O(30) speedup on GPU

10 Fluid Dynamics on Fermi and Kepler
Lattice Boltzmann method, D2Q37 model
Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, and TU Eindhoven
Reproduces the dynamics of a fluid by simulating virtual particles which collide and propagate
Simulation of large systems requires double-precision computation on many GPUs
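The propagate step can be illustrated with a small host-side C sketch. It is an assumption-laden stand-in, not the application's code: it uses the 9-velocity D2Q9 set instead of D2Q37 (whose velocity table is not given in these slides), periodic boundaries, and a population-major layout `pop[i*nx*ny + x*ny + y]`. The names `CX`, `CY`, and `propagate` are hypothetical, and the collide step is omitted.

```c
#include <stddef.h>

enum { NPOP = 9 };  /* stand-in: D2Q9, not the D2Q37 set of the slides */

/* Discrete velocities of the stand-in lattice. */
static const int CX[NPOP] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
static const int CY[NPOP] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };

/* Propagate: population i at site (x, y) moves to (x + CX[i], y + CY[i]),
   wrapping around periodically. src and dst must not overlap. */
static void propagate(const double *src, double *dst, int nx, int ny) {
    for (int i = 0; i < NPOP; i++) {
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                int xd = (x + CX[i] + nx) % nx;
                int yd = (y + CY[i] + ny) % ny;
                dst[(size_t)i * nx * ny + (size_t)xd * ny + yd] =
                    src[(size_t)i * nx * ny + (size_t)x * ny + y];
            }
        }
    }
}
```

The step only copies data and performs no floating-point arithmetic, which fits the later description of the propagate kernel as dominated by memory access.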

11 Collide Kernel on Fermi
Kernel dominated by arithmetic operations
Figure: floating-point performance [GFlop/s] as a function of the number of threads/block
Excellent performance on Fermi
Implementation: F. Schifano (U Ferrara/INFN)

12 Kepler Performance Tuning
Performance analysis observations
Significant increase of L1 cache misses: 17% (Tesla M2090) vs. 67% (Tesla K20)
SM performance increased, but L1 cache capacity remained unchanged
Original loop:
    for (i = 0; i < NPOP-1; i++) {
        lpop = p_prv[i*nx*ny + idx];
        u = u + param_cx[i] * lpop;
        v = v + param_cy[i] * lpop;
    }
With enforced unrolling:
    #pragma unroll
    for (i = 0; i < NPOP-1; i++) {
        lpop = p_prv[i*nx*ny + idx];
        u = u + param_cx[i] * lpop;
        v = v + param_cy[i] * lpop;
    }
Problem mitigation by simple code change: enforce loop unrolling to eliminate indirect memory accesses
J. Kraus (NVIDIA Lab)
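One plausible reading of the unrolling fix is that, once the trip count is a compile-time constant and the loop is fully unrolled, every `param_cx[i]` and `param_cy[i]` is indexed by a literal constant rather than by the run-time value of `i`. The self-contained C sketch below shows the moment loop under that assumption. It is a host-side illustration, not the kernel from the slide: the velocity table is the D2Q9 stand-in, the loop covers all `NPOP` populations, and the function name `moments` is hypothetical.

```c
#include <stddef.h>

enum { NPOP = 9 };  /* stand-in velocity set (D2Q9); the slides use D2Q37 */

static const double param_cx[NPOP] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
static const double param_cy[NPOP] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };

/* Accumulate the velocity moments at lattice site idx.
   Populations are stored population-major: p_prv[i*nx*ny + idx]. */
static void moments(const double *p_prv, int nx, int ny, int idx,
                    double *u_out, double *v_out) {
    double u = 0.0, v = 0.0;
    /* The pragma is only emitted when compiling with nvcc;
       a plain C compiler never sees it. */
#ifdef __CUDACC__
#pragma unroll
#endif
    for (int i = 0; i < NPOP; i++) {
        double lpop = p_prv[(size_t)i * nx * ny + idx];
        u += param_cx[i] * lpop;
        v += param_cy[i] * lpop;
    }
    *u_out = u;
    *v_out = v;
}
```

Whether a given compiler unrolls this loop without the pragma depends on its heuristics, which is why the slide enforces it explicitly.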

13 Collide Kernel on Kepler GK110
Comparison Fermi vs. Kepler
Grid size considered here: 252 x
Figure: floating-point performance as a function of the number of threads/block
Performance improvement: 1.7x

14 Propagate Kernel
Kernel dominated by memory access
Grid size considered here: 252 x
Figure: memory bandwidth [GByte/s] as a function of the number of threads/block
Performance improvement: 1.4x
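For a memory-bound kernel like this, an effective bandwidth figure is commonly derived from the data volume and the measured run time. The C helper below is a sketch of that arithmetic under one stated assumption: each population is read once and written once per lattice site in double precision. The function name `propagate_bandwidth_gbs` and its parameters are hypothetical, and the slides do not say how their bandwidth numbers were obtained.

```c
/* Effective bandwidth in GByte/s for one propagate sweep.
   Assumes every one of the npop populations is read once and written once
   per lattice site, each as an 8-byte double. */
static double propagate_bandwidth_gbs(long sites, int npop, double seconds) {
    double bytes = 2.0 * (double)sites * (double)npop * (double)sizeof(double);
    return bytes / seconds / 1e9;
}
```

Comparing this number with the peak memory bandwidth of the GPU indicates how close the kernel runs to the hardware limit.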

15 Summary
NVIDIA Application Lab at Jülich: new and fruitful model for collaboration; we are just at the beginning...
Application requirements analysis: JuBrain, a project aiming for a realistic model of the human brain
Kepler feature analysis: initial performance results for the Lattice Boltzmann application on GK110; the very high performance level reached on Fermi can be sustained
