Experiences with GPGPUs at HLRS
Stefan Wesner, Managing Director, High Performance Computing Centre Stuttgart (HLRS)
HLRS Context and Challenges Ahead
- PRACE Tier-0 centre: 1 PF in 2011, 4-5 PF in 2013, 10 PF in 2015?
- HPC service for industry
- Research towards Exascale (CREST)
- National supercomputing centre: HPC service for ~100 projects with several hundred users, targeting different levels of parallelism and disciplines, but with a focus on engineering
Drivers and Issues for GPGPU Adoption at HLRS
Issues:
- Complex codes with a long history
- Legacy codes designed and adapted for dinosaur computing system architectures
- High level of innovation is paralyzing!
- APIs are too low-level or not standardized (protection of my investment?)
- Industrial customers of HLRS demand stable environments
GPGPU expectations:
- Very high performance
- High memory bandwidth
- High level of innovation is exciting!
- APIs allowing full low-level control are exciting!
GPGPU Deployment History and Future
- Starting point: research activities, mostly in the visualization department
- Initial deployment on a national resource: Laki (Intel Nehalem / Tesla S1070)
- Hermit1 will be equipped with GPGPUs (2012)
- Hermit2, a PRACE Tier-0 system, will have a visible accelerator share
Timeline:
- <2008: GPGPU research
- 2008: NEC cluster Laki, 32× Tesla S1070, 62 TF peak
- Q3/2011: Cray XE6 Hermit1, Phase 1 Step 1, ~1 PF peak
- 2012: update of Hermit1 with 32 Cray XK6 nodes
- 2013: Cray Cascade Hermit2, Phase 1 Step 2, ~4-5 PF peak
Use Case: Erosion in Turbine Runners
- CFD simulation: Ansys CFX
- Unstructured grid, 15,215,488 elements
- Contact: Florian Niebling, niebling@hlrs.de; Dr. Uwe Wössner, woessner@hlrs.de
Use Case: Parallel Surface Extraction (GPU)
- Parallelization of iso-/cutting-surface extraction for interactive post-processing on unstructured grids
- NVIDIA Fermi GPU: >5× faster than 16 Xeon E5472 cores
- [Diagram: renderer, modules, and a shared-memory data manager per node, connected over an MPI transport layer, with the GPU attached to the data manager]
- Contact: Florian Niebling, niebling@hlrs.de; Dr. Uwe Wössner, woessner@hlrs.de
Industrial Collaboration: HMI-Tec
- AI Neuro-Sorter: enhanced software to analyse, sort, and further process written text
- Objective: parallelize using CUDA
- The existing code, based on the Boost library, was rewritten and optimized for high single-core performance
- The CUDA version was tested and compared on NVIDIA Fermi and Tesla, showing a nice speedup of up to 30×
- Only a few weeks of porting effort, done as part of a master's thesis!
- Contact: Dr. Rainer Keller, keller@hlrs.de
Industrial Collaboration: HMI-Tec (Speedup Results)
- Speedup of the training phase; pattern: 3766 words, 3766 input neurons, varying number of inner neurons
- Factor >20 compared to the CPU version (Nehalem)
- Factor 2 compared to the original Boost-based version
- Data: Zaheer Ahmed
- Contact: Dr. Rainer Keller, keller@hlrs.de
Summary of Experiences and Derived Next Steps
- For communities with well-developed open-source or ISV applications, GPGPUs already deliver benefits in time to result and/or flops per Euro → GPGPUs are part of the HLRS offer to academic and industrial users
- New application areas, in particular where the starting point is non-parallelized code, have a high speed-up potential → seek collaborations with users from academia and industry to leverage this potential
- Applications combining visualization and computing, e.g. interactive or real-time scenarios, exploit the GPGPU architecture well
- What about legacy codes and very large parallelized applications?
  → Investigate newly emerging programming approaches (HMPP, PGI, Cray accelerator compiler) and compare them to CUDA and OpenCL
  → Large applications need a more stable or standardized environment
  → Accelerator programming must become easier for the average developer
  → Communication between host and accelerator, and between accelerators, must improve
THANK YOU! ANY QUESTIONS?
Dr. Stefan Wesner, wesner@hlrs.de
Come and visit us at the HLRS booth (#134) and at the HPC User Forum, 6-7 October 2011, in Stuttgart