High performance computing and numerical modeling


High performance computing and numerical modeling
Volker Springel

Plan for my lectures:
Lecture 1: Collisional and collisionless N-body dynamics
Lecture 2: Gravitational force calculation
Lecture 3: Basic gas dynamics
Lecture 4: Smoothed particle hydrodynamics
Lecture 5: Eulerian hydrodynamics
Lecture 6: Moving-mesh techniques
Lecture 7: Towards high dynamic range
Lecture 8: Parallelization techniques and current computing trends

43rd Saas Fee Course, Villars-sur-Ollon, March 2013

Future progress with cosmological simulations requires...
- Better resolution (more computing power...)
- Higher accuracy of numerical codes
- More complete and realistic physics models

The number of cores on the top supercomputers grows exponentially EXTREME GROWTH OF PARALLELISM (figure by G. Sutmann)

Currently, typical supercomputers carry out about 1-10 Petaflops. JUGENE IN JUELICH

Millennium-XXL: the largest high-resolution N-body simulation
- 303 billion particles, L = 3 Gpc/h
- ~700 million halos at z=0, ~25 billion (sub)halos in merger trees
- m_p = 6.1 x 10^9 Msun/h
- 12288 cores, 30 TB RAM on the supercomputer JuRoPa in Juelich, 2.7 million CPU-hours
Angulo et al. (2011)

Trouble ahead in the Exaflop regime? How long would the Millennium-XXL run take on an Exaflop supercomputer at peak performance? 15 min. One of the main problems: power consumption. Petaflop computer: 6 MW. Exaflop computer: of order a GW? Need to get this down to 20-40 MW.

Parallelization with distributed memory
Many compute nodes connected through a fast network (Beowulf cluster). Programming is usually done with the Message Passing Interface (MPI). However, some higher-level languages allow different programming models, for example the PGAS languages:
- Unified Parallel C
- Coarray Fortran
- Chapel
or languages with a parallel runtime, like Charm++.

These days, the compute nodes are usually SMP or ccNUMA machines MULTIPLE SOCKETS, MULTIPLE CORES, AND MEMORY BANKS A node with 4 sockets of 4 cores each: 16 physical cores in total.

Simultaneous multi-threading (SMT) and hyperthreading (HT) on some modern CPUs add virtual cores INCREASING THE USE OF CORE RESOURCES THROUGH HARDWARE-ASSISTED THREADS Intel and IBM produce CPUs whose cores support hardware threads. Blue Gene/Q has 16 cores per node, each with 4 hardware threads, hence one should run 64 threads per node for full use.

Current multi-socket SMP nodes usually have non-uniform memory access DIFFERENT MEMORY ARCHITECTURES [figure: compute node with uniform memory access vs. compute node with non-uniform memory access (NUMA)]

Developing MPI programs often requires severe algorithmic changes PROGRAMMING MODEL OF MPI Data decomposition and all communication in the parallel code need to be coded explicitly. Doing this well represents a steep learning curve. MPI is well standardized and portable, but current MPI libraries face scaling challenges with respect to their memory needs when all-to-all communication patterns occur.

The shared memory of a compute node allows a simplified parallel programming model through multi-threading MULTI-THREADING UNIX processes are isolated from each other - they have their own protected memory. A process can, however, be split up into multiple execution paths, so-called threads, allowing lightweight parallelism in which data sharing is trivially achieved. Threads can be created by the programmer explicitly through the pthreads system calls, or via OpenMP. Example for pthreads: the two loops will be executed concurrently by two different threads on two different physical cores (if available).

In OpenMP, one can easily create and destroy threads through a simple language extension THE FORK-JOIN PROGRAMMING MODEL OF OPENMP Compilers exist for C/C++ and Fortran. Allows easy parallelization of existing code. Thanks to shared-memory synchronization and data exchange, work-load imbalance in parallel sections can be avoided. Reduces the memory overhead needed for ghost cells, bookkeeping data, etc. Example for OpenMP.

Shared memory can be easily used for near-perfect loop-level parallelism USING MULTIPLE CORES WITH THREADS [figure: wallclock time per timestep for single-threaded MPI tasks on a quad-core node vs. one multi-threaded MPI task using all four cores] Threads are light-weight: unlike processes, their creation/destruction takes almost no time. They inherit all global variables and resources (e.g. open files) from their parent process/thread. Mutual exclusion locks need to be used where needed to avoid race conditions. How to get them? POSIX/System-V threads, or OpenMP.

Ordinary main-stream processors have acquired powerful vector extensions THE SSE/AVX INSTRUCTIONS OF INTEL AND AMD PROCESSORS AVX, introduced with Intel Sandy Bridge, offers registers that are 256 bits wide. With those, one can now do 4 double-precision calculations in parallel, or 8 single-precision calculations in parallel, essentially in the same time that the usual instructions take for a single operation. If these instructions can be used efficiently, substantial speed-ups are possible. To use them, one either trusts the compiler to exploit them in an optimal way (wishful thinking), or one can program them directly with vector intrinsics... In the Intel Xeon Phi accelerator cards, the width of these vector instructions has been doubled yet again, to 512 bits.

Device computing is a new trend to exploit the extreme raw computational speed of GPUs or Xeon Phi accelerator cards HETEROGENEOUS COMPUTING Ideally, all threads must execute the same code, otherwise full performance can't be reached. Branching needs to be minimized among the threads. (figure by D. Aubert)


High performance computing will offer unprecedented opportunities for astrophysics in the coming years... and a novel programming complexity WHO ORDERED HYBRID COMPUTING? Parallelization through MPI, OpenMP, GPU, and SSE/AVX: all of this needs to be combined! We need MPI+OpenMP+GPU codes on the roadmap to Exascale. But do our problems really have a degree of parallelism that allows the use of a billion threads? It may be that such machines can only be used at scale by science disciplines whose problems are computationally simpler than what we have in galaxy formation...