HPC Architectures. Types of resource currently in use


Reusing this material
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
This means you are free to copy and redistribute the material, and adapt and build on the material, under the following terms: you must give appropriate credit, provide a link to the license and indicate if changes were made. If you adapt or build on the material you must distribute your work under the same license as the original.
Note that this presentation contains images owned by others. Please seek their permission before reusing these images.

Outline
- Shared-memory architectures
- Distributed-memory architectures
- Distributed memory with shared-memory nodes
- Accelerators
- Filesystems
- What is the difference between different tiers of machine? (interconnect, software, job-size bias/capability)

Shared memory architectures: simplest to use, hardest to build

Shared-Memory Architectures
- Multi-processor shared-memory systems have been common since the early 1990s: originally built from many single-core processors, with multiple sockets sharing a common memory system.
- A single OS controls the entire shared-memory system.
- Modern multicore processors are just shared-memory systems on a single chip: you can't buy a single-core processor even if you wanted one!
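
As an illustration of the shared-memory model (a minimal sketch, not taken from the slides: the array size and the particular loop are arbitrary choices), OpenMP threads running on the cores of such a system can all read and write a single array held in the common memory:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];           /* one array, visible to every thread */

    #pragma omp parallel for      /* threads on different cores share a[] */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("filled %d elements using up to %d threads\n", N, omp_get_max_threads());
    return 0;
}

Compiled with OpenMP support (e.g. a flag such as -fopenmp on many compilers), the loop iterations are divided among the cores with no explicit data movement, because every core sees the same memory.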

Symmetric Multi-Processing Architectures
[Diagram: several processors connected to a single memory over a shared bus]
All cores have the same access to memory, e.g. a multicore laptop.

Non-Uniform Memory Access Architectures
Cores have faster access to their own local memory than to memory attached elsewhere in the system.

Shared-memory architectures
- Most computers are now shared-memory machines due to multicore.
- Some are true SMP architectures, e.g. BlueGene/Q nodes, but most are NUMA.
- NUMA machines are programmed as if they were SMP: the details are hidden from the user, and all cores are controlled by a single OS.
- It is difficult to build shared-memory systems with large core counts (> 1024 cores): they are expensive and power hungry, and it is difficult to scale the OS to this level.
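
Even though NUMA machines are programmed as if they were SMP, data placement can still affect performance. A common technique, sketched below under the usual "first touch" assumption (memory pages are placed near the core that first writes them; this is typical OS behaviour, not something stated in the slides), is to initialise data in parallel so that each thread's pages end up in its local memory:

#include <stdlib.h>

#define N 10000000

int main(void)
{
    double *a = malloc(N * sizeof(double));

    /* Parallel first touch: each thread writes, and therefore places,
       the pages it will later work on. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Later loops use the same static schedule, so threads mostly access
       memory in their own NUMA region. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] += 1.0;

    free(a);
    return 0;
}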

Distributed memory architectures: clusters and interconnects

Multiple Computers
[Diagram: several processors (nodes) connected by an interconnect]
Each self-contained part is called a node; each node runs its own copy of the OS.

Distributed-memory architectures
- Almost all HPC machines are distributed-memory machines.
- The performance of parallel programs often depends on the interconnect performance, although once the interconnect is of a certain (high) quality, applications usually reveal themselves to be CPU, memory or I/O bound.
- Low-quality interconnects (e.g. 10 Mb/s - 1 Gb/s Ethernet) do not usually provide the performance required.
- Specialist interconnects are required to produce the largest supercomputers, e.g. Cray Aries, IBM BlueGene/Q; Infiniband is dominant on smaller systems.
- High bandwidth is relatively easy to achieve; low latency is usually more important and harder to achieve.
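
A distributed-memory machine is programmed by passing messages between processes on different nodes; every such message crosses the interconnect, which is why its latency and bandwidth matter. A minimal sketch (the message size and tag are arbitrary illustrative choices):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double buf = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        buf = 3.14;
        /* Send one double to rank 1; if the ranks are on different
           nodes, this travels over the interconnect. */
        MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", buf);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes (e.g. mpirun -np 2); the time taken by small messages like this one is dominated by interconnect latency rather than bandwidth.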

Distributed/shared memory hybrids: almost everything now falls into this class

Multicore nodes
In a real system:
- each node will be a shared-memory system, e.g. a multicore processor
- the network will have some specific topology, e.g. a regular grid

Hybrid architectures
- It is now normal to have NUMA nodes, e.g. multi-socket systems with multicore processors.
- Each node still runs a single copy of the OS.

Hybrid architectures
- Almost all HPC machines fall into this class.
- Most applications use a message-passing (MPI) programming model, usually with a single process per core.
- There is increased use of hybrid message-passing plus shared-memory (MPI+OpenMP) programming: usually one or more processes per NUMA region, with the appropriate number of shared-memory threads to occupy all the cores.
- Placement of processes and threads can become complicated on these machines.
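
A hedged sketch of the hybrid MPI+OpenMP model described above, assuming one MPI process per NUMA region with OpenMP threads filling the cores inside it (the actual process and thread placement is set by the batch system or launcher, not by the code):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request an MPI library that tolerates threaded callers. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each MPI process spawns shared-memory threads on its own cores. */
    #pragma omp parallel
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}

On nodes like those in the examples that follow, one might run two MPI processes per node (one per socket/NUMA region), each with as many OpenMP threads as the socket has cores.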

Examples
ARCHER has two 12-way multicore processors per node:
- 2 x 2.7 GHz Intel E5-2697 v2 (Ivy Bridge) processors
- each node is a 24-core, shared-memory, NUMA machine
- each node is controlled by a single copy of Linux
- 4920 nodes connected by the high-speed Cray Aries network
Cirrus has two 18-way multicore processors per node:
- 2 x 2.1 GHz Intel E5-2695 v4 (Broadwell) processors
- each node is a 36-core, shared-memory, NUMA machine
- each node is controlled by a single copy of Linux
- 280 nodes connected by a high-speed Infiniband (IB) fabric

Accelerators: how are they incorporated?

Including accelerators
- Accelerators are usually incorporated into HPC machines using the hybrid architecture model: a number of accelerators per node, with the nodes connected by an interconnect.
- Communication from accelerator to accelerator depends on the hardware:
  - NVIDIA GPUs support direct communication
  - AMD GPUs have to communicate via CPU memory
  - Intel Xeon Phi communicates via CPU memory
- Communicating via CPU memory involves lots of extra copy operations and is usually very slow.
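
As an illustration of the host/device copies mentioned above, the sketch below offloads a loop to an accelerator using standard OpenMP target directives (an illustrative choice; accelerators are also commonly programmed with CUDA, OpenACC or OpenCL, none of which the slides prescribe). The map clauses describe the data copied between CPU memory and accelerator memory:

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    for (int i = 0; i < N; i++)
        b[i] = (double)i;

    /* b is copied to the device, the loop runs there, and a is copied
       back: exactly the kind of host-device traffic that makes
       accelerator-to-accelerator communication via CPU memory slow. */
    #pragma omp target teams distribute parallel for map(to: b) map(from: a)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[42] = %f\n", a[42]);
    return 0;
}

With a compiler that lacks offload support the region simply runs on the host, so the sketch is safe to try anywhere.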

Example: ARCHER KNL
- 12 nodes with Knights Landing (Xeon Phi) processors were recently added.
- Each node has a 64-core KNL with 4 concurrent hyper-threads per core.
- Each node has 96 GB RAM, and each KNL has 16 GB of on-chip memory.
- The KNL is self-hosted, i.e. it sits in place of the CPU.
- Parallelism is via shared memory (OpenMP) or message passing (MPI); inter-node parallelism is via message passing.
- Specific considerations are needed for good performance.
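
One of those considerations is making good use of the 16 GB of on-chip (MCDRAM) memory. A hedged sketch of one common approach, assuming the MCDRAM is configured in "flat" mode and that the memkind library (with its hbw_malloc/hbw_free allocator) is available; none of this is prescribed by the slides, and MCDRAM can also be used transparently as a cache:

#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory allocator */

#define N 10000000

int main(void)
{
    /* Put the bandwidth-critical array in on-chip MCDRAM... */
    double *hot = hbw_malloc(N * sizeof(double));
    /* ...and less critical data in ordinary DDR RAM. */
    double *cold = malloc(N * sizeof(double));

    #pragma omp parallel for
    for (long i = 0; i < N; i++) {
        hot[i]  = 1.0;
        cold[i] = 0.0;
    }

    hbw_free(hot);
    free(cold);
    return 0;
}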

Filesystems: how is data stored?

High performance I/O
- We have focused so far on the significant computational power of HPC machines; it is important that writing to and reading from the filesystem does not negate this.
- High performance filesystems, such as Lustre, are used: compute nodes are typically diskless and connect to the filesystem over the network.
- Due to its size, this high performance filesystem is often NOT backed up.
- It is connected to the nodes in two common ways (see the following slides).
- Not a silver bullet! There are lots of configuration and I/O techniques which need to be leveraged for good performance; these are beyond the scope of this course.
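
One such technique is parallel I/O, where many processes write to a single shared file and the library coordinates access to the parallel filesystem. A minimal MPI-IO sketch (the file name, block size and write pattern are illustrative assumptions):

#include <mpi.h>

#define N 1000

int main(int argc, char **argv)
{
    int rank;
    double local[N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N; i++)
        local[i] = (double)rank;   /* dummy data */

    MPI_File_open(MPI_COMM_WORLD, "results.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its own block of the shared file; the collective
       call lets the MPI library aggregate requests for the filesystem. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * N * sizeof(double),
                          local, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}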

Single unified filesystem
- One filesystem for the entire machine: all parts of the machine can see it and all files are stored in it, e.g. Cirrus (406 TiB Lustre filesystem).
- Advantages: conceptually simple, as all files are stored on the same filesystem; preparing runs (e.g. compiling code) exhibits good I/O performance.
- Disadvantages: lack of backup on the machine; the high performance filesystem can get clogged with significant unnecessary data (such as results from post-processing or source code).
[Diagram: login/post-processing nodes and compute nodes both attached to the high performance filesystem]

Multiple disparate filesystems
- A high performance filesystem focused on execution, plus other filesystems for preparing and compiling code, as well as long-term data storage.
- E.g. ARCHER, which also has an additional (low performance, huge capacity) long-term data storage filesystem.
- Advantages: the home filesystem is typically backed up.
- Disadvantages: more complex, as the high performance filesystem is the only one visible from the compute nodes.
- The high performance filesystem is sometimes called "work" or "scratch".
[Diagram: login/post-processing nodes see both the home filesystem and the high performance filesystem; compute nodes see only the high performance filesystem]

Comparison of machine types: what is the difference between different tiers of machine?

HPC Facility Tiers
HPC facilities are often spoken about as belonging to tiers:
- Tier 0: pan-national facilities
- Tier 1: national facilities, e.g. ARCHER
- Tier 2: regional facilities, e.g. Cirrus
- Tier 3: institutional facilities
A list of Tier 2 facilities is available at https://www.epsrc.ac.uk/research/facilities/hpc/tier2/

Summary

Summary
- The vast majority of HPC machines are shared-memory nodes linked by an interconnect: hybrid HPC architectures, a combination of shared and distributed memory.
- Most are programmed using a pure MPI model (more later on MPI), which does not really reflect the hardware layout.
- Accelerators are incorporated at the node level; very few applications can use multiple accelerators in a distributed memory model.
- Shared HPC machines span a wide range of sizes: from Tier 0 multi-petaflop machines (~1 million cores) to workstations with multiple CPUs (plus accelerators).