
Introduction to Shared memory architectures
Carlo Cavazzoni, HPC department, CINECA

Modern Parallel Architectures
Two basic architectural schemes:
- Distributed Memory
- Shared Memory
Today most computers have a mixed architecture; adding accelerators gives hybrid architectures.

Distributed Memory
(diagram: several nodes, each with its own local memory, connected through a NETWORK)

Shared Memory
(diagram: several processors all attached to a single shared memory)

Mixed Architectures
(diagram: shared-memory nodes, each with its own memory, connected through a NETWORK)

Most Common Networks
- switched
- cube, hypercube, n-cube
- torus in 1, 2, ..., N dimensions
- fat tree

Roadmap to Exascale (architectural trends)

Dennard scaling law (new generation vs. old generation)
When it held:  L = L/2,  V = V/2,  F = F*2,  D = 1/L^2 = 4D,  P = P
It does not hold anymore! The core frequency and performance do not grow following Moore's law any longer.
Now:  L = L/2,  V = ~V,  F = ~F*2,  D = 1/L^2 = 4D,  P = 4P.  The power crisis!
The number of cores is increased instead, to keep the architecture's evolution on Moore's law. The programming crisis!
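To make the power claim concrete, here is a minimal worked derivation using the standard dynamic-power model P ∝ C V^2 f (the model itself is an assumption; the slide only states the scaled quantities). Taking the switched capacitance C to scale with the feature size L, and multiplying by the transistor density D to get power per unit area:

\[
\begin{aligned}
\text{Dennard era: } & P_{\text{area}} \propto \tfrac{C}{2}\left(\tfrac{V}{2}\right)^{2}(2f)\times 4D = C V^{2} f D \quad \text{(constant power density)}\\
\text{Today: }       & P_{\text{area}} \propto \tfrac{C}{2}\,V^{2}\,(2f)\times 4D = 4\,C V^{2} f D \quad \text{(power density grows 4x per generation)}
\end{aligned}
\]

The 4x growth per generation is exactly the "P = 4P" on the slide, and is why the core count, rather than the frequency, is now what scales.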

MPI inter-process communications: MPI on multi-core
- 1 MPI process / core
- stresses the network
- stresses the OS
- MPI_BCAST
- many MPI codes (e.g. QE, Quantum ESPRESSO) are based on ALLTOALL: messages = processes * processes
We need to exploit the network hierarchy, re-design applications, and mix message passing with multi-threading (a sketch follows below).
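As a rough illustration of the quadratic message count (a toy calculation of my own, not from the slides; the node and core counts are hypothetical), the fragment below compares an ALLTOALL with one MPI rank per core against one rank per node with threads inside:

   program alltoall_messages
     implicit none
     integer, parameter :: nodes = 1024, cores_per_node = 16
     integer :: pure_mpi, hybrid

     ! pure MPI: one rank per core -> messages grow as processes**2
     pure_mpi = (nodes * cores_per_node)**2
     ! hybrid: one rank per node, threads inside -> far fewer messages
     hybrid   = nodes**2

     print *, 'ALLTOALL messages, pure MPI :', pure_mpi
     print *, 'ALLTOALL messages, hybrid   :', hybrid
   end program alltoall_messages

With these numbers the pure-MPI ALLTOALL exchanges 256 times more messages than the hybrid one, which is the motivation for mixing MPI with threading.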

What about Applications?
In a massively parallel context, an upper limit on the scalability of parallel applications is set by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law):
maximum speedup tends to 1 / (1 - P), where P is the parallel fraction.
With 1,000,000 cores this requires P = 0.999999, i.e. a serial fraction of 0.000001.
FERMI (BG/Q) total concurrency: 655,360 (65,536 per rack).
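A short worked example (my own numbers, using the full Amdahl formula rather than only its asymptote): with parallel fraction P = 0.999999 on N = 655360 hardware threads,

\[
S(N) = \frac{1}{(1-P) + P/N} = \frac{1}{10^{-6} + 0.999999/655360} \approx 3.96\times 10^{5},
\]

well below the asymptotic limit \(1/(1-P) = 10^{6}\): even a serial fraction of one millionth caps the usable concurrency at this scale.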

Programming Models
- Message Passing (MPI)
- Shared Memory (OpenMP)
- Partitioned Global Address Space (PGAS) languages: UPC, Coarray Fortran, Titanium
- Next-generation programming languages and models: Chapel, X10, Fortress
- Languages and paradigms for hardware accelerators: CUDA, OpenCL
- Hybrid: MPI + OpenMP + CUDA/OpenCL
(a minimal PGAS sketch follows below)
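Of the models listed above, PGAS is probably the least familiar; the fragment below is a minimal Coarray Fortran sketch (my own toy example, not from the slides), in which every image owns a copy of x and image 1 reads a remote copy directly through coarray syntax:

   program pgas_sketch
     implicit none
     real    :: x[*]              ! coarray: one copy of x on every image
     integer :: me, n

     me = this_image()
     n  = num_images()
     x  = real(me)                ! each image writes only its own copy
     sync all                     ! make all writes visible
     if (me == 1) print *, 'x on image', n, '=', x[n]   ! one-sided remote read
   end program pgas_sketch

Compared with MPI there is no explicit send/receive: the remote access x[n] is part of the language (Fortran 2008; with gfortran it needs a coarray-enabled build such as OpenCoarrays).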

MPI + OpenMP
MPI (inter-node):
- distributed memory systems
- message passing
- data distribution model
- Version 2.1 (09/08)
- API for C/C++ and Fortran
OpenMP (intra-node):
- shared memory systems
- thread creation
- relaxed-consistency memory model
- Version 3.0 (05/08)
- compiler directives for C/C++ and Fortran
(a minimal hybrid skeleton follows below)
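A minimal hybrid skeleton, assuming an MPI library providing the mpi module and an OpenMP-capable compiler (my own sketch, not code from the slides): one MPI rank per node with OpenMP threads inside it, requesting MPI_THREAD_FUNNELED so that only the master thread makes MPI calls.

   program hybrid_sketch
     use mpi
     use omp_lib
     implicit none
     integer :: ierr, provided, rank, nranks, nthreads

     ! request FUNNELED support: MPI calls are made by the master thread only
     call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

     !$omp parallel
     !$omp master
     nthreads = omp_get_num_threads()
     print *, 'rank', rank, 'of', nranks, 'uses', nthreads, 'OpenMP threads'
     !$omp end master
     !$omp end parallel

     call MPI_Finalize(ierr)
   end program hybrid_sketch

Launched with, for example, one rank per node and OMP_NUM_THREADS set to the core count, this keeps the ALLTOALL message count at nodes^2 instead of (nodes * cores)^2.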

Shared memory
(diagram: threads 0-3 all accessing variables x and y in the same shared memory)

Shared Memory: OpenMP
Main characteristics:
- compiler directives
- medium grain
- intra-node parallelization (on top of pthreads)
- loop or iteration partition
- shared memory
- many HPC applications
Open issues:
- thread creation overhead
- memory/core affinity
- interface with MPI

OpenMP example (pseudocode: 1D FFT sweeps along z, y and x):

!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do

call fw_scatter (... )

!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel

FERMI @ CINECA, PRACE Tier-0 System
- Architecture: 10 BG/Q frames
- Model: IBM BG/Q
- Processor type: IBM PowerA2, 1.6 GHz
- Computing cores: 163,840
- Computing nodes: 10,240
- RAM: 1 GByte / core
- Internal network: 5D torus
- Disk space: 2 PByte of scratch space
- Peak performance: 2 PFlop/s
ISCRA & PRACE calls for projects now open!

BG/Q packaging hierarchy:
1. Chip: 16 PowerA2 cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory
4. Node Card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 Node Cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots
6. Rack: 2 midplanes
7. System: 20 PF/s
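Putting the packaging numbers together with the FERMI figures quoted earlier (my own back-of-the-envelope check):

\[
32 \times 16 \times 2 = 1024 \ \text{nodes/rack},\qquad
1024 \times 16 = 16384 \ \text{cores/rack},\qquad
16384 \times 4\ \text{(SMT)} = 65536 \ \text{hardware threads/rack},
\]

which over FERMI's 10 racks gives the 655,360 total concurrency mentioned on the Amdahl slide.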

BG/Q I/O architecture
(diagram: BG/Q compute racks -> BG/Q I/O nodes (PCIe) -> IB or 10 GE -> switch -> IB or 10 GE -> RAID storage & file servers)
- BlueGene classic I/O with GPFS clients on the logical I/O nodes, similar to BG/L and BG/P
- uses an InfiniBand switch
- uses DDN RAID controllers and file servers
- BG/Q I/O nodes are not shared between compute partitions
- I/O nodes bridge data from function-shipped I/O calls to the parallel file system client
- components are balanced so that a specified minimum compute partition size can saturate the entire storage array I/O bandwidth

Node Board (32 Compute Nodes): 5D torus, 2x2x2x2x2
(diagram: torus dimensions A, B, C, D, E, with E the twin direction, always 2; nodes 0-31 labelled with their torus coordinates, e.g. (0,0,0,0,0), (0,0,0,0,1), (0,0,0,1,0))

PowerA2 chip, basic info
- 16 cores + 1 + 1 (16 for user code, 1 for the OS, 1 spare)
- 1.6 GHz
- 32 MByte cache
- system-on-a-chip design
- 16 GByte of RAM at 1.33 GHz
- peak performance: 204.8 GFlop/s
- power draw of 55 watts
- 45-nanometer copper/SOI process (same as Power7)
- water cooled

PowerA2 core
- 4 FPU
- 4-way SMT
- 64-bit instruction set
- in-order dispatch, execution, and completion
- 16 KB of L1 data cache
- 16 KB of L1 instruction cache

PowerA2 FPU
- each FPU on each core has four pipelines, which can execute:
  - scalar floating-point instructions
  - four-wide SIMD instructions ("quad-pumped")
  - two-wide complex-arithmetic SIMD instructions
  - permute instructions
- six-stage pipeline
- maximum of eight concurrent floating-point operations per clock, plus a load and a store
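These figures are consistent with the peak numbers quoted earlier (my own check, assuming the eight operations per clock come from four-wide fused multiply-adds):

\[
16 \ \text{cores} \times 8 \ \tfrac{\text{flop}}{\text{cycle}} \times 1.6\ \text{GHz} = 204.8 \ \text{GFlop/s per chip},\qquad
204.8 \ \text{GFlop/s} \times 10240 \ \text{nodes} \approx 2.1 \ \text{PFlop/s},
\]

matching the 204.8 GFlop/s per chip and the roughly 2 PFlop/s peak quoted for FERMI.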