Trends in HPC Architectures and Parallel Programming
Giovanni Erbacci - g.erbacci@cineca.it
Supercomputing, Applications & Innovation Department - CINECA

Agenda
- Computational Sciences
- Trends in Parallel Architectures
- Trends in Parallel Programming
- PRACE

Computational Sciences
Computational science, together with theory and experimentation, is the third pillar of scientific inquiry, enabling researchers to build and test models of complex phenomena.
Quick evolution of innovation:
- Instantaneous communication
- Geographically distributed work
- Increased productivity
- More data everywhere
- Increasing problem complexity
- Innovation happens worldwide

Technology Evolution
More data everywhere: radar, satellites, CAT scans, weather models, the human genome. The size and resolution of the problems scientists address today are limited only by the size of the data they can reasonably work with. There is a constantly increasing demand for faster processing on bigger data.
Increasing problem complexity: partly driven by the ability to handle bigger data, but also by the requirements and opportunities brought by new technologies. For example, new kinds of medical scans create new computational challenges.
HPC Evolution: as technology allows scientists to handle bigger datasets and faster computations, they push to solve harder problems. In turn, the new class of problems drives the next cycle of technology innovation.

Computational Sciences today
Multidisciplinary problems, coupled applications:
- Full simulation of engineering systems
- Full simulation of biological systems
- Astrophysics
- Materials science
- Bio-informatics, proteomics, pharmaco-genetics
- Scientifically accurate 3D functional models of the human body
- Biodiversity and biocomplexity
- Climate and atmospheric research
- Energy
- Digital libraries for science and engineering
Large amounts of data, complex mathematical models.

HPC Architectures
Performance comes from:
- Device technology: logic switching speed and device density, memory capacity and access time, communications bandwidth and latency
- Computer architecture: instruction issue rate, execution pipelining, reservation stations, branch prediction, cache management
- Parallelism: number of operations per cycle per processor, instruction level parallelism (ILP), vector processing, number of processors per node, number of nodes in a system
Diagram: block schemes of Shared Memory Systems, Distributed Shared Memory, and Distributed Memory Systems, each built from CPUs, memories and an interconnection network.

HPC Architectures: Parameters affecting Performance
- Peak floating point performance
- Main memory capacity
- Bi-section bandwidth
- I/O bandwidth
- Secondary storage capacity
- Organization: class of system, # nodes, # processors per node, accelerators, network topology
- Control strategy: MIMD, vector (PVP), SIMD, SPMD

HPC systems evolution
- Vector processors: Cray-1
- SIMD, array processors: Goodyear MPP, MasPar 1 & 2, TMC CM-2
- Parallel vector processors (PVP): Cray X-MP, Y-MP, C90; NEC Earth Simulator, SX-6
- Massively parallel processors (MPP): Cray T3D, T3E, TMC CM-5, Blue Gene/L
- Commodity clusters: Beowulf-class PC/Linux clusters
- Constellations
- Distributed shared memory (DSM): SGI Origin, HP Superdome
- Hybrid HPC systems: Roadrunner, the Chinese Tianhe-1A system, GPGPU systems

HPC systems evolution in CINECA
- 1969: CDC 6600, 1st system for scientific computing
- 1975: CDC 7600, 1st supercomputer
- 1985: Cray X-MP / 4 8, 1st vector supercomputer
- 1989: Cray Y-MP / 4 64
- 1993: Cray C-90 / 2 128
- 1994: Cray T3D 64, 1st parallel supercomputer
- 1995: Cray T3D 128
- 1998: Cray T3E 256, 1st MPP supercomputer
- 2002: IBM SP4 512, 1 Teraflop/s
- 2005: IBM SP5 512
- 2006: IBM BCX, 10 Teraflop/s
- 2009: IBM SP6, 100 Teraflop/s
- 2012: IBM BG/Q, 2 Petaflop/s

BG/Q in CINECA
- 10 BG/Q frames, 10240 nodes
- 1 PowerA2 processor per node, 16 cores per processor: 163840 cores in total
- 1 GByte per core
- 2 PByte of scratch space
- 2 PFlop/s peak performance
- 1 MWatt, liquid cooled
The 16-core chip runs at 1.6 GHz; a crossbar switch links the cores and the L2 cache together, and nodes are connected by a 5D torus interconnect.
The Power A2 core has a 64-bit instruction set (unlike the prior 32-bit PowerPC chips used in BG/L and BG/P). Each A2 core has four hardware threads and uses in-order dispatch, execution, and completion, instead of the out-of-order execution common in many RISC processor designs. The A2 core has 16 KB of L1 data cache and another 16 KB of L1 instruction cache. Each core also includes a quad-pumped double-precision floating point unit with four pipelines, which can be used to execute scalar floating point instructions, four-wide SIMD instructions, or two-wide complex arithmetic SIMD instructions.
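As a rough sanity check of the peak figure (assuming the quad FPU retires a fused multiply-add on each of its four lanes, i.e. 8 flops per cycle per core): 16 cores × 8 flops/cycle × 1.6 GHz = 204.8 GFlop/s per node, and 204.8 GFlop/s × 10240 nodes ≈ 2.1 PFlop/s, consistent with the 2 PFlop/s peak quoted above.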

Top 500: some facts
- 1976: Cray-1 installed at Los Alamos, peak performance 160 MegaFlop/s (10^6 flop/s)
- 1993 (1st edition of the Top 500): the no. 1 system reached 59.7 GFlop/s (10^9 flop/s)
- 1997: Teraflop/s barrier broken (10^12 flop/s)
- 2008: Petaflop/s (10^15 flop/s): Roadrunner (LANL), Rmax 1026 TFlop/s, Rpeak 1375 TFlop/s; a hybrid system with 6562 dual-core AMD Opteron processors accelerated by 12240 IBM Cell processors (98 TByte of RAM)
- 2011: 11.2 Petaflop/s: K computer (SPARC64 VIIIfx 2.0 GHz, Tofu interconnect), RIKEN, Japan
- 62% of the systems in the Top 500 use processors with six or more cores
- 39 systems use GPUs as accelerators (35 NVIDIA, 2 Cell, 2 ATI Radeon)

HPC Evolution
Moore's law is holding, in the number of transistors: transistors on an ASIC are still doubling every 18 months at constant cost. However, 15 years of exponential clock-rate growth has ended.
Moore's law reinterpreted:
- Performance improvements now come from increasing the number of cores on a processor (ASIC)
- #cores per chip doubles every 18 months instead of the clock rate
- 64-512 threads per node will become visible soon
- Million-way parallelism
(From Herb Sutter <hsutter@microsoft.com>)
Heterogeneity: accelerators - GPGPU, MIC

Multi-core
Motivation for multi-core:
- Exploits improved feature size and density
- Increases functional units per chip (spatial efficiency)
- Limits energy consumption per operation
- Constrains growth in processor complexity
Challenges resulting from multi-core:
- Relies on effective exploitation of multiple-thread parallelism: there is a need for a parallel computing model and a parallel programming model
- Aggravates the memory wall: memory bandwidth (ways to get data out of the memory banks and into the multi-core processor array), memory latency, and fragmentation of the (shared) L3 cache
- Pins become a strangle point: the rate of pin growth is projected to slow and flatten, and bandwidth per pin (pair) is projected to grow only slowly
- Requires mechanisms for efficient inter-processor coordination: synchronization, mutual exclusion, context switching

Heterogeneous Multicore Architecture
Combines different types of processors, each optimized for a different operational modality; for complex computations exhibiting distinct modalities, such a synthesis can deliver performance more than N times better than any one of the N processor types alone.
Conventional co-processors:
- Graphical processing units (GPU)
- Network controllers (NIC)
- Efforts are underway to apply these existing special-purpose components to general applications
Purpose-designed accelerators:
- Integrated to significantly speed up some critical aspect of one or more important classes of computation
- IBM Cell architecture
- ClearSpeed SIMD attached array processor

Real HPC Crisis is with Software
A supercomputer application and its software are usually much longer-lived than the hardware:
- Hardware life is typically four to five years at most
- Fortran and C are still the main programming models
Programming is stuck: arguably it hasn't changed much since the 70s.
Software is a major cost component of modern technologies:
- The tradition in HPC system procurement is to assume that the software is free
It's time for a change:
- Complexity is rising dramatically
- Challenges for the applications on Petaflop systems
- Improvement of existing codes will become complex and partly impossible
- The use of O(100K) cores implies a dramatic optimization effort
- New paradigms: supporting a hundred threads in one node implies new parallelization strategies
- Implementing new parallel programming methods in existing large applications does not always have a promising perspective
There is a need for new community codes.

Roadmap to Exascale (architectural trends)

What about parallel applications?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law): the maximum speedup tends to 1 / (1 - P), where P is the parallel fraction.
Example: to exploit 1,000,000 cores, the parallel fraction must be around P = 0.999999, i.e. a serial fraction of 0.000001.
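As a small illustration, here is a sketch assuming the usual finite-core form of Amdahl's law, S(N) = 1 / ((1 - P) + P/N), which reduces to the 1 / (1 - P) bound above as N grows; the value of P is the slide's example number:

    /* Amdahl's law sketch: how the speedup bound behaves for P = 0.999999 */
    #include <stdio.h>

    static double amdahl_speedup(double p, double n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        const double p = 0.999999;                      /* parallel fraction */
        const double cores[] = { 1e3, 1e4, 1e5, 1e6 };  /* core counts to test */

        for (int i = 0; i < 4; i++)
            printf("N = %8.0f   speedup <= %10.1f\n", cores[i], amdahl_speedup(p, cores[i]));
        printf("asymptotic limit 1 / (1 - P) = %.0f\n", 1.0 / (1.0 - p));
        return 0;
    }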

Programming Models
- Message Passing (MPI)
- Shared Memory (OpenMP)
- Partitioned Global Address Space (PGAS) languages: UPC, Coarray Fortran, Titanium
- Next-generation programming languages and models: Chapel, X10, Fortress
- Languages and paradigms for hardware accelerators: CUDA, OpenCL
- Hybrid: MPI + OpenMP + CUDA/OpenCL

Trends
- Scalar applications
- Vector
- MPP systems, message passing (distributed memory): MPI
- Multi-core nodes, shared memory: OpenMP
- Hybrid codes
- Accelerators (GPGPU, FPGA): CUDA, OpenCL

Message Passing: domain decomposition
Diagram: a set of nodes, each with its own memory and CPU, connected through an internal high performance network; the problem domain is partitioned across the node memories.

Ghost Cells - data exchange at sub-domain boundaries
Diagram: a 5-point stencil (i,j), (i-1,j), (i+1,j), (i,j-1), (i,j+1) on a grid split between Processor 1 and Processor 2; the rows of ghost cells along the common boundary are exchanged between the processors at every update.
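A minimal C/MPI sketch of such a ghost-cell exchange, assuming a 1D decomposition with one ghost row on each side and, for brevity, periodic neighbours; the array name f, the sizes NX and nloc, and the use of MPI_Sendrecv are illustrative choices, not taken from the slides:

    /* 1D halo (ghost cell) exchange: each rank owns rows 1..nloc of f,
       plus ghost rows 0 and nloc+1 refreshed from its neighbours every update. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NX 128                                  /* illustrative row length */

    int main(int argc, char **argv)
    {
        int rank, size, nloc = 64;                  /* illustrative local size */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *f = calloc((size_t)(nloc + 2) * NX, sizeof(double));
        int up   = (rank + 1) % size;               /* periodic neighbours for simplicity */
        int down = (rank - 1 + size) % size;

        /* send my last owned row up and receive my lower ghost row from below... */
        MPI_Sendrecv(&f[nloc * NX], NX, MPI_DOUBLE, up,   0,
                     &f[0],         NX, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ...and send my first owned row down, receiving my upper ghost row from above */
        MPI_Sendrecv(&f[1 * NX],          NX, MPI_DOUBLE, down, 1,
                     &f[(nloc + 1) * NX], NX, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... update the interior points using the refreshed ghost rows ... */

        free(f);
        MPI_Finalize();
        return 0;
    }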

Message Passing: MPI
Main characteristics:
- Library
- Coarse grain
- Inter-node parallelization (few real alternatives)
- Domain partition
- Distributed memory
- Almost all HPC parallel applications
Open issues:
- Latency
- OS jitter
- Scalability

Shared memory
Diagram: a single node in which threads 0-3, each running on its own CPU, work on data held in the same shared memory.

Shared Memory: OpenMP
Main characteristics:
- Compiler directives
- Medium grain
- Intra-node parallelization (pthreads)
- Loop or iteration partition
- Shared memory
- Many HPC applications
Open issues:
- Thread creation overhead
- Memory/core affinity
- Interface with MPI

OpenMP (pseudocode sketch: 1D FFTs along z, y and x):

    !$omp parallel do
    do i = 1, nsl
       call 1DFFT along z ( f [ offset( threadid ) ] )
    end do
    !$omp end parallel do

    call fw_scatter (... )

    !$omp parallel
    do i = 1, nzl
       !$omp parallel do
       do j = 1, Nx
          call 1DFFT along y ( f [ offset( threadid ) ] )
       end do
       !$omp parallel do
       do j = 1, Ny
          call 1DFFT along x ( f [ offset( threadid ) ] )
       end do
    end do
    !$omp end parallel

Accelerator/GPGPU
Diagram: sum of a 1D array offloaded to the GPU.
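To make the figure concrete, here is a minimal CUDA sketch of a 1D array sum (array size, block size and kernel name are illustrative choices): the host copies the data to the device, each block reduces its chunk to a partial sum in shared memory, and the host adds up the per-block results. The explicit cudaMemcpy calls are exactly the "memory copy" cost listed as an open issue on the next slide.

    // Minimal CUDA sketch: sum of a 1D array via per-block shared-memory reduction.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define BLOCK 256   // threads per block (power of two, required by the tree reduction)

    __global__ void partial_sum(const float *in, float *block_sums, int n)
    {
        __shared__ float cache[BLOCK];
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        cache[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();

        // tree reduction within the block
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                cache[threadIdx.x] += cache[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            block_sums[blockIdx.x] = cache[0];
    }

    int main()
    {
        const int n = 1 << 20;
        const int nblocks = (n + BLOCK - 1) / BLOCK;

        float *h_in = new float[n];
        float *h_part = new float[nblocks];
        for (int i = 0; i < n; i++) h_in[i] = 1.0f;   // known answer: the sum is n

        float *d_in, *d_part;
        cudaMalloc((void **)&d_in, n * sizeof(float));
        cudaMalloc((void **)&d_part, nblocks * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);   // host -> device copy

        partial_sum<<<nblocks, BLOCK>>>(d_in, d_part, n);                    // offload the fine-grain work

        cudaMemcpy(h_part, d_part, nblocks * sizeof(float), cudaMemcpyDeviceToHost);
        double sum = 0.0;
        for (int i = 0; i < nblocks; i++) sum += h_part[i];                  // finish the reduction on the host
        printf("sum = %.0f (expected %d)\n", sum, n);

        cudaFree(d_in); cudaFree(d_part);
        delete[] h_in; delete[] h_part;
        return 0;
    }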

CUDA - OpenCL
Main characteristics:
- Ad-hoc compiler
- Fine grain
- Offload parallelization (GPU)
- Single-iteration parallelization
- Ad-hoc memory
- Few HPC applications
Open issues:
- Memory copy
- Standard
- Tools
- Integration with other languages

Hybrid (MPI + OpenMP + CUDA + Python)
- Takes the positive aspects of all models
- Exploits the memory hierarchy
- Many HPC applications are adopting this model
- Mainly due to developer inertia: it is hard to rewrite millions of lines of source code

Hybrid parallel programming
- Python: ensemble simulations
- MPI: domain partition
- OpenMP: external loop partition
- CUDA: assign inner-loop iterations to GPU threads
Example: Quantum ESPRESSO, http://www.qe-forge.org/
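A minimal MPI + OpenMP sketch of this layering (the loop body, problem size and reduction are illustrative; the Python and CUDA layers of the slide are omitted): MPI partitions the global index range across ranks, and OpenMP splits each rank's share across the cores of its node.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, size;
        /* request thread support suitable for OpenMP regions between MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long n = 1000000;                 /* illustrative global problem size */
        long begin = rank * n / size;           /* MPI: partition of the index range */
        long end   = (rank + 1) * n / size;

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)   /* OpenMP: split the local loop */
        for (long i = begin; i < end; i++)
            local += 1.0 / (double)(i + 1);           /* stand-in for real work */

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("result = %f (ranks = %d, threads/rank = %d)\n",
                   global, size, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }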

PRACE
Partnership for Advanced Computing in Europe, http://www.prace-project.eu/
PRACE is part of the ESFRI roadmap and has the aim of creating a European Research Infrastructure providing world-class systems and services and coordinating their use throughout Europe. It covers both hardware at the multi-petaflop/s level and the very demanding software (parallel applications) needed to exploit these systems.

PRACE Tier-0 Systems
- JUGENE
- CURIE
- HERMIT
- SuperMUC
- FERMI
- MareNostrum

PRACE: Objectives
- Create and manage a persistent, sustainable pan-European HPC service
- Deploy three to five Pflop/s systems at different European sites (hosting sites: Germany, France, Italy, Spain); the first Pflop/s system was installed at Juelich in Germany
- Establish a legal and organisational structure involving HPC centres, national funding agencies, and scientific user communities
- Develop funding and usage models and establish a peer review process
- Provide training for European scientists and create a permanent education programme
Timeline:
- Late 2009: a 1 PF system in the top 5
- 2011-13: 3 more systems in the top 10
- 2020: the exaflop in the top 5
The call for access to PRACE HPC resources is open until August 2010.

HPC-Europa 2: Providing access to HPC resources
HPC-Europa is a consortium of seven leading HPC infrastructures and five centres of excellence aiming at the integrated provision of advanced computational services to the European research community working at the forefront of science. The services are delivered across a broad spectrum, both in terms of access to HPC systems and of provision of a suitable computational environment, to allow European researchers to remain competitive with teams elsewhere in the world. Moreover, Joint Research and Networking actions contribute to fostering a culture of cooperation and to generating critical mass for computational science.
http://www.hpc-europa.eu/
HPC-Europa 2: 2009-2012 (FP7-INFRASTRUCTURES-2008-1)

HPC-Europa 2: Access
- Provision of transnational access to some of the most powerful HPC facilities in Europe
- Opportunities to collaborate with scientists working in related fields at a relevant local research institute
- HPC consultancy from experienced staff
- Provision of a suitable computational environment
- Travel costs, subsistence expenses and accommodation