Lecture 20: Distributed Memory Parallelism. William Gropp


Lecture 20: Distributed Memory Parallelism William Gropp www.cs.illinois.edu/~wgropp

A Very Short, Very Introductory Introduction We start with a short introduction to parallel computing from scratch in order to get our terminology straight. You may already know this, but it will help get us synchronized (a term from parallel computing). 2

The Starting Point: A Single Computer A single ("normal") computer, circa 2000. Program: a single sequence of instructions. The usual programming languages are compiled into instructions understood by the processor. Most programming models, including parallel ones, assumed this as the computer. Process: program + address space + program counter + stack 3

A Bunch of Serial Computers... Program: many instances of the same sequential program operating on different data in different address spaces. Multiple independent processes with the same program. Examples: high-throughput computing, seti@home and folding@home 4

Multiple Processors per System May use multiple processes, each with its own memory, plus some way of sharing memory and coordinating access (operating-system-specific). One process may have multiple threads (each a program counter + stack) operating on the same address space, guided by the same program. Programming approaches: sequential programming languages with smart compilers; threads; OpenMP, OpenACC versions of C and Fortran 5
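For instance, a minimal OpenMP sketch in C (not from the lecture; OpenMP is one of the models named above) in which one process runs several threads that all share a single address space:

    /* Minimal OpenMP sketch: several threads, one address space.
     * Compile with OpenMP enabled, e.g. cc -fopenmp threads.c */
    #include <omp.h>
    #include <stdio.h>

    int shared_counter = 0;            /* visible to every thread */

    int main(void)
    {
        #pragma omp parallel
        {
            /* Each thread has its own program counter and stack,
             * but reads and writes the same global variable. */
            #pragma omp atomic
            shared_counter++;

            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        printf("counter = %d\n", shared_counter);
        return 0;
    }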

Hardware Variations on Multiple Processors per System SMPs (symmetric multiprocessors): separate chips, circa 1980-2000, each with its own cache. Single chip (all chips today): multicore chips, where the number of cores can be many 6

Multiple Systems Connected by a Network Program: multiple processes, each of which may be multithreaded, each with its own separate address space. The program needs some way of expressing communication among processes, so data can move between address spaces. Many large-scale machines look like this, leading to more interest in programming models that combine the shared and distributed memory approaches to programming. 7
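As a concrete, hypothetical illustration (not from the lecture) of expressing such communication, here is a minimal MPI sketch in C in which rank 0 copies a small buffer into the separate address space of rank 1:

    /* Minimal MPI sketch: moving data between two address spaces.
     * Compile with an MPI compiler wrapper, e.g. mpicc sendrecv.c */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buf[4] = {0.0, 0.0, 0.0, 0.0};

        if (rank == 0) {
            buf[0] = 3.14;   /* data lives in rank 0's address space */
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* the data is copied into rank 1's address space */
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %g\n", buf[0]);
        }

        MPI_Finalize();
        return 0;
    }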

Multiple Systems Connected by a Network Most systems have several chips sharing memory, physically placed on the same board (or part of a board). We'll call this a node. Beware: definitions change with time (what we call a core used to be a processor) and often carry assumptions. Nodes in the near future may not share memory, or may not provide cache-coherent shared memory, even within a single chip. 8

What about that Network? The network or interconnect is what distinguishes a distributed memory parallel computer. Important characteristics of parallel computer interconnects include: How are the nodes connected together? (This is the topology of the network.) How are the nodes attached to the network? What is the performance of the network? One important performance measure is the bisection bandwidth: if I cut the network in half, and add up the bandwidth across all of the cut links, what is the minimum value across all possible cuts? 9
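Restating that definition in symbols (a sketch, not on the original slide): treat the network as a graph with node set V of size N, link set E, and per-link bandwidth b(u,v); then

    B_{\mathrm{bisection}} \;=\; \min_{S \subset V,\ |S| = N/2} \;\sum_{(u,v) \in E,\ u \in S,\ v \notin S} b(u,v)

The topologies on the following slides differ mainly in how this quantity grows as nodes are added.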

A Simple Interconnect Each node (a vertex in the graph, shown as a circle in the figure) is connected to its nearest neighbors in the x and y directions (but not diagonally). Data is routed from one node to another by traveling over one or more links (the edges of the graph). 10

Bisection Bandwidth Try all cuts in half and take the minimum. In this case, the bisection bandwidth is 3 times the individual link bandwidth. 11

Mesh and Torus Interconnects Each compute node is a node on a mesh or torus, which may have 1, 2, 3, or more dimensions. A torus simply adds links connecting the ends together; clever physical layout keeps these links the same length. Note that a rectangular subset of a mesh is a mesh, but a rectangular subset of a torus is not necessarily a torus (it's a mesh). 12
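A hypothetical sketch (not from the lecture) of how an MPI program in C can place its processes on a 2D torus of this kind and find each process's neighbors, using MPI's standard Cartesian topology routines:

    /* Place MPI processes on a 2D torus and find nearest neighbors.
     * Compile with an MPI compiler wrapper, e.g. mpicc torus.c */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int size, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Let MPI choose a 2D decomposition of the processes */
        int dims[2] = {0, 0};
        MPI_Dims_create(size, 2, dims);

        /* periods = 1 in both dimensions gives a torus; 0 gives a mesh */
        int periods[2] = {1, 1};
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

        MPI_Comm_rank(cart, &rank);

        /* Ranks of the neighbors one step away in each dimension */
        int left, right, down, up;
        MPI_Cart_shift(cart, 0, 1, &left, &right);
        MPI_Cart_shift(cart, 1, 1, &down, &up);

        printf("rank %d: x-neighbors (%d, %d), y-neighbors (%d, %d)\n",
               rank, left, right, down, up);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }

Setting the entries of periods to 0 instead of 1 describes a mesh rather than a torus, matching the distinction above.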

Mesh and Torus Features Advantages: constant cost to scale to more nodes; simple routing algorithms; relatively easy to reason about performance; matches certain problems well. Disadvantages: bisection bandwidth does not scale with size; for general communication patterns, network contention limits performance. 13

Question How does the bisection bandwidth of a 2D mesh scale with the number of nodes? Consider an n x n mesh (so n^2 nodes) and compute the bisection bandwidth per node as a function of n and the link bandwidth L. 14

Overview of Blue Waters Cray XE6 Nodes Blue Waters contains 22,640 Cray XE6 compute nodes. Dual-socket node: two AMD Interlagos chips (16 core modules, 64 threads), 313.6 GF peak performance, 64 GB memory, 102 GB/sec memory bandwidth. Gemini interconnect: router chip & network interface; peak injection bandwidth 9.6 GB/sec per direction; 2 links in the x and z directions, only 1 link in the y direction. 15

Overview of Blue Waters Cray XK7 Nodes Blue Waters contains 4,224 Cray XK7 compute nodes. Dual-socket node: one AMD Interlagos chip (same as the XE6 nodes) and one NVIDIA Kepler GPU (>1 TF peak performance, 6 GB GDDR5 memory, 180 GB/sec bandwidth) attached via PCIe Gen2. Gemini interconnect: same as the XE6 nodes. 16

Overview of Blue Waters Gemini Interconnect Network Blue Waters is a 3D torus of size 24 x 24 x 24, with service nodes spread throughout the torus. Compute nodes: Cray XE6 compute and Cray XK7 accelerator nodes. Service nodes: operating system, login/network, boot, login gateways, system database, network, Lustre file system, and LNET routers. External connections (InfiniBand, GigE, Fibre Channel) attach the login servers, SMW, boot RAID, and Lustre storage. 17

Multilevel Networks The network is made up of switches with many ports that are connected to compute nodes and to other switches. Modern switches can handle many ports simultaneously (> 100). 18

Example Fat Tree Network From http://www-sop.inria.fr/members/fabien.hermenier/btrpcc/img/fat_tree.png 19

Features of Multilevel Networks Advantages: bisection bandwidth can be high, and can be increased by adding more switches; the number of hops between nodes is small. Disadvantages: cost grows faster than linearly with the number of compute nodes; more complex, and harder to reason about performance. 20

Multilevel Networks Part 2 As you move up in level (away from the compute nodes), it is common to need more bandwidth between the switches. Solution: use higher-bandwidth links, especially optical links. Electrical links are cheap and fast over short distances; optical links have enormous bandwidth, even over long distances, but are expensive. 21

IH Server Node (Drawer) and Supernode Drawer specifications: 8 quad-chip modules (Power7 QCMs), up to 8 TFlops, 1 TByte of memory, 4 TB/s memory bandwidth, 8 Hub chips, 9 TB/sec per Hub. Packaging: 2U drawer, 39 w x 72 d, > 300 lbs., fully water cooled (QCMs, Hubs, and memory). Supernode specifications: 4 IH server nodes, up to 32 TFlops, 4 TBytes of memory, 16 TB/s memory bandwidth, 32 Hub chips, 36 TB/sec. 22

Logical View of IBM PERCS Interconnect: Drawer Full direct connectivity among the IBM Power7 QCMs in a drawer, via Hub chips (HFI, CAU, and Integrated Switch Router on the Power7 coherency bus). L-local links connect Hubs within a drawer (336 GB/s). 23

Logical View of IBM PERCS Interconnect: Supernode Full direct connectivity. L-local links within a drawer (336 GB/s); L-remote links (1...24) between drawers in a supernode (240 GB/s). 24

Logical View of IBM PERCS Interconnect: Full System L-local links within drawers (336 GB/s); L-remote links (1...24) between drawers in a supernode (240 GB/s); D-links (1...16) between supernodes (320 GB/s), for up to 512 supernodes. 25

Two-level (L, D) Direct-connect Network Each supernode = 32 QCMs (4 drawers x 8 SMPs/drawer), fully interconnected with L-local and L-remote links. The original Blue Waters plan = 320 supernodes (40 BBs x 8 SNs/BB), fully interconnected with D links. Result: very low hardware latency and very high bandwidth, but a complex, nonuniform network. 26

Another Multilevel Network The Cray Aries network uses a similar approach, called a dragonfly network. It uses mixed electrical/optical connections. Roughly, it is a hierarchy of completely connected networks; no more than 4 hops for the sizes considered. Presentation on the Cray XC30: http://www.nersc.gov/assets/uploads/NERSC.XC30.overview.pdf Patent for the Dragonfly network: http://www.faqs.org/patents/app/20100049942 27

Mesh and Torus Revisited We've assumed that the node communicates with the mesh through a vertex in the mesh. Instead, the node could communicate with the mesh directly, through the edges (links) from that vertex. 28

IBM Blue Gene Series The IBM Blue Gene systems use a mesh/torus. A node occupies a vertex and directly manages each link (edge) from that vertex separately. Each link is of modest speed; the node must drive them all to get bandwidth comparable to other systems. 29

BG/Q Midplane, 512 nodes, 4x4x4x4x2 Torus 30

More Information on BG/Q A nice introduction to ANL's BG/Q, including details of the memory system (see esp. slide 18 on the network): http://extremecomputingtraining.anl.gov/files/2014/01/20140804-atpescparker-bgq-arch.pdf 31

Why This Focus on the Interconnect? Distributed memory parallel computers are just regular computers; the nodes are programmed like any other. Designing for and programming distributed memory means thinking about how data moves between the compute nodes. 32

Performance Model A simple and often adequate performance model for the time to move data between nodes is T = latency + length / bandwidth, i.e., T = s + r n, where r = 1/bandwidth. On modern HPC systems, latency is 1-10 microseconds and bandwidths are 0.1 to 10 GB/sec. This model has many limitations but is often adequate; e.g., it does not include the effect of distance or of contention with other messages. We'll discuss other models later. 33
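A small C sketch (not from the lecture) that evaluates the T = s + r n model; the 1 microsecond latency and 10 GB/sec bandwidth are assumed values picked from the ranges quoted above:

    /* Evaluate the T = s + r*n model for a few message sizes.
     * s and bw below are assumed, illustrative values. */
    #include <stdio.h>

    int main(void)
    {
        const double s  = 1.0e-6;      /* latency, seconds (assumed)     */
        const double bw = 10.0e9;      /* bandwidth, bytes/sec (assumed) */
        const double r  = 1.0 / bw;    /* time per byte                  */

        long sizes[] = {8, 1024, 1048576, 1073741824L};
        for (int i = 0; i < 4; i++) {
            long   n = sizes[i];
            double t = s + r * (double)n;          /* T = s + r*n */
            printf("n = %10ld bytes: T = %.3e s, effective bw = %.3e B/s\n",
                   n, t, (double)n / t);
        }
        return 0;
    }

With these assumed numbers, an 8-byte message is almost entirely latency, while the effective bandwidth only approaches 10 GB/sec once messages reach roughly a megabyte.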

Questions What sort of network does your system have? Read up on the Blue Waters (Cray Gemini) interconnect. 34

Questions Consider a 1-dimensional torus where each link has bandwidth b, and consider the following cases: 1. Each node sends n bytes to the node to its immediate left. 2. Each node sends n bytes to the node k links away to the left. All communication starts at the same time; ignore latency. In each of the above cases, how long does it take the communication to complete, in terms of n and b? In the second case, what is the effective bandwidth for the communication (e.g., if you wanted to use the T = s + n/b model, what value of b should you pick as a function of k)? 35