Cray XE6 Performance Workshop


Cray XE6 Performance Workshop
Modern HPC Architectures

David Henty
d.henty@epcc.ed.ac.uk
EPCC, University of Edinburgh

Overview
- Components
- History
- Flynn's Taxonomy
  - SIMD
  - MIMD
- Classification via Memory
  - Distributed Memory
  - Shared Memory
  - Clusters
- Summary

Building Blocks of Parallel Machines
- Processors: to calculate
- Memory: for temporary storage of data
- Interconnect: so processors can talk to each other and the outside world
- Storage: disks and tapes for long-term archiving of data
These are the basic components, but how do we put them together?

Processors
- Most are RISC architecture (Reduced Instruction Set Computer)
  - simplify instructions to maximise speed
- Calculations performed on values in registers
  - separate integer and floating point
  - loading and storing from memory must be done explicitly
  - a = b + c is not an atomic operation: it involves 2 loads, an addition and a store (see the sketch below)
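As an illustration of that last point, here is a minimal C sketch; the commented mnemonics are generic RISC-style pseudo-assembly for illustration only, not any particular instruction set:

    #include <stdio.h>

    int main(void)
    {
        double b = 1.0, c = 2.0;

        /* One source line, several machine instructions
         * (generic RISC-style mnemonics, illustrative only):
         *   load  r1, [b]     ; load b from memory into a register
         *   load  r2, [c]     ; load c from memory into a register
         *   add   r3, r1, r2  ; add the two register values
         *   store r3, [a]     ; write the result back to memory
         */
        double a = b + c;

        printf("a = %f\n", a);
        return 0;
    }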

Clock Speed
- Rate at which instructions are issued
  - modern chips are around 2-3 GHz
  - integer and floating point calculations done in parallel
  - can also have multiple issue, e.g. simultaneous add and multiply
- Whole series of hardware innovations
  - pipelining, out-of-order execution, speculative computation, ...
- Details become important for top performance
  - most features are fairly generic

Moore's Law
- CPU power doubles every 24 months
  - strictly speaking, applies to transistor density
- Held true for ~35 years
  - now maybe self-fulfilling?
- People have predicted its demise many times, but it hasn't happened yet
- Increases in power are due to increases in parallelism as well as in clock rate
  - fine-grain parallelism (pipelining)
  - medium-grain parallelism (hardware multithreading)
  - coarse-grain parallelism (multiple processors on a chip)
- First two seem to be (almost) exhausted: the main trend is now towards multicore

Memory
- Memory speed is often the limiting factor for HPC applications
  - keeping the CPU fed with data is the key to performance
- Memory is a substantial contributor to the cost of systems
  - typical HPC systems have a few Gbytes of memory per processor
  - technically possible to have much more than this, but it is too expensive and power-hungry
- Basic characteristics
  - latency: how long you have to wait for data to arrive
  - bandwidth: how fast it actually comes in
  - ballpark figures: 100s of nanoseconds and a few Gbytes/s

Cache Memory
- Memory latencies are very long
  - 100s of processor cycles
  - fetching data from main memory is 2 orders of magnitude slower than doing arithmetic
- Solution: introduce cache memory
  - much faster than main memory... but much smaller than main memory
  - keeps copies of recently used data
- Modern systems use a hierarchy of two or three levels of cache
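To see the cache at work, here is a small C sketch (my illustration, not from the slides; the array size and stride are arbitrary choices). Both passes sum exactly the same elements, but the stride-1 pass reuses every cache line it fetches while the strided pass wastes most of each line:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* 16M doubles (128 MB), far larger than any cache */
    #define STRIDE 16     /* 16 doubles = 128 bytes, bigger than a cache line */

    /* sum all n elements, visiting them with the given stride */
    static double sum(const double *x, size_t n, size_t stride)
    {
        double s = 0.0;
        for (size_t start = 0; start < stride; start++)
            for (size_t i = start; i < n; i += stride)
                s += x[i];
        return s;
    }

    int main(void)
    {
        double *x = malloc(N * sizeof *x);
        for (size_t i = 0; i < N; i++) x[i] = 1.0;

        clock_t t0 = clock();
        double s1 = sum(x, N, 1);       /* stride 1: each cache line fully reused */
        clock_t t1 = clock();
        double s2 = sum(x, N, STRIDE);  /* strided: a new line fetched per element */
        clock_t t2 = clock();

        printf("stride 1:  sum = %g, %.3f s\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("stride %d: sum = %g, %.3f s\n", STRIDE, s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(x);
        return 0;
    }

The arithmetic is identical in both passes, so any difference in runtime comes from the memory system.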

Memory Hierarchy

    Level           Latency       Capacity
    CPU registers   1 cycle       ~1 KB
    L1 cache        2-3 cycles    ~100 KB
    L2 cache        ~20 cycles    ~1-10 MB
    L3 cache        ~50 cycles    ~10-50 MB
    Main memory     ~300 cycles   ~1 GB

Speed (and cost) decrease, and capacity increases, moving down the hierarchy.

Serial v Parallel Computers
- Serial computers are easier to program than parallel computers, but there are limits on single-processor performance
  - physical: speed of light, uncertainty principle
  - practical: design, manufacture
- Parallel computers dominate HPC
  - they allow the highest performance
  - they are more cost-effective
- Achieving good performance requires high-quality algorithms, decomposition and programming

Flynn's Taxonomy
- Classification of architectures by instruction stream and data stream
- SISD: Single Instruction Single Data
  - serial machines
- MISD: Multiple Instructions Single Data
  - (probably) no real examples
- SIMD: Single Instruction Multiple Data
- MIMD: Multiple Instructions Multiple Data

SIMD Architecture
- Single Instruction Multiple Data
- Every processor synchronously executes the same instructions on different data
  - instructions issued by a front-end
- Each processor has its own memory where it keeps its data
- Processors can communicate with each other
- Usually thousands of simple processors
- Examples: DAP, MasPar, CM200

SIMD Architecture
[diagram: front-end issuing instructions to an array of processing elements over a network, with peripherals attached]

MIMD Architecture
- Multiple Instructions Multiple Data
- Several independent processors capable of executing separate programs
- Subdivision by the relationship between processors and memory

Distributed Memory
- MIMD-DM: each processor has its own local memory
- Processors connected by some interconnect mechanism
- Processors communicate via explicit message passing
  - effectively sending emails to each other
- Highly scalable architecture
  - allows Massively Parallel Processing (MPP)
- Examples: Cray XE, IBM BlueGene, workstation/PC clusters (Beowulf)

Distributed Memory
[diagram: processors, each with its own local memory, connected by an interconnect]

Distributed Memory
- Processors behave like distinct workstations
  - each runs its own copy of the operating system
  - no interaction except via the interconnect
- Pros
  - adding processors increases memory bandwidth
  - can grow to almost any size
- Cons
  - scalability relies on a good interconnect
  - jobs are placed by the user and remain on the same processors
  - potential for high system-management overhead

Shared Memory
- MIMD-SM: each processor has access to a global memory store
- Communication via writes/reads to memory
  - caches are automatically kept up-to-date, or coherent
- Simple to program (no explicit communications)
- Scaling is difficult because of the memory-access bottleneck
- Usually modest numbers of processors
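OpenMP is not named on the slide, but it is the usual shared-memory programming model; here is a minimal C sketch of what "no explicit communications" means in practice (the array size and contents are arbitrary choices of mine):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        /* The runtime splits the iterations across the cores; a, b and c
         * live in the global memory store and are shared by all threads,
         * so no messages are needed. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        printf("up to %d threads, a[42] = %f\n", omp_get_max_threads(), a[42]);
        return 0;
    }

Compile with an OpenMP flag, e.g. cc -fopenmp.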

Symmetric Multiprocessing
- Each processor in an SMP has equal access to all parts of memory
  - same latency and bandwidth
[diagram: processors attached to a shared memory via a common bus]
- Examples: IBM servers, Sun HPC servers, multicore PCs

Shared Memory
- Looks like a single machine to the user
  - a single operating system covers all the processors
  - the OS automatically moves jobs around the CPU cores
- Pros
  - simple to use and maintain
  - CC-NUMA architectures allow scaling to 100s of CPUs
- Cons
  - potential problems with simultaneous access to memory
  - sophisticated hardware required to maintain cache coherency
  - scalability ultimately limited by this

Shared Memory Cluster
[diagram: shared-memory nodes connected by an interconnect]

Shared emory Clusters Combine features of two architectures shared-memory within a node distributed memory between nodes ros constructed as a standard distributed memory machine but with more powerful nodes Cons may be hard to take advantage of mixed architecture more complicated to understand performance combination of interconnect and memory system behaviour Examples clusters of Intel servers, Bull machines, all modern C clusters odern HC Architectures 23 HECToR: Cray XE6 Built from 16-core AD Interlagos CUs each a mini 16-way S with internal bus odern HC Architectures 24 12

- A bespoke Cray interconnect
  - essentially a high-end SMP cluster
  - the network is a 3D torus, not a switch
[diagram: compute node with 6.4 GB/sec direct-connect HyperTransport, 2-8 GB main memory, Cray SeaStar2+ interconnect, 12.8 GB/sec direct-connect memory (DDR 800)]

HECToR System Specifications (cont.)
- Cray XE6 parallel processors
- 2816 compute nodes, each containing two AMD 2.3 GHz 16-core Opteron processors => 90,112 cores
- Theoretical peak of 827 Tflops
- 32 GB main memory per node, shared between 32 cores => total memory of 90 TB
- 10 login nodes
- Gemini interconnect
- 12 IO nodes

Summary
- Flynn's taxonomy looks somewhat dated
  - SIMD likely to remain a niche market
- Wide variety of memory architectures for MIMD
  - need to sub-classify by memory
- Many parallel systems are based on commodity microprocessors, or clusters of SMPs
  - providing leverage with commercial products
- Parallel architectures appear to be the present and future of HPC

Message Passing Model
- The message passing model is based on the notion of processes
  - can think of a process as an instance of a running program, together with the program's data
- In the message passing model, parallelism is achieved by having many processes co-operate on the same task
- Each process has access only to its own data
- Processes communicate with each other by sending and receiving messages
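The slides do not name a library, but MPI is the standard realisation of this model; here is a minimal C sketch of the two-process exchange shown on the Process Communication slide that follows (MPI ranks are numbered from 0, so ranks 0 and 1 play the roles of processes 1 and 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, a;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            a = 23;
            MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* Send(2, a) */
        } else if (rank == 1) {
            int b;
            MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                         /* Recv(1, b) */
            a = b + 1;                                           /* a is now 24 */
            printf("process 1 received %d, a = %d\n", b, a);
        }

        MPI_Finalize();
        return 0;
    }

Run on two processes, e.g. mpirun -n 2 ./a.out.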

Process Communication

    Process 1        Process 2
    ---------        ---------
    a = 23           Recv(1, b)    (b receives 23)
    Send(2, a)       a = b + 1     (a is now 24)

Quantifying Performance
- Serial computing is concerned with complexity
  - how execution time varies with problem size N
  - adding two arrays (or vectors) is O(N)
  - matrix times vector is O(N^2), matrix-matrix is O(N^3)
- Look for clever algorithms
  - naive sort is O(N^2)
  - divide-and-conquer approaches are O(N log N) (see the sketch after this slide)
- Parallel computing is also concerned with scaling
  - how time varies with the number of processors
  - different algorithms can have different scaling behaviour
  - but always remember that we are interested in minimum time!
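To make the complexity gap concrete, here is a small C comparison (my illustration, not from the slides; N = 20000 is an arbitrary choice) of a naive O(N^2) selection sort against the C library's divide-and-conquer-style qsort:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 20000

    /* naive O(N^2) sort: scan for the minimum, swap, repeat */
    static void selection_sort(double *x, int n)
    {
        for (int i = 0; i < n - 1; i++) {
            int min = i;
            for (int j = i + 1; j < n; j++)
                if (x[j] < x[min]) min = j;
            double t = x[i]; x[i] = x[min]; x[min] = t;
        }
    }

    static int cmp(const void *p, const void *q)
    {
        double a = *(const double *)p, b = *(const double *)q;
        return (a > b) - (a < b);
    }

    int main(void)
    {
        double *x = malloc(N * sizeof *x), *y = malloc(N * sizeof *y);
        for (int i = 0; i < N; i++) x[i] = y[i] = (double)rand() / RAND_MAX;

        clock_t t0 = clock();
        selection_sort(x, N);            /* O(N^2) */
        clock_t t1 = clock();
        qsort(y, N, sizeof *y, cmp);     /* O(N log N) on average */
        clock_t t2 = clock();

        printf("selection sort: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("qsort:          %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(x); free(y);
        return 0;
    }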

Performance Measures
- T(N,P) is the time for problem size N on P processors
- Speedup: S(N,P) = T(N,1) / T(N,P)
  - typically S(N,P) < P
- Parallel efficiency: E(N,P) = S(N,P) / P
  - typically E(N,P) < 1
- Serial efficiency: E(N) = Tbest(N) / T(N,1), where Tbest is the time of the best serial algorithm
  - typically E(N) <= 1
- e.g. T(N,1) = 100 s and T(N,16) = 8 s give S(N,16) = 12.5 and E(N,16) = 78%

The Serial Component
- Amdahl's law: "the performance improvement to be gained by parallelisation is limited by the proportion of the code which is serial"
  - Gene Amdahl, 1967

Amdahl's Law
- Assume a fraction alpha of the work is completely serial
- Time is the sum of the serial and potentially parallel parts:
    T(N,1) = alpha T(N,1) + (1 - alpha) T(N,1)
- Parallel time, with the parallel part 100% efficient:
    T(N,P) = alpha T(N,1) + (1 - alpha) T(N,1) / P
- Parallel speedup:
    S(N,P) = T(N,1) / T(N,P) = P / (alpha P + 1 - alpha)
- For alpha = 0, S = P as expected (i.e. E = 100%)
- Otherwise, speedup is limited by S < 1/alpha for any P
  - impossible to effectively utilise large parallel machines?

Gustafson's Law
- Need larger problems for larger numbers of CPUs
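To make the ceiling concrete, a short C sketch (my illustration) evaluating the speedup formula just derived, for an assumed serial fraction alpha = 0.05:

    #include <stdio.h>

    /* Amdahl's law: S(P) = P / (alpha*P + 1 - alpha) */
    static double amdahl(double alpha, double p)
    {
        return p / (alpha * p + 1.0 - alpha);
    }

    int main(void)
    {
        const double alpha = 0.05;    /* assumed 5% serial fraction */

        for (int p = 1; p <= 4096; p *= 4) {
            double s = amdahl(alpha, p);
            printf("P = %5d   S = %6.2f   E = %5.1f%%\n", p, s, 100.0 * s / p);
        }
        /* S approaches 1/alpha = 20 however many processors we add */
        return 0;
    }

However many processors are added, the speedup saturates at 1/alpha = 20, which is exactly why larger problems are needed.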

Utilising Large Parallel Machines
- Assume the parallel part is O(N) and the serial part is O(1)
- Time:
    T(N,P) = alpha T(1,1) + (1 - alpha) N T(1,1) / P
- Speedup:
    S(N,P) = (alpha + (1 - alpha) N) / (alpha + (1 - alpha) N / P)
- Scale the problem size with the number of CPUs, i.e. set N = P:
  - speedup: S(P,P) = alpha + (1 - alpha) P
  - efficiency: E(P,P) = alpha / P + (1 - alpha)
- Maintain constant efficiency, (1 - alpha), for large P

Scaling
[graph: real speed-up vs number of PEs, from 0 to 300, comparing the linear ideal with actual measured speed-up]
- Improving the load balance / algorithm moves the turn-over to a higher number of processors
- Better scaling = the ability to utilise larger computers
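For comparison with the Amdahl sketch, the scaled-size speedup from the derivation above, for the same assumed alpha = 0.05:

    #include <stdio.h>

    /* Gustafson's law with N = P: S(P) = alpha + (1 - alpha) * P,
     * so efficiency tends to (1 - alpha) instead of collapsing. */
    int main(void)
    {
        const double alpha = 0.05;    /* same assumed serial fraction */

        for (int p = 1; p <= 4096; p *= 4) {
            double s = alpha + (1.0 - alpha) * p;
            printf("P = %5d   S = %8.2f   E = %5.1f%%\n", p, s, 100.0 * s / p);
        }
        return 0;
    }

Here the efficiency settles at (1 - alpha) = 95% rather than falling towards zero as it does under Amdahl's law with fixed problem size.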

Summary
- Useful definitions
  - speed-up
  - efficiency
- Amdahl's Law: the performance improvement to be gained by parallelisation is limited by the proportion of the code which is serial
- Gustafson's Law: to maintain constant efficiency we need to scale the problem size with the number of CPUs