Parallel Computing Architectures


Parallel Computing Architectures Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/


An Abstract Parallel Architecture [Figure: several processors and several memories connected through an interconnect network] How is parallelism handled? What is the exact physical location of the memories? What is the topology of the interconnect network? 3

Why are parallel architectures important? There is no "typical" parallel computer: different vendors use different architectures. There is currently no universal programming paradigm that fits all architectures. Parallel programs must be tailored to the underlying parallel architecture; the architecture of a parallel computer limits the choice of the programming paradigm that can be used. 4

Von Neumann architecture and its extensions 5

Von Neumann architecture [Figure: Processor (CPU), Memory and I/O subsystem connected by the System bus] 6

Von Neumann architecture [Figure: internals of the CPU: general-purpose registers R0 ... Rn-1, ALU, PC, IR, PSW and control unit, connected to the memory through the data, address and control buses] 7

The Fetch-Decode-Execute cycle The CPU performs an infinite loop. Fetch: fetch the opcode of the next instruction from the memory address stored in the PC register, and put the opcode in the IR register. Decode: the content of the IR register is analyzed to identify the instruction to execute. Execute: the control unit activates the appropriate functional units of the CPU to perform the actions required by the instruction (e.g., read values from memory, execute arithmetic computations, and so on). 8
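
The slides describe this cycle in hardware terms; purely as an illustrative aside (not part of the original material), the same loop can be sketched in software as a toy interpreter. The opcodes and the instruction format below are made up for the example.

    /* Toy fetch-decode-execute loop (illustrative only, not a real ISA). */
    #include <stdio.h>

    enum { OP_LOAD = 0, OP_ADD = 1, OP_PRINT = 2, OP_HALT = 3 };

    int main(void)
    {
        /* "memory": each instruction is an opcode followed by one operand */
        int memory[] = { OP_LOAD, 5, OP_ADD, 7, OP_PRINT, 0, OP_HALT, 0 };
        int pc  = 0;   /* program counter */
        int acc = 0;   /* accumulator register */

        for (;;) {
            int ir      = memory[pc];       /* fetch: read opcode at address PC */
            int operand = memory[pc + 1];
            pc += 2;
            switch (ir) {                   /* decode + execute */
            case OP_LOAD:  acc = operand;        break;
            case OP_ADD:   acc += operand;       break;
            case OP_PRINT: printf("%d\n", acc);  break;
            case OP_HALT:  return 0;
            }
        }
    }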

How to limit the bottlenecks of the Von Neumann architecture Reduce the memory access latency: rely on CPU registers whenever possible; use caches. Hide the memory access latency: multithreading and context switches during memory accesses. Execute multiple instructions at the same time: pipelining, multiple issue, branch prediction, speculative execution, SIMD extensions. 9

CPU times compared to the real world (actual latency, and the same latency scaled as if one CPU cycle took one second):

    1 CPU cycle                0.3 ns       1 s
    Level 1 cache access       0.9 ns       3 s
    Main memory access         120 ns       6 min
    Solid-state disk I/O       50-150 μs    2-6 days
    Rotational disk I/O        1-10 ms      1-12 months
    Internet: SF to NYC        40 ms        4 years
    Internet: SF to UK         81 ms        8 years
    Internet: SF to Australia  183 ms       19 years
    Physical system reboot     1 m          6 millennia

Source: https://blog.codinghorror.com/the-infinite-space-between-words/

Caching 11

Cache hierarchy Large memories are slow; fast memories are small. [Figure: CPU -> L1 Cache -> L2 Cache -> L3 Cache -> (possible interconnect bus) -> DRAM] 12

Cache hierarchy of the AMD Bulldozer architecture Source: http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29 13

CUDA memory hierarchy [Figure: each thread has its own registers and local memory; threads within a block share a per-block shared memory; all blocks access the global, constant and texture memories] 14

How the cache works Cache memory is very fast, often located inside the processor chip, and expensive, hence very small compared to system memory. The cache contains a copy of the content of some recently accessed (main) memory locations. If the same memory location is accessed again and its content is still in cache, then the cache is accessed instead of the system RAM. 15

How the cache works If the CPU accesses a memory location whose content is not already in cache, the content of that memory location and "some" adjacent locations are copied into the cache; in doing so, it might be necessary to evict some old data from the cache to make room for the new data. The smallest unit of data that can be transferred to/from the cache is the cache line. On Intel processors, usually 1 cache line = 64 bytes. 16

Example [Figure sequence, slides 17-22: the CPU requests a data item from RAM; the entire cache line containing that item is copied from RAM into the cache, so the requested item and its neighbouring locations can then be served directly from the cache]

Spatial and temporal locality Cache memory works well when applications exhibit spatial and/or temporal locality. Spatial locality: accessing adjacent memory locations is OK (cache-friendly). Temporal locality: repeatedly accessing the same memory location(s) is OK (cache-friendly). 23

Example: matrix-matrix product Given two square matrices p, q, compute r = p q. [Figure: element r[i][j] is computed from row i of p and column j of q]

    void matmul( double *p, double *q, double *r, int n )
    {
        int i, j, k;
        for (i=0; i<n; i++) {
            for (j=0; j<n; j++) {
                r[i*n + j] = 0.0;
                for (k=0; k<n; k++) {
                    r[i*n + j] += p[i*n + k] * q[k*n + j];
                }
            }
        }
    }
24

Matrix representation Matrices in C are stored in row-major order: the elements of each row are contiguous in memory, and adjacent rows are contiguous in memory. [Figure: a 4x4 matrix p laid out in memory one row after another, with rows starting at p[0][0], p[1][0], p[2][0], p[3][0]] 25

Row-wise vs column-wise access Row-wise access is OK: the data is contiguous in memory, so the cache helps (spatial locality). Column-wise access is NOT OK: the accessed elements are not contiguous in memory (strided access), so the cache does NOT help. 26
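
As an illustration (not part of the original slides), the two access patterns can be compared with a sketch like the following; both loops compute the same sum, but on most machines the row-wise version runs noticeably faster because of the cache.

    /* Row-wise vs column-wise traversal of an n x n matrix (illustrative sketch). */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int n = 2000;
        double *m = malloc(n * n * sizeof(*m));
        double sum = 0.0;
        int i, j;
        for (i = 0; i < n*n; i++) m[i] = 1.0;

        /* row-wise: consecutive iterations touch adjacent addresses (stride 1) */
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                sum += m[i*n + j];

        /* column-wise: consecutive iterations are n*sizeof(double) bytes apart */
        for (j = 0; j < n; j++)
            for (i = 0; i < n; i++)
                sum += m[i*n + j];

        printf("%f\n", sum);
        free(m);
        return 0;
    }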

Matrix-matrix product Given two square matrices p, q, compute r = p q. [Figure: computing r[i][j] traverses row i of p (row-wise access, OK) and column j of q (column-wise access, NOT OK)] 27

Optimizing the memory access pattern [Figures, slides 28-29: in r = p q the matrix q is accessed column-wise; by first transposing q into qt, both p and qt are traversed row-wise while computing r]

But... Transposing the matrix q requires time. Do we gain some advantage in doing so? See cache.c 30
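
The file cache.c itself is not reproduced in this transcript; the following is only a hypothetical sketch of the transposed variant it presumably benchmarks (the names matmul_transposed and qt are illustrative, not from the slides).

    #include <stdlib.h>   /* malloc, free */

    /* Sketch: transpose q once, so the innermost loop scans both operands row-wise. */
    void matmul_transposed( double *p, double *q, double *r, int n )
    {
        int i, j, k;
        double *qt = malloc(n * n * sizeof(*qt));
        for (i = 0; i < n; i++)           /* qt = transpose of q */
            for (j = 0; j < n; j++)
                qt[j*n + i] = q[i*n + j];
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++) {
                double s = 0.0;
                for (k = 0; k < n; k++)   /* both accesses now have stride 1 */
                    s += p[i*n + k] * qt[j*n + k];
                r[i*n + j] = s;
            }
        }
        free(qt);
    }

Whether the transposition pays off depends on the matrix size: the extra O(n^2) work is amortized over the O(n^3) multiply when n is large enough.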

Instruction-Level Parallelism (ILP) 31

Instruction-Level Parallelism Uses multiple functional units to increase the performance of a processor Pipelining: the functional units are organized like an assembly line, and can be used strictly in that order Multiple issue: the functional units can be used whenever required Static multiple issue: the order in which functional units are activated is decided at compile time (example: Intel IA64) Dynamic multiple issue (superscalar): the order in which functional units are activated is decided at run time 32

Instruction-Level Parallelism [Figures: (left) Pipelining: the five pipeline stages IF (Instruction Fetch), ID (Instruction Decode), EX (Execute), MEM (Memory Access), WB (Write Back); (right) Multiple Issue: an Instruction Fetch and Decode Unit feeding Integer, Floating Point and Load/Store functional units plus a Commit Unit: in-order issue, out-of-order execute, in-order commit] 33

Pipelining [Figure: a stream of instructions (Instr1 ... Instr5) flowing through the IF, ID, EX, MEM, WB pipeline stages; at any given time each stage works on a different instruction] 34

Control Dependency

    z = x + y;
    if ( z > 0 ) {
        w = x;
    } else {
        w = y;
    }

The instructions w = x and w = y have a control dependency on z > 0. Control dependencies can limit the performance of pipelined architectures. The branch can be removed by rewriting the code as:

    z = x + y;
    c = z > 0;
    w = x*c + y*(1-c);

35

In the real world... From the GCC documentation, http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

Built-in Function: long __builtin_expect (long exp, long c)
You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect. The return value is the value of exp, which should be an integral expression. The semantics of the built-in are that it is expected that exp == c. For example:

    if (__builtin_expect (x, 0))
        foo ();

36

Branch Hint: Example

    #include <stdlib.h>

    int main( void )
    {
        int A[1000000];
        size_t i;
        const size_t n = sizeof(A) / sizeof(A[0]);
        for ( i=0; __builtin_expect( i<n, 1 ); i++ ) {
            A[i] = i;
        }
        return 0;
    }

Refrain from this kind of micro-optimization: ideally, this is stuff for compiler writers. 37

Hardware multithreading Allows the CPU to switch to another task when the current task is stalled. Fine-grained multithreading: a context switch has essentially zero cost; the CPU can switch to another task even on stalls of short duration (e.g., waiting for memory operations); requires a CPU with specific support, e.g., the Cray XMT. Coarse-grained multithreading: a context switch has non-negligible cost, and is appropriate only for threads blocked on I/O operations or similar; the CPU is less efficient in the presence of stalls of short duration. 38

Source: https://www.slideshare.net/jasonriedy/sting-a-framework-for-analyzing-spaciotemporal-interaction-networks-and-graphs

Hardware multithreading Simultaneous multithreading (SMT) is an implementation of fine-grained multithreading where different threads can use multiple functional units at the same time. HyperThreading is Intel's implementation of SMT: each physical processor core is seen by the Operating System as two "logical" processors. Each logical processor maintains a complete set of the architecture state: general-purpose registers, control registers, advanced programmable interrupt controller (APIC) registers, and some machine state registers. Intel claims that HT provides a 15-30% speedup with respect to a similar, non-HT processor. 40

HyperThreading [Figure: the IF, ID, EX, MEM and WB pipeline stages are separated by per-thread queues] The pipeline stages are separated by two buffers (one for each executing thread). If one thread is stalled, the other one might go ahead and fill the pipeline slot, resulting in higher processor utilization. Source: Hyper-Threading Technology Architecture and Microarchitecture, http://www.cs.virginia.edu/~mc2zk/cs451/vol6iss1_art01.pdf 41

HyperThreading See the output of lscpu ("Thread(s) per core") or lstopo (from the hwloc Ubuntu/Debian package). [Figures: lstopo output for a processor without HT and for a processor with HT]


Flynn's Taxonomy classifies architectures according to the number of instruction streams and data streams:
    SISD: Single Instruction Stream, Single Data Stream (the Von Neumann architecture)
    SIMD: Single Instruction Stream, Multiple Data Streams
    MISD: Multiple Instruction Streams, Single Data Stream
    MIMD: Multiple Instruction Streams, Multiple Data Streams
44

SIMD SIMD instructions apply the same operation (e.g., sum, product, ...) to multiple elements (typically 4 or 8, depending on the width of the SIMD registers and the data types of the operands). This means that there must be 4/8/... independent ALUs. Example, with time flowing downwards:

    LOAD A[0]           LOAD A[1]           LOAD A[2]           LOAD A[3]
    LOAD B[0]           LOAD B[1]           LOAD B[2]           LOAD B[3]
    C[0] = A[0] + B[0]  C[1] = A[1] + B[1]  C[2] = A[2] + B[2]  C[3] = A[3] + B[3]
    STORE C[0]          STORE C[1]          STORE C[2]          STORE C[3]
45
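
As an aside (not from the slides), compilers can often generate such SIMD instructions automatically. A loop like the following, a hypothetical vec_add, is a typical candidate for auto-vectorization (e.g., with gcc -O3): each generated SIMD instruction then processes 4 or 8 elements at once.

    /* Element-wise sum; simple, independent iterations vectorize well. */
    void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }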

SSE Streaming SIMD Extensions Extension to the x86 instruction set. Provides new SIMD instructions operating on small arrays of integer or floating-point numbers. Introduces 8 new 128-bit registers (XMM0-XMM7). SSE2 instructions can handle 2 64-bit doubles, or 2 64-bit integers, or 4 32-bit integers, or 8 16-bit integers, or 16 8-bit chars. [Figure: the 128-bit registers XMM0 ... XMM7, each holding e.g. four 32-bit lanes] 46

SSE (Streaming SIMD Extensions) [Figure: four 32-bit lanes; the same operation is applied element-wise to X3..X0 and Y3..Y0, producing X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0] 47

Example

    __m128 a  = _mm_set_ps( 1.0, 2.0, 3.0, 4.0 );
    __m128 b  = _mm_set_ps( 2.0, 4.0, 6.0, 8.0 );
    __m128 ab = _mm_mul_ps(a, b);

    a  = [ 1.0   2.0   3.0   4.0 ]
    b  = [ 2.0   4.0   6.0   8.0 ]
    ab = [ 2.0   8.0  18.0  32.0 ]
48
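
For completeness, a self-contained version of the snippet above (my addition, not from the slides) might look as follows; it assumes an x86 processor with SSE and uses _mm_storeu_ps to copy the result back to ordinary memory.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_set_ps, _mm_mul_ps */

    int main(void)
    {
        __m128 a  = _mm_set_ps( 1.0f, 2.0f, 3.0f, 4.0f );
        __m128 b  = _mm_set_ps( 2.0f, 4.0f, 6.0f, 8.0f );
        __m128 ab = _mm_mul_ps(a, b);   /* one instruction, four products */
        float out[4];
        _mm_storeu_ps(out, ab);
        /* _mm_set_ps lists elements from the highest lane to the lowest,
           so out[0]..out[3] hold 32.0 18.0 8.0 2.0 */
        printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }

Compile with e.g. gcc -msse (SSE is enabled by default on x86-64).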

GPU Modern GPUs (Graphics Processing Units) have a large number of cores, and can be regarded as a form of SIMD architecture. [Figure: the Fermi GPU chip; source: http://www.legitreviews.com/article/1100/1/] 49

CPU vs GPU The difference between CPUs and GPUs can be appreciated by looking at how the chip surface is used. [Figure: a CPU devotes much of its area to control logic and cache, with a few ALUs and a DRAM controller; a GPU devotes most of its area to many small ALUs, with little control logic and cache, plus a DRAM controller] 50

GPU core A single GPU core contains a fetch/decode unit shared among multiple ALUs. [Figure: one Fetch/Decode unit, 8 ALUs, and several execution contexts (Ctx)] If there are 8 ALUs, each instruction can operate on 8 values simultaneously. Each GPU core maintains multiple execution contexts, and can switch between them at virtually zero cost (fine-grained parallelism). 51

GPU Example: 12 instruction streams × 8 ALUs = 96 operations in parallel 52

MIMD In MIMD systems there are multiple execution units that can execute multiple sequences of instructions (Multiple Instruction Streams). Each execution unit generally operates on its own input data (Multiple Data Streams). Example of four independent instruction streams, with time flowing downwards:

    LOAD A[0]            CALL F()     a = 18           w = 7
    LOAD B[0]            z = 8        b = 9            t = 13
    C[0] = A[0] + B[0]   y = 1.7      if (a>b) c = 7   k = G(w,t)
    STORE C[0]           z = x + y    a = a - 1        k = k + 1
53

MIMD architectures Shared Memory: a set of processors sharing a common memory space; each processor can access any memory location. [Figure: several CPUs connected to a single shared memory through an interconnect] Distributed Memory: a set of compute nodes connected through an interconnection network; the simplest example is a cluster of PCs connected via Ethernet; nodes can share data through explicit communications. [Figure: several CPUs, each with its own local memory, connected through an interconnect] 54
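
The slides show no code at this point; purely as an illustrative sketch, a shared-memory MIMD machine is typically programmed with threads that all access the same data, e.g. with OpenMP (compile with gcc -fopenmp):

    #include <stdio.h>
    #include <omp.h>
    #define N 1000000

    int main(void)
    {
        static double v[N];
        double sum = 0.0;
        int i;
        for (i = 0; i < N; i++) v[i] = 1.0;
        /* all threads see the same array v; each sums a portion of it,
           and the partial sums are combined by the reduction clause */
    #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++) {
            sum += v[i];
        }
        printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
        return 0;
    }

On a distributed-memory system the same computation would instead be split across processes, each owning part of v, with the partial sums combined through explicit messages (e.g., MPI).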

Hybrid architectures Many HPC systems are based on hybrid architectures: each compute node is a shared-memory multiprocessor, and a large number of compute nodes is connected through an interconnect network. [Figure: compute nodes, each with several CPUs, GPUs and a local memory, connected through an interconnect] 55
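
Again as a hypothetical sketch (not from the slides), hybrid machines are often programmed with one MPI process per node and several OpenMP threads per process, mirroring the node/core structure. Compile with e.g. mpicc -fopenmp and run with mpirun.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process (node) am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total */
    #pragma omp parallel
        {
            /* each thread corresponds to one core of the shared-memory node */
            printf("process %d of %d, thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }
        MPI_Finalize();
        return 0;
    }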

Shared memory example Intel Core i7 AMD Istanbul 56

Distributed memory system IBM BlueGene/Q @ CINECA

    Architecture:       10 BGQ Frames
    Model:              IBM-BG/Q
    Processor Type:     IBM PowerA2, 1.6 GHz
    Computing Cores:    163840
    Computing Nodes:    10240
    RAM:                1 GByte / core
    Internal Network:   5D Torus
    Disk Space:         2 PByte of scratch space
    Peak Performance:   2 PFlop/s
57



www.top500.org (June 2017) 61

SANDIA ASCI RED: Date: 1996; Peak performance: 1.8 Teraflops; Floor space: 150 m2; Power consumption: 800,000 Watt.
Sony PLAYSTATION 3: Date: 2006; Peak performance: >1.8 Teraflops; Floor space: 0.08 m2; Power consumption: <200 Watt. 63

Inside SONY's PS3 Cell Broadband Engine 64

Empirical rules When writing parallel applications (especially on distributed-memory architectures) keep in mind that: Computation is fast Communication is slow Input/output is incredibly slow 65

Recap
Shared memory
    Advantages: easier to program; useful for applications with irregular data access patterns (e.g., graph algorithms)
    Disadvantages: the programmer must take care of race conditions; limited memory bandwidth
Distributed memory
    Advantages: highly scalable, provides very high computational power by adding more nodes; useful for applications with strong locality of reference and a high computation/communication ratio
    Disadvantages: latency of the interconnect network; more difficult to program
66