POLITECNICO DI MILANO. Advanced Topics on Heterogeneous System Architectures! Multiprocessors


POLITECNICO DI MILANO
Advanced Topics on Heterogeneous System Architectures: Multiprocessors
Seminar Room, Bld 20, 30 November 2017
Antonio Miele, Marco Santambrogio - Politecnico di Milano

Outline
- Multiprocessors
- Flynn taxonomy
- SIMD architectures
- Vector architectures
- MIMD architectures
- A real-life example
- What's next

Supercomputers
Definitions of a supercomputer:
- Fastest machine in the world at a given task
- A device to turn a compute-bound problem into an I/O-bound problem
- Any machine costing $30M+
- Any machine designed by Seymour Cray
The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.

The Cray XD1 example
The XD1 uses AMD Opteron 64-bit CPUs and incorporates Xilinx Virtex-II FPGAs.
Performance gains from the FPGAs:
- RC5 cipher breaking: 1000x faster than a 2.4 GHz P4
- Elliptic-curve cryptography: 895-1300x faster than a 1 GHz P3
- Vehicular traffic simulation: 300x faster on an XC2V6000, and 650x faster on an XC2VP100, than a 1.7 GHz Xeon
- Smith-Waterman DNA matching: 28x faster than a 2.4 GHz Opteron

Supercomputer Applications
Typical application areas:
- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car-crash simulation)
All involve huge computations on large data sets.
In the 70s-80s, "supercomputer" meant "vector machine".

Parallel Architectures
Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast" (Almasi and Gottlieb, Highly Parallel Computing, 1989).
The aim is to replicate processors to add performance, rather than design a faster processor.
Parallel architecture extends traditional computer architecture with a communication architecture:
- abstractions (HW/SW interface)
- different structures to realize the abstraction efficiently

Beyond ILP
ILP architectures (superscalar, VLIW...):
- support fine-grained, instruction-level parallelism;
- fail to support large-scale parallel systems.
Multiple-issue CPUs are very complex, and the returns from extracting greater parallelism are diminishing, so extracting parallelism at higher levels becomes more and more attractive.
A further step: process- and thread-level parallel architectures.
To achieve ever greater performance: connect multiple microprocessors in a complex system.

Beyond ILP
Most recent microprocessor chips are multiprocessors on chip: Intel i5 and i7, IBM Power 8, Sun Niagara.
The major difficulty in exploiting parallelism in multiprocessors is suitable software. It is being (at least partially) overcome, in particular for servers and for embedded applications that exhibit natural parallelism without the need to rewrite large software chunks.

Flynn Taxonomy (1966)
- SISD - Single Instruction, Single Data: uniprocessor systems
- MISD - Multiple Instruction, Single Data: no practical configuration and no commercial systems
- SIMD - Single Instruction, Multiple Data: simple programming model, low overhead, flexibility, custom integrated circuits
- MIMD - Multiple Instruction, Multiple Data: scalable, fault tolerant, off-the-shelf micros

SISD
A serial (non-parallel) computer.
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
- Single data: only one data stream is being used as input during any one clock cycle.
- Deterministic execution.
This is the oldest and, even today, the most common type of computer.

SIMD
A type of parallel computer.
- Single instruction: all processing units execute the same instruction at any given clock cycle.
- Multiple data: each processing unit can operate on a different data element.
Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.

MISD A single data stream is fed into multiple processing units. Each processing unit operates on the data independently via independent instruction streams. 13

MIMD
Nowadays the most common type of parallel computer.
- Multiple instruction: every processor may be executing a different instruction stream.
- Multiple data: every processor may be working with a different data stream.
Execution can be synchronous or asynchronous, deterministic or non-deterministic.

Which kind of multiprocessors?
Many of the early multiprocessors were SIMD. The SIMD model received great attention in the 80s; today it is applied only in very specific instances (vector processors, multimedia instructions). MIMD has emerged as the architecture of choice for general-purpose multiprocessors.
Let's look at these architectures in more detail.

SIMD - Single Instruction Multiple Data
- The same instruction is executed by multiple processors using different data streams.
- Each processor has its own data memory.
- A single instruction memory and control processor fetches and dispatches instructions.
- Processors are typically special-purpose.
- Simple programming model.

SIMD Architecture
A central array controller broadcasts instructions to multiple processing elements (PEs) over control and data paths; each PE has its own memory, and an inter-PE connection network links the PEs together.
- Only requires one controller for the whole array
- Only requires storage for one copy of the program
- All computations are fully synchronized

Reality: Sony Playstation 2000

Alternative Model: Vector Processing
Vector processors have high-level operations that work on linear arrays of numbers: "vectors".
A scalar add performs 1 operation:
    add r3, r1, r2        # r3 <- r1 + r2
A vector add performs N operations, where N is the vector length:
    add.vv v3, v1, v2     # v3[i] <- v1[i] + v2[i], for each element i

Vector Supercomputers
Epitomized by the Cray-1 (1976): a scalar unit plus vector extensions.
- Load/store architecture
- Vector registers
- Vector instructions
- Hardwired control
- Highly pipelined functional units
- Interleaved memory system
- No data caches
- No virtual memory

Vector programming model
- Scalar registers r0-r15 and vector registers v0-v15; each vector register holds elements [0] through [VLRMAX-1].
- The vector length register VLR sets how many elements an instruction processes.
- Vector arithmetic instructions, e.g. ADDV v3, v1, v2: element-wise add of v1 and v2 into v3, for elements [0] through [VLR-1].
- Vector load and store instructions, e.g. LV v1, r1, r2: load a vector register from memory starting at base address r1, stepping by stride r2.

Vector Code Example
# C code
for (i = 0; i < 64; i++)
    C[i] = A[i] + B[i];

# Scalar code
        LI     R4, #64
loop:   L.D    F0, 0(R1)
        L.D    F2, 0(R2)
        ADD.D  F4, F2, F0
        S.D    F4, 0(R3)
        DADDIU R1, 8
        DADDIU R2, 8
        DADDIU R3, 8
        DSUBIU R4, 1
        BNEZ   R4, loop

# Vector code
        LI     VLR, #64
        LV     V1, R1
        LV     V2, R2
        ADDV.D V3, V1, V2
        SV     V3, R3

Vector Arithmetic Execution
- Use a deep pipeline (=> fast clock) to execute the element operations.
- Control of the deep pipeline is simple because the elements in a vector are independent (=> no hazards!).
Example: a six-stage multiply pipeline computing V3 <- V1 * V2.

Vector Instruction Execution
ADDV C, A, B can execute using one pipelined functional unit (one element per cycle) or using four pipelined functional units in parallel (four elements per cycle, with consecutive elements spread across the units).

Vector Unit Structure
The vector unit is organized in lanes: each lane pairs a slice of the vector register file with its own functional-unit pipelines and its own port into the memory subsystem. With four lanes, lane 0 holds elements 0, 4, 8, ..., lane 1 holds elements 1, 5, 9, ..., lane 2 holds elements 2, 6, 10, ..., and lane 3 holds elements 3, 7, 11, ...

T0 Vector Microprocessor (UCB/ICSI, 1995)
Vector register elements are striped over the lanes: with eight lanes, lane 0 holds elements [0], [8], [16], [24], lane 1 holds elements [1], [9], [17], [25], and so on up to lane 7 with elements [7], [15], [23], [31].

Vector Applications
Limited to scientific computing? No:
- Multimedia processing (compression, graphics, audio synthesis, image processing)
- Standard benchmark kernels (matrix multiply, FFT, convolution, sort)
- Lossy compression (JPEG, MPEG video and audio)
- Lossless compression (zero removal, RLE, differencing, LZW)
- Cryptography (RSA, DES/IDEA, SHA/MD5)
- Speech and handwriting recognition
- Operating systems/networking (memcpy, memset, parity, checksum)
- Databases (hash/join, data mining, image/video serving)
- Language run-time support (stdlib, garbage collection)
- ... even SPECint95

MIMD - Multiple Instruction Multiple Data
- Each processor fetches its own instructions and operates on its own data.
- Processors are often off-the-shelf microprocessors.
- Scalable to a variable number of processor nodes.
- Flexible: single-user machines focusing on high performance for one specific application, multi-programmed machines running many tasks simultaneously, or some combination of these functions.
- Cost/performance advantages due to the use of off-the-shelf microprocessors.
- Fault tolerance issues.

Why MIMD?
MIMDs are flexible: they can function as single-user machines for high performance on one application, as multiprogrammed multiprocessors running many tasks simultaneously, or as some combination of such functions. They can also be built starting from standard CPUs (as is now the case for nearly all multiprocessors!).

MIMD
To exploit a MIMD with n processors we need at least n threads or processes to execute: independent threads, typically identified by the programmer or created by the compiler. The parallelism is contained in the threads, i.e. thread-level parallelism.
A thread can be anything from a large, independent process down to parallel iterations of a loop.
Important: parallelism is identified by the software (not by the hardware, as in superscalar CPUs!)... keep this in mind, we'll use it!

MIMD Machines
Existing MIMD machines fall into two classes, depending on the number of processors involved, which in turn dictates a memory organization and interconnection strategy.
Centralized shared-memory architectures:
- At most a few dozen processor chips (< 100 cores)
- Large caches, a single memory with multiple banks
- Often called symmetric multiprocessors (SMP), and the style of architecture called Uniform Memory Access (UMA)
Distributed-memory architectures:
- Support large processor counts
- Require a high-bandwidth interconnect
- Disadvantage: data communication among processors
In the centralized organization, processors P0-P3 with private caches share the main-memory modules through an interconnection network; in the distributed organization, each node (N0-N3) pairs a processor and cache with its own local memory, and the nodes communicate over the interconnection network.

Key issues in designing multiprocessors
- How many processors? How powerful are the processors?
- How do parallel processors share data? Where to place the physical memory?
- How do parallel processors cooperate and coordinate?
- What type of interconnection topology?
- How to program the processors?
- How to maintain cache coherency? How to maintain memory consistency?
- How to evaluate system performance?

The Tilera example

Create the most amazing game console

One core ("one core to rule them all")
- A 64-bit Power Architecture core
- Two-issue superscalar execution
- Two-way multithreaded core
- In-order execution
- Cache: a 32 KB instruction and a 32 KB data Level 1 cache, plus a 512 KB Level 2 cache; the cache-line size is 128 bytes

Cell: PS3
Cell is a heterogeneous chip multiprocessor:
- One 64-bit Power core
- 8 specialized co-processors based on a novel single-instruction multiple-data (SIMD) architecture, called SPUs (Synergistic Processor Units)

Duck Demo: SPE usage

Xenon: XBOX360
- Three symmetrical cores, each two-way SMT-capable and clocked at 3.2 GHz
- SIMD: VMX128 extension for each core
- 1 MB L2 cache (lockable by the GPU) running at half speed (1.6 GHz) with a 256-bit bus

Microsoft vision
Microsoft envisions a procedurally rendered game as having at least two primary components:
- Host thread: a game's host thread contains the main thread of execution for the game.
- Data-generation thread: where the actual procedural synthesis of object geometry takes place.
These two threads could run on the same PPE, or they could run on two separate PPEs. In addition to these two threads, the game could make use of separate threads for handling physics, artificial intelligence, player input, etc.

The Xenon architecture

Questions
"RISK more than others think is safe, CARE more than others think is wise, DREAM more than others think is practical, EXPECT more than others think is possible." - Cadet Maxim