Parallel Computing Platforms

Jinkyu Jeong (jinkyu@skku.edu)
Computer Systems Laboratory, Sungkyunkwan University
http://csl.skku.edu
SSE3054: Multicore Systems, Spring 2017

Elements of a Parallel Computer
- Hardware: multiple processors, multiple memories, an interconnection network
- System software: a parallel operating system, programming constructs to express/orchestrate concurrency
- Application software: parallel algorithms
- Goal: use the hardware, system software, and application software to
  - achieve speedup: ideally Tp = Ts/p (parallel time = serial time divided by the number of processors)
  - solve problems requiring a large amount of memory

Parallel Computing Platform
- Logical organization: the user's view of the machine, as presented by the system software
- Physical organization: the actual hardware architecture
- The physical architecture is to a large extent independent of the logical architecture
  - Ex) message passing on a shared-memory architecture, distributed shared memory (DSM) systems

Logical Organization Elements
- Control mechanism: Flynn's taxonomy
  - SISD: Single Instruction stream, Single Data stream (e.g., a single-core processor)
  - MISD: Multiple Instruction stream, Single Data stream (not covered)
  - SIMD: Single Instruction stream, Multiple Data stream
  - MIMD: Multiple Instruction stream, Multiple Data stream (e.g., a multi-core processor)

SIMD vs. MIMD
- [Figure: block diagrams of a SIMD architecture and a MIMD architecture]

SIMD
- Exploits data parallelism: the same instruction is applied to multiple data items
  for (i=0; i<n; i++) a[i] = b[i] + c[i];
- [Figure: a 4-wide SIMD unit loads b0..b3 and c0..c3 (aligned to 16-byte boundaries) into vector registers vr1 and vr2, adds them with one instruction, and stores the result vr3 into a0..a3]
- (An intrinsics sketch of this loop follows below.)
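
As a concrete illustration of the loop above, here is a minimal sketch (not from the slides) using x86 SSE intrinsics; it assumes the arrays are 16-byte aligned and n is a multiple of 4, so each vector instruction processes four elements at once.

#include <xmmintrin.h>   /* SSE intrinsics */

/* Hypothetical example: a[i] = b[i] + c[i], four floats per instruction. */
void vec_add(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 vb = _mm_load_ps(&b[i]);  /* load b[i..i+3] into a vector register */
        __m128 vc = _mm_load_ps(&c[i]);  /* load c[i..i+3] */
        __m128 va = _mm_add_ps(vb, vc);  /* one instruction adds all four lanes */
        _mm_store_ps(&a[i], va);         /* store the four results into a[i..i+3] */
    }
}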

SIMD
- Exploits data parallelism: the same instruction on multiple data items
- SIMD units in processors
  - Supercomputers: BlueGene/Q
  - PCs: MMX/SSE/AVX (x86), AltiVec/VMX (PowerPC)
  - Embedded systems: Neon (ARM), VLIW+SIMD DSPs
  - Co-processors: GPGPUs

MIMD
- Multiple instructions on multiple data items
- A collection of independent processing elements (or cores)
- Usually exploits thread-level parallelism
- The model behind modern parallel computing platforms, e.g., multicore processors
- SIMD can also be used within such a system
- (A minimal threading sketch follows below.)
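
A minimal sketch (not from the slides) of MIMD execution using POSIX threads: two threads run different instruction streams on different data at the same time on a multicore processor.

#include <pthread.h>
#include <stdio.h>

static void *sum_worker(void *arg) {            /* instruction stream 1 */
    int *v = arg, s = 0;
    for (int i = 0; i < 4; i++) s += v[i];
    printf("sum = %d\n", s);
    return NULL;
}

static void *max_worker(void *arg) {            /* instruction stream 2 */
    int *v = arg, m = v[0];
    for (int i = 1; i < 4; i++) if (v[i] > m) m = v[i];
    printf("max = %d\n", m);
    return NULL;
}

int main(void) {
    int x[4] = {1, 2, 3, 4}, y[4] = {7, 5, 9, 6};
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_worker, x);   /* different code, data set x */
    pthread_create(&t2, NULL, max_worker, y);   /* different code, data set y */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}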

Programming Model
- What the programmer uses when coding applications
- Specifies communication and synchronization
- Exposed as instructions, APIs, and defined data structures
- Examples
  - Shared address space: load/store instructions access shared data for communication
  - Message passing: special system libraries and APIs for data transmission
  - Data parallel: well-structured data, the same operation applied to multiple data items in parallel; implemented on top of shared address space or message passing

Shared Address Space Architecture
- Shared address space
  - Any processor can directly reference any memory location
  - Communication occurs implicitly as a result of loads and stores
- Location transparency (flat address space)
- Programming model similar to time-sharing on uniprocessors, except that processes run on different processors
  - Good throughput on multi-programmed workloads
- Popularly known as the shared memory machine/model
  - Memory may nonetheless be physically distributed among processors

Shared Address Space Architecture
- Multi-processing: one or more threads on a virtual address space
- Portions of the address spaces of processes are shared
- Writes to shared addresses are visible to other threads/processes
- A natural extension of the uniprocessor model
  - Conventional memory operations for communication
  - Special atomic operations for synchronization (see the sketch below)
- [Figure: virtual address spaces of a collection of processes communicating via shared addresses, mapped onto the machine's physical address space]
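
A minimal sketch (not from the slides) of this model with POSIX threads: the threads communicate through ordinary loads and stores to a shared variable, and a mutex stands in for the "special atomic operations" used for synchronization.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                          /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                /* synchronization */
        counter++;                                /* communication via load/store */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);           /* prints 400000 */
    return 0;
}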

x86 Examples of Shared Address Space
- Quad-core processors: highly integrated, commodity systems
- Multiple cores on a chip: low-latency, high-bandwidth communication via a shared cache
- [Figure: die diagrams of the Intel i7 (Nehalem) and AMD Phenom II (Barcelona), with four cores communicating through shared L2/L3 caches]

Earlier x86 Example: Intel Pentium Pro Quad
- All coherence and multiprocessing glue is in the processor module
- High latency and low bandwidth
- [Figure: four P-Pro modules, each with a CPU, interrupt controller, 256-KB L2 cache, and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); PCI bridges connect PCI I/O cards, and a memory controller with MIU drives 1-, 2-, or 4-way interleaved DRAM]

Shared Address Space Architecture
- Physical organization
  - Shared memory system: uniform memory access (UMA) or non-uniform memory access (NUMA)
  - Distributed memory system: a cluster of shared memory systems, with hardware- or software-based distributed shared memory (DSM)
- [Figure: UMA system, NUMA system, and distributed memory system]

Scaling Up
- [Figure: "dance hall" (UMA) organization, with processors and caches on one side of the network and memories on the other, vs. distributed memory (NUMA), with a memory module attached to each processor]
- The problem is the interconnect: cost (crossbar) or bandwidth (bus)
- Shared memory (uniform memory access, UMA): latencies to memory are uniform, but uniformly large
- Distributed memory (non-uniform memory access, NUMA): construct a shared address space out of simple message transactions across a general-purpose network
- Caches keep shared data: local data, and also non-local data in NUMA

Example: SGI Altix UV 1000
- Scales up to 262,144 cores
- 16 TB shared memory
- 15 GB/s links
- Multistage interconnection network
- Hardware cache coherence (ccNUMA)

Parallel Programming Models
- Shared Address Space
- Message Passing
- Data Parallel

Message Passing Architectures
- A complete computer is the building block
- Communication via explicit I/O operations
- Programming model
  - Directly access only the private address space (local memory)
  - Communicate via explicit messages (send/receive)
- High-level block diagram similar to a distributed-memory shared address space system
  - But communication is integrated at the I/O level, not the memory level
  - Easier to build than a scalable shared address space system

Message Passing Abstraction
- [Figure: process P executes "Send X, Q, t" and process Q executes "Receive Y, P, t"; the message is matched by sender, receiver, and tag t, copying data from local address X in P's address space to local address Y in Q's address space]
- Send specifies the buffer to be transmitted and the receiving process
- Recv specifies the sending process and the buffer to receive into
- Can be a memory-to-memory copy, but the processes must be named
- Optional tag on the send and a matching rule on the receive
- Many overheads: copying, buffer management, protection
- (An MPI sketch of this abstraction follows below.)
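
A minimal sketch (not from the slides) of the same abstraction using MPI: rank 0 plays the sender P and rank 1 the receiver Q, with the tag value 7 standing in for the slide's "t".

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, x = 42, y = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send names the buffer, the receiving process, and a tag */
        MPI_Send(&x, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Recv names the sending process and the receive buffer; tags must match */
        MPI_Recv(&y, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", y);
    }

    MPI_Finalize();
    return 0;
}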

Message Passing Architectures
- Physical organization
  - Shared memory system: uniform memory access (UMA) or non-uniform memory access (NUMA)
  - Distributed memory system: a cluster of shared memory systems
- [Figure: UMA system, NUMA system, and distributed memory system]

Example: IBM Blue Gene/L
- Nodes: two PowerPC 440 cores
- Everything (except DRAM) on one chip

Example: IBM SP-2
- Built from essentially complete RS/6000 workstations
- Network interface integrated into the I/O bus
- Bandwidth limited by the I/O bus
- [Figure: an IBM SP-2 node: a POWER2 CPU and L2 cache on the memory bus, a memory controller driving 4-way interleaved DRAM, and a network interface card (i860 processor, DMA engine, NI, and DRAM) sitting on the MicroChannel I/O bus]

Taxonomy of Common Systems
- Large-scale shared address space and message passing systems (large multiprocessors)
- Shared address space
  - Symmetric shared memory (SMP): Ex) IBM eServer, SUN Sunfire
  - Distributed shared memory (DSM)
    - Cache coherent (ccNUMA): Ex) SGI Origin/Altix
    - Non-cache coherent: Ex) Cray T3E, X1
- Distributed address space (aka message passing)
  - Commodity clusters: Ex) Beowulf
  - Custom clusters
    - Uniform cluster: Ex) IBM Blue Gene
    - Constellation (cluster of DSMs or SMPs): Ex) SGI Altix, ASC Purple

Parallel Programming Models
- Shared Address Space
- Message Passing
- Data Parallel

Data Parallel Systems
- Programming model
  - Operations are performed in parallel on each element of a data structure
  - Logically a single thread of control; sequential steps alternate with parallel steps
- Architectural model
  - An array of many simple, cheap processors, each with little memory
  - Attached to a control processor that issues instructions
  - Cheap global synchronization
  - Centralizes the high cost of instruction fetch and sequencing
- A perfect fit for differential equation solvers
- (A small data-parallel sketch follows below.)
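
A small sketch (not from the slides) of the data-parallel programming model, written here with OpenMP on a shared address space: one logical thread of control alternates sequential steps with a parallel step that applies the same operation to every element of an array.

#include <stdio.h>

#define N 1024

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)        /* sequential step: initialize */
        b[i] = (double)i;

    #pragma omp parallel for           /* parallel step: same operation on every element */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    for (int i = 0; i < N; i++)        /* sequential step: combine the results */
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}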

Evolution and Convergence
- Architectures converged to shared address space / distributed address space (SAS/DAS) architectures
  - A rigid control structure is a minus for general-purpose use
  - Simple, regular applications have good locality and can do well anyway
  - Loss of applicability due to hardwired data parallelism
- The programming model converged to SPMD (Single Program Multiple Data)
  - Contributed the need for fast global synchronization
  - Can be implemented on either shared address space or message passing systems
  - The same program runs on different PEs; behavior is conditional on the thread ID (see the sketch below)
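
A minimal SPMD sketch (not from the slides) using MPI: every process runs the same program, behavior is selected by the process (rank) ID, and a barrier provides global synchronization.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("rank 0: coordinating %d processes\n", size);   /* behavior conditional on ID */
    else
        printf("rank %d: working on my partition of the data\n", rank);

    MPI_Barrier(MPI_COMM_WORLD);    /* fast global synchronization point */

    MPI_Finalize();
    return 0;
}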