Parallel Computing Platforms
Jinkyu Jeong (jinkyu@skku.edu)
Computer Systems Laboratory, Sungkyunkwan University
http://csl.skku.edu


Elements of a Parallel Computer

Hardware
- Multiple processors
- Multiple memories
- Interconnection network

System software
- Parallel operating system
- Programming constructs to express/orchestrate concurrency

Application software
- Parallel algorithms

Goal: use the hardware, system software, and application software together to
- achieve speedup: ideally Tp = Ts / p, where Ts is the serial execution time and Tp is the execution time on p processors
- solve problems requiring a large amount of memory

Parallel Computing Platform

Logical organization
- The user's view of the machine, as presented via its system software

Physical organization
- The actual hardware architecture

The physical architecture is to a large extent independent of the logical architecture.
- Ex) message passing on a shared-memory architecture; distributed shared memory (DSM) systems

Logical Organization Elements

Control mechanism: Flynn's taxonomy classifies machines by their instruction and data streams:
- SISD: Single Instruction stream, Single Data stream (the single-core processor; not covered here)
- MISD: Multiple Instruction streams, Single Data stream
- SIMD: Single Instruction stream, Multiple Data streams
- MIMD: Multiple Instruction streams, Multiple Data streams (the multi-core processor)

SIMD vs. MIMD

[Figure: block diagrams contrasting a SIMD architecture and a MIMD architecture]

SIMD

Exploits data parallelism: the same instruction is applied to multiple data items. For example, the loop

    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];

can run on a SIMD unit that loads four elements each of b and c (aligned on 16-byte boundaries) into vector registers vr1 and vr2, adds them with one instruction into vr3, and stores the four sums into a.

[Figure: a 4-wide SIMD unit; vr1 = (b0..b3), vr2 = (c0..c3), vr3 = (b0+c0 .. b3+c3), stored to a0..a3]

SIMD (continued)

SIMD units appear across the processor landscape:
- Supercomputers: BlueGene/Q
- PCs: MMX/SSE/AVX (x86), AltiVec/VMX (PowerPC)
- Embedded systems: Neon (ARM), VLIW+SIMD DSPs
- Co-processors: NVIDIA GPGPUs

MIMD

- Multiple instruction streams operate on multiple data items
- A collection of independent processing elements (or cores)
- Usually exploits thread-level parallelism
- Modern parallel computing platforms are MIMD; SIMD can also be used within such a system

Programming Model

What the programmer uses when coding applications; it specifies communication and synchronization through instructions, APIs, and defined data structures.

Examples:
- Shared address space: load/store instructions access shared data for communication
- Message passing: special system libraries and APIs for data transmission
- Data parallel: well-structured data, with the same operation applied to multiple data items in parallel; implemented on top of shared address space or message passing

Shared Address Space Architecture

- Any processor can directly reference any memory location
- Communication occurs implicitly as a result of loads and stores
- Location transparency (flat address space)
- Programming model is similar to time-sharing on a uniprocessor, except that processes run on different processors; good throughput on multiprogrammed workloads
- Popularly known as the shared memory machine/model, though memory may be physically distributed among processors

Shared Address Space Architecture (continued)

- Multiprocessing: one or more threads run in a virtual address space
- Portions of the address spaces of processes are shared; writes to shared addresses are visible to other threads/processes
- A natural extension of the uniprocessor model: conventional memory operations for communication, special atomic operations for synchronization

[Figure: virtual address spaces of a collection of processes communicating via shared addresses, mapped onto the machine's physical address space]

x86 Examples of Shared Address Space

Quad-core processors are highly integrated, commodity systems. Multiple cores on a chip give low-latency, high-bandwidth communication through a shared cache.

[Figure: quad-core chip diagrams of the Intel i7 (Nehalem) and AMD Phenom II (Barcelona), with cores communicating through shared on-chip L2/L3 caches]

Earlier x86 Example: Intel Pentium Pro Quad

- All coherence and multiprocessing glue is in the processor module
- High latency and low bandwidth compared to on-chip integration

[Figure: four P-Pro modules (CPU, interrupt controller, 256-KB L2 cache, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), connected via PCI bridges to PCI I/O cards and via a memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM]

Shared Address Space Architecture: Physical Organization

- Shared memory system
  - Uniform memory access (UMA)
  - Non-uniform memory access (NUMA)
- Distributed memory system
  - Cluster of shared memory systems
  - Hardware- or software-based distributed shared memory (DSM)

[Figure: block diagrams of a UMA system, a NUMA system, and a distributed memory system]

Scaling Up

[Figure: "dance hall" (UMA) organization, with processors and memories on opposite sides of the network, vs. distributed memory (NUMA), with a memory attached to each processor]

- The problem is the interconnect: cost (crossbar) or bandwidth (bus)
- Shared memory (uniform memory access, UMA): latencies to memory are uniform, but uniformly large
- Distributed memory (non-uniform memory access, NUMA): constructs a shared address space out of simple message transactions across a general-purpose network
- Caches keep shared data: local data, and also non-local data in NUMA

Example: SGI Altix UV 1000

- Scales up to 262,144 cores and 16 TB of shared memory
- 15 GB/s links, multistage interconnection network
- Hardware cache coherence (ccNUMA)

Parallel Programming Models

- Shared address space
- Message passing
- Data parallel

Message Passing Architectures

- Complete computers as building blocks; communication via explicit I/O operations
- Programming model: directly access only the private address space (local memory); communicate via explicit messages (send/receive)
- High-level block diagram is similar to a distributed-memory shared address space system, but communication is integrated at the I/O level, not the memory level
- Easier to build than scalable shared address space (SAS) systems

Message Passing Abstraction

[Figure: process P executes Send(X, Q, t), transmitting local address X to process Q; process Q executes Receive(Y, P, t), receiving into local address Y; the tag t matches the send to the receive]

- Send specifies the buffer to be transmitted and the receiving process
- Recv specifies the sending process and the buffer to receive into
- Can be a memory-to-memory copy, but the processes must be named
- Optional tag on the send, with a matching rule on the receive
- Many overheads: copying, buffer management, protection

Message Passing Architectures: Physical Organization

- Shared memory system: uniform memory access (UMA) or non-uniform memory access (NUMA)
- Distributed memory system: cluster of shared memory systems

[Figure: block diagrams of a UMA system, a NUMA system, and a distributed memory system]

Example: IBM Blue Gene/L

- Nodes: two PowerPC 440 processors
- Everything (except DRAM) integrated on one chip

Example: IBM SP-2

- Made out of essentially complete RS/6000 workstations
- Network interface integrated on the I/O bus (MicroChannel); bandwidth limited by the I/O bus
- General interconnection network formed from 8-port switches

[Figure: an SP-2 node with a POWER2 CPU, L2 cache, memory bus, memory controller, and 4-way interleaved DRAM; the NIC (with an i860 processor, DMA engine, and NI DRAM) sits on the MicroChannel bus]

Taxonomy of Common Systems

Large multiprocessors divide into shared address space and message passing systems:

- Shared address space
  - Symmetric shared memory (SMP): Ex) IBM eServer, Sun Fire
  - Distributed shared memory (DSM)
    - Cache coherent (ccNUMA): Ex) SGI Origin/Altix
    - Non-cache coherent: Ex) Cray T3E, X1
- Distributed address space (aka message passing)
  - Commodity clusters: Ex) Beowulf
  - Custom clusters
    - Uniform cluster: Ex) IBM Blue Gene
    - Constellation (cluster of DSMs or SMPs): Ex) SGI Altix, ASC Purple

Parallel Programming Models

- Shared address space
- Message passing
- Data parallel

Data Parallel Systems

Programming model:
- Operations performed in parallel on each element of a data structure
- Logically a single thread of control, alternating sequential steps and parallel steps

Architectural model:
- An array of many simple, cheap processors, each with little memory, attached to a control processor that issues instructions
- Cheap global synchronization; centralizes the high cost of instruction fetch and sequencing
- A perfect fit for differential equation solvers

Evolution and Convergence

- Architectures converged to shared/distributed address space architectures
  - The rigid control structure is a minus for general-purpose use
  - Simple, regular applications have good locality and can do well anyway
  - Loss of applicability due to hardwired data parallelism
- The programming model converged to SPMD (Single Program Multiple Data)
  - The same program runs on every PE, with behavior conditional on the thread ID
  - Contributed the need for fast global synchronization
  - Can be implemented on either shared address space or message passing systems