Convergence of Parallel Architecture


Parallel Computing
Convergence of Parallel Architecture
Hwansoo Han

History
- Parallel architectures were tied closely to programming models
- Divergent architectures, with no predictable pattern of growth
- Uncertainty of direction paralyzed parallel software development
- [Figure: application software and system software atop divergent architectures — systolic arrays, dataflow, SIMD, shared memory, message passing]

Today
- Extension of computer architecture
  - OLD: Instruction Set Architecture
  - NEW: Communication Architecture
- Communication architecture
  - Organizational structures which implement interfaces
  - Can be implemented with HW or SW
  - Compilers, libraries, and the OS are important bridges to the communication architecture

Modern Layered Framework
- Parallel applications: CAD, database, scientific modeling, multiprogramming
- Programming models: shared address, message passing, data parallel
- Communication abstraction (user/system boundary): compilation or library, operating system support
- Communication hardware (hardware/software boundary)
- Physical communication medium

Programming Model
- What the programmer uses in coding applications
  - Specifies communication and synchronization
  - Instructions, APIs, defined data structures
- Programming model examples
  - Shared address space: load/store instructions to access the data for communication
  - Message passing: special system libraries, APIs for data transmission
  - Data parallel: well-structured data, same operation applied to multiple data in parallel; implemented with shared address space or message passing

Shared Address Space Architecture
- Shared address space
  - Any processor can directly reference any memory location
  - Communication occurs implicitly as a result of loads and stores
  - Location transparency (flat address space)
- Similar programming model to time-sharing on uniprocessors
  - Except processes run on different processors
  - Good throughput on multiprogrammed workloads
- Popularly known as the shared memory machine/model
  - Memory may be physically distributed among processors

Shared Address Space Architecture
- Multiprocessing
  - One or more threads on a virtual address space
  - Portions of the address spaces of processes are shared
  - Writes to shared addresses are visible to other threads/processes
- Natural extension of the uniprocessor model
  - Conventional memory operations for communication
  - Special atomic operations for synchronization
- [Figure: virtual address spaces of a collection of processes communicating via shared addresses, mapped onto the machine's physical address space]

x86 Examples: Shared Address Space
- Quad-core processors: highly integrated, commodity systems
- Multiple cores on a chip → low-latency, high-bandwidth communication via a shared cache
- [Figure: four cores sharing an L2 cache (Intel i7, Nehalem) and four cores sharing an L3 cache (AMD Phenom II, Barcelona)]

Earlier x86 Example: Intel Pentium Pro Quad
- All coherence and multiprocessing glue in the processor module
- High latency and low bandwidth
- [Figure: four P-Pro modules (CPU, interrupt controller, 256-KB L2 cache, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); PCI bridges to PCI I/O cards; memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM]

Example: Sun SPARC Enterprise M9000
- 64 SPARC64 VII+ quad-core processors (256 cores)
- Crossbar bandwidth: 245 GB/s (snoop bandwidth)
- Memory latency: 437–532 ns (1050–1277 cycles @ 2.4 GHz)
- Higher bandwidth, but higher latency

Scaling Up
- [Figure: "dance hall" (UMA) organization — processors and caches on one side of the network, memories on the other — vs. distributed memory (NUMA) — memory attached to each processor node]
- Problem is the interconnect: cost (crossbar) or bandwidth (bus)
- Shared memory (uniform memory access, UMA)
  - Latencies to memory are uniform, but uniformly large
- Distributed memory (non-uniform memory access, NUMA)
  - Construct a shared address space out of simple message transactions across a general-purpose network
- Cache: keeps shared data (local data, and also non-local data in NUMA)

Example: SGI Altix UV 1000
- Scales up to 262,144 cores
- 16 TB shared memory
- 15 GB/s links
- Multistage interconnection network
- Hardware cache coherence (ccNUMA)

Parallel Programming Models
- Shared Address Space
- Message Passing
- Data Parallel
- Dataflow
- Systolic Arrays

Message Passing Architectures
- Complete computer as the building block
- Communication via explicit I/O operations
- Programming model
  - Directly access only the private address space (local memory)
  - Communicate via explicit messages (send/receive)
- High-level block diagram similar to distributed-memory SAS
  - But communication is integrated at the I/O level, not the memory level
  - Easier to build than scalable SAS

Message Passing Abstraction
- [Figure: process P executes Send X, Q, t while process Q executes Receive Y, P, t; the matched pair copies data from address X in P's local address space to address Y in Q's]
- Send specifies the buffer to be transmitted and the receiving process
- Recv specifies the sending process and the buffer to receive into
- Can be a memory-to-memory copy, but processes must be named
- Optional tag on send and matching rule on receive
- Many overheads: copying, buffer management, protection

Example: IBM Blue Gene/L
- Nodes: two PowerPC 440 cores
- Everything (except DRAM) on one chip

Example: IBM SP-2
- Made out of an essentially complete RS/6000 workstation
- Network interface integrated into the I/O bus
- Bandwidth limited by the I/O bus
- [Figure: SP-2 node — POWER2 CPU, L2 cache, memory bus, memory controller, 4-way interleaved DRAM; MicroChannel bus with DMA and an i860-based NIC, connecting to a general interconnection network formed from 8-port switches]

Taxonomy of Common Systems
Large-scale SAS and MP systems:
- Shared address space (large multiprocessors)
  - Symmetric shared memory (SMP) — e.g., IBM eServer, Sun Sunfire
  - Distributed shared memory (DSM)
    - Cache coherent (ccNUMA) — e.g., SGI Origin/Altix
    - Non-cache coherent — e.g., Cray T3E, X1
- Distributed address space (a.k.a. message passing)
  - Commodity clusters — e.g., Beowulf
  - Custom clusters
    - Uniform cluster — e.g., IBM Blue Gene
    - Constellation (cluster of DSMs or SMPs) — e.g., SGI Altix, ASC Purple


Data Parallel Systems
- Programming model
  - Operations performed in parallel on each element of a data structure
  - Logically a single thread of control
  - Alternates sequential steps and parallel steps
- Architectural model
  - Array of many simple, cheap processors, each with little memory
  - Attached to a control processor that issues instructions
  - Cheap global synchronization
  - Centralizes the high cost of instruction fetch and sequencing
- Perfect fit for a differential equation solver

Evolution and Convergence
- Architectures converged to SAS/DAS architectures
  - Rigid control structure is a minus for general-purpose use
  - Simple, regular applications have good locality and can do well anyway
  - Loss of applicability due to hardwired data parallelism
- Programming model converged to SPMD
  - Single program, multiple data (SPMD)
  - Contributed the need for fast global synchronization
  - Can be implemented on either SAS or MP


Dataflow Architectures
- Represent computation as a graph of essential dependences
- Logical processor at each node, activated by the availability of operands
- Messages (tokens) carrying the tag of the next instruction are sent to the next processor
- [Figure: dataflow graph for a = (b+1)(b−c), d = ce, f = ad, and a processing-element pipeline — token store, waiting/matching, instruction fetch, execute, form token, token queue — connected by the network]

Systolic Architecture
- VLSI enables inexpensive special-purpose chips
  - Represent algorithms directly by chips connected in a regular pattern
  - Replace a single processor with an array of regular processing elements
  - Orchestrate data flow for high throughput with fewer memory accesses
- Systolic array for 1D convolution:
  y(i) = w1·x(i) + w2·x(i+1) + w3·x(i+2) + w4·x(i+3)
- [Figure: weights w4…w1 held in the cells; inputs x1…x8 stream through; outputs y1, y2, y3 emerge; each cell computes x_out = x; x = x_in; y_out = y_in + w·x_in]

Generic Parallel Architecture
- Convergence to a generic parallel multiprocessor
  - Node: processor(s), memory system, communication assist (CA)
  - Scalable network
- Convergence allows lots of innovation
  - Integration of communication
  - Efficient operation across nodes
- [Figure: nodes — each with processor, cache, memory, and communication assist (CA) — connected by a network]