Learning Curve for Parallel Applications. 500 Fastest Computers


Learning Curve for Parallel Applications

AMBER molecular dynamics simulation program. Starting point was vector code for the Cray-1. 145 MFLOPS on the Cray C90, 406 for the final version on a 128-processor Paragon, 891 on a 128-processor Cray T3D.

500 Fastest Computers

[Chart: number of systems of each architecture class in the 500 fastest list, 11/93 to 11/96; the MPP count grows from 187 to 319 while vector systems decline.]

Shared Address Space Model

Process: a virtual address space plus one or more threads of control. Portions of the address spaces of different processes are shared.

[Figure: virtual address spaces for a collection of processes communicating via shared addresses; each space has a private portion and a shared portion mapped to common physical addresses in the machine physical address space, with loads and stores to the shared region.]

Writes to shared addresses are visible to other threads (in other processes too). Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization. The OS uses shared memory to coordinate processes.

Communication Hardware

Also a natural extension of the uniprocessor: we already have processors, one or more memory modules and controllers, and I/O devices connected by a hardware interconnect of some sort.

[Figure: processors, memory modules with controllers, and I/O devices attached to an interconnect.]

Memory capacity is increased by adding modules, I/O by adding controllers. Add processors for processing! For higher-throughput multiprogramming, or for parallel programs.
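The shared-address-space model above can be sketched with ordinary threads: communication happens through plain loads and stores to shared variables, and a lock stands in for the special atomic synchronization operations. A minimal Python sketch (the variable and function names are illustrative, not from the original):

```python
import threading

# Shared portion of the address space: an ordinary variable that every
# thread can load and store. The lock plays the role of the special
# atomic operation the model reserves for synchronization.
counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:          # atomic read-modify-write
            counter += 1    # ordinary store, visible to all other threads

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 4000: with the lock, no update is lost
```

Without the lock, the four threads' read-modify-write sequences could interleave and drop updates, which is exactly why the model distinguishes ordinary memory operations from atomic synchronization operations.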

History

Mainframe approach: motivated by multiprogramming. Extends the crossbar used for memory bandwidth and I/O. Cost was originally limited by processors, later by the crossbar itself. Bandwidth scales with p, but incremental cost is high; use multistage networks instead.

Minicomputer approach: almost all microprocessor systems have a bus. Motivated by multiprogramming and transaction processing; used heavily for parallel computing. Called a symmetric multiprocessor (SMP). Latency is larger than for a uniprocessor, and the bus is the bandwidth bottleneck, so caching is key: the coherence problem. Low incremental cost.

Example: Intel Pentium Pro Quad

[Figure: four P-Pro modules, each with CPU, interrupt controller, 256-KB L2, and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); two PCI bridges with PCI buses and PCI I/O cards; memory controller and MIU with 1-, 2-, or 4-way interleaved DRAM.]

All coherence and multiprocessing glue is in the processor module. Highly integrated, targeted at high volume. Low latency and bandwidth.

Example: SUN Enterprise

[Figure: CPU/mem cards (two processors with L2 caches, memory controller, and memory) and I/O cards (bus interface/switch, SBUS slots, 100bT, SCSI, FiberChannel) on the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]

16 cards of either type: processors + memory, or I/O. All memory is accessed over the bus, so the machine is symmetric. Higher bandwidth, higher latency bus.

Scaling Up

[Figure: "dance hall" organization, with processors and memory modules on opposite sides of a scalable network, vs. distributed memory, with a memory module attached to each processing node.]

The problem is the interconnect: cost (crossbar) or bandwidth (bus). Dance hall: bandwidth is still scalable, at lower cost than a crossbar, but latencies to memory are uniform, and uniformly large. Distributed memory, or non-uniform memory access (NUMA): construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response). Caching shared (particularly nonlocal) data?

Example: Cray T3E

[Figure: node with processor, memory, memory controller with network interface, and external I/O, connected by a 3D switch with X, Y, and Z links.]

Scales up to 1024 processors with 480 MB/s links. The memory controller generates a communication request for nonlocal references. No hardware mechanism for coherence (SGI Origin and others provide this).

Message Passing Architectures

A complete computer as the building block, including I/O; communication via explicit I/O operations. Programming model: directly access only the private address space (local memory); communicate via explicit messages (send/receive). The high-level block diagram is similar to distributed-memory SAS, but communication is integrated at the I/O level and needn't go into the memory system. Like networks of workstations (clusters), but with tighter integration. Easier to build than scalable SAS, but the programming model is further removed from basic hardware operations: library or OS intervention.

Message-Passing Abstraction

[Figure: process P executes "Send X, Q, t", naming local address X, destination process Q, and tag t; process Q executes the matching "Receive Y, P, t", naming local address Y, source process P, and tag t. Each process has its own local address space.]

Send specifies the buffer to be transmitted and the receiving process. Recv specifies the sending process and the application storage to receive into. Memory-to-memory copy, but processes must be named. Optional tag on the send and matching rule on the receive. The user process names local data and entities in process/tag space too. In the simplest form, the send/recv match achieves a pairwise synchronization event; other variants exist too. Many overheads: copying, buffer management, protection.

Evolution of Message-Passing Machines

[Figure: hypercube topology with nodes 000 through 111.]

Early machines: a FIFO on each link. Hardware close to the programming model; synchronous ops. Replaced by DMA, enabling non-blocking ops: messages buffered by the system at the destination until recv. Diminishing role of topology: store-and-forward routing made topology important; the introduction of pipelined routing made it less so. The cost is in the node-network interface. Simplifies programming.
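The send/recv matching described above can be modeled in a few lines. This sketch uses Python threads and explicit channels as a stand-in for separate processes with private memories; the helper names (`send`, `recv`) and the (source, tag) matching rule are illustrative assumptions, not a real message-passing API:

```python
import queue
import threading

# One channel per direction models the network. Messages carry
# (source, tag, payload); each "process" touches only its own local
# variables plus explicit send/recv calls on the channels.

to_q = queue.Queue()
to_p = queue.Queue()

def send(chan, src, tag, data):
    chan.put((src, tag, data))        # copy out of the sender's storage

def recv(chan, want_src, want_tag):
    src, tag, data = chan.get()       # blocks: pairwise synchronization event
    assert (src, tag) == (want_src, want_tag), "matching rule failed"
    return data                       # copied into the receiver's storage

def process_q():
    x = recv(to_q, "P", "t")          # Receive Y, P, t
    send(to_p, "Q", "ack", x * 2)

t = threading.Thread(target=process_q)
t.start()
x = 21                                # P's private data
send(to_q, "P", "t", x)               # Send X, Q, t
y = recv(to_p, "Q", "ack")
t.join()
print(y)                              # prints 42
```

Note that the payload is copied through the channel rather than shared, mirroring the memory-to-memory copy in the slide, and that the blocking `recv` is what provides the pairwise synchronization.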

Example: IBM SP-2

[Figure: IBM SP-2 node with Power 2 CPU, L2, memory bus, memory controller, and 4-way interleaved DRAM; a NIC on the Microchannel bus containing an i860, DMA engine, NI, and DRAM; general interconnection network formed from 8-port switches.]

Made out of essentially complete RS6000 workstations. The network interface is integrated on the I/O bus (bandwidth limited by the I/O bus).

Example: Intel Paragon

[Figure: Intel Paragon node with two i860 processors and L1 caches on a memory bus (64-bit, 50 MHz), memory controller with DMA and driver, 4-way interleaved DRAM, and NI; a 2D grid network with a processing node attached to every switch (8 bits, 175 MHz, bidirectional). Sandia's Intel Paragon XP/S-based supercomputer.]

Toward Architectural Convergence

The evolution and role of software have blurred the boundary. Send/recv is supported on SAS machines via buffers; a global address space can be constructed on message-passing machines using hashing; page-based (or finer-grained) shared virtual memory. Hardware organization is converging too: tighter NI integration even for message passing (low-latency, high-bandwidth); at a lower level, even hardware SAS passes hardware messages. Even clusters of workstations/SMPs are parallel systems, with the emergence of fast system area networks (SANs). Programming models remain distinct, but organizations are converging: nodes connected by a general network and communication assists. Implementations are also converging, at least in high-end machines.

Convergence: Generic Parallel Architecture

[Figure: a generic modern multiprocessor: nodes, each with processor(s), memory, and a communication assist (CA), connected by a scalable network.]

Node: processor(s), memory system, plus a communication assist (network interface and communication controller). Scalable network. Convergence allows lots of innovation, now within a framework: integration of the assist with the node, what operations it supports, how efficiently...
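The first convergence point above, send/recv supported on shared-address-space machines via buffers, can be illustrated directly: message passing layered on a shared buffer, using only ordinary loads and stores plus a lock and condition variables. A hypothetical sketch (the `Mailbox` class is an illustration, not a real API):

```python
import threading
from collections import deque

# A mailbox in shared memory: send copies a message into a shared
# bounded buffer; recv copies it out. The only primitives used are
# ordinary memory operations plus one lock and two condition variables.

class Mailbox:
    def __init__(self, capacity=8):
        self.buf = deque()
        self.capacity = capacity
        self.lock = threading.Lock()
        self.nonempty = threading.Condition(self.lock)
        self.nonfull = threading.Condition(self.lock)

    def send(self, msg):
        with self.lock:
            while len(self.buf) >= self.capacity:
                self.nonfull.wait()       # buffer full: sender blocks
            self.buf.append(msg)          # copy into the shared buffer
            self.nonempty.notify()

    def recv(self):
        with self.lock:
            while not self.buf:
                self.nonempty.wait()      # block until a message arrives
            msg = self.buf.popleft()      # copy out to private storage
            self.nonfull.notify()
            return msg

mbox = Mailbox()

def consumer(out):
    out.append(sum(mbox.recv() for _ in range(100)))

result = []
t = threading.Thread(target=consumer, args=(result,))
t.start()
for i in range(100):
    mbox.send(i)
t.join()
print(result[0])  # 4950 = sum(range(100))
```

The bounded capacity also models the buffer-management overheads the message-passing slides mention: when the buffer fills, the sender stalls rather than overrunning the receiver.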