CS 775 - Parallel Algorithms in Scientific Computing


CS 775 - Parallel Algorithms in Scientific Computing
Parallel Architectures
January 2, 2004, Lecture 2

References
- Parallel Computer Architecture: A Hardware/Software Approach. Culler, Singh, and Gupta. Morgan Kaufmann.
- Introduction to Parallel Computing: Design and Analysis of Algorithms. Kumar, Grama, Gupta, and Karypis. Benjamin Cummings.

Performance goals
[figure slide: performance goals; content not transcribed]

Microprocessor performance
[figure slide: microprocessor performance; content not transcribed]

What is a Parallel Computer?
Almasi and Gottlieb (1989): a parallel computer is a "collection of processing elements that communicate and cooperate to solve large problems fast".

Why parallel architecture?
- It adds a new dimension to the design space: the number of processors.
- In principle, higher performance is achieved by using more processors.
- How much additional performance is gained, and at what additional cost, depends on several factors.

Questions
- How large is the collection?
- How powerful are the individual processing elements (PEs)?
- Can their number be increased in a straightforward manner?
- How do they communicate and cooperate?
- How is data transmitted between PEs? What interconnection topology is used?

Taxonomy of Parallel Architectures
I.   By control mechanism: instruction stream and data stream.
II.  By process granularity: coarse vs. fine grain.
III. By address space organization: shared vs. distributed memory.
IV.  By interconnection network: dynamic vs. static.

(I) Control Mechanism (Flynn's taxonomy)
- SISD: Single Instruction stream, Single Data stream; e.g. conventional sequential computers.
- SIMD: Single Instruction stream, Multiple Data streams.
- MIMD: Multiple Instruction streams, Multiple Data streams.
- MISD: Multiple Instruction streams, Single Data stream.

SIMD
- Multiple processing elements operate under the supervision of a single control unit.
- Examples: Thinking Machines CM-2, MasPar MP-2, Quadrics.
- SIMD extensions are also present in commercial microprocessors (MMX and Katmai/SSE in Intel x86, 3DNow! in AMD K6 and Athlon, AltiVec in Motorola G4).

MIMD
- Each processing element is capable of executing a different program, independently of the other processors.
- Most multiprocessors can be classified in this category.

(II) Process Granularity
- Coarse grain: a small number of very powerful processors (Cray C90, Fujitsu).
- Fine grain: a large number of relatively less powerful processors (CM-2, Quadrics).
- Medium grain: between the two extremes (IBM SP2, CM-5).
- Communication cost >> computation cost: coarse grain.
- Communication cost << computation cost: fine grain.

(III) Address Space Organization
- Single/shared address space:
  - Uniform Memory Access: SMP (UMA)
  - Non-Uniform Memory Access (NUMA)
- Message passing: distributed memory.

Shared Memory SIMD
- Vector processors.
- Some Cray machines.

SMP Architecture
[diagram: processors connected through a bus or crossbar switch to memory and I/O]
- SMP uses shared system resources (memory, I/O) that can be accessed equally from all the processors.
- Cache coherence is maintained.
- Expensive to build with many processors.
- Example: Compaq GS AlphaServers.
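The defining property of this shared-address-space model is that every processor can read and write the same memory. As a rough illustration, here is a minimal OpenMP sketch of that model; the array size and the computed values are illustrative assumptions, not taken from the slides:

```c
/* Shared-memory (SMP/UMA) sketch: all threads update one shared array.
 * Compile with: gcc -fopenmp smp_sketch.c -o smp_sketch
 * Array size and workload are illustrative assumptions. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];   /* lives in the single shared address space */
    double sum = 0.0;

    /* Each thread writes a disjoint slice of the same shared array... */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    /* ...and all threads read it back; coherence is the hardware's job. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f (expected %.0f)\n", sum, (double)N * (N - 1));
    return 0;
}
```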

NUMA Architecture
[diagram: memory modules and processors connected through a bus or crossbar switch]
- Shared address space.
- Memory latency varies depending on whether local or remote memory is accessed.
- Coherence (ccNUMA) is maintained using a hardware or software protocol.
- Can afford more processors than SMP.
- Examples: SGI Origin 2000/3000, Sun Ultra HPC servers.

Message-Passing SIMD
- Cambridge Parallel Processing Gamma II, Quadrics.

Message-Passing MIMD
[diagram: processor-memory pairs connected through a communication network]
- Local address space.
- No issue of cache coherence.
- Example: IBM SP.

(IV) Interconnection Networks
- Dynamic: switches and communication links; the communication links are connected to one another dynamically by the switches.
- Static: point-to-point communication links; message-passing computers.
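The message-passing MIMD model above is the one MPI programs target: each rank owns a private address space, and data moves only through explicit messages. A minimal sketch, not tied to any particular machine from the slides:

```c
/* Message-passing MIMD sketch: two ranks with private memories
 * exchange data explicitly. Compile with: mpicc msg_sketch.c
 * Run with: mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* exists only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d; no shared memory, no coherence issue\n", value);
    }

    MPI_Finalize();
    return 0;
}
```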

Dynamic Interconnections
- Crossbar switching: the most expensive and extensive interconnection.
  [diagram: processors P1, P2 and memories M1, M2 connected by a crossbar]
- Bus connected: processors are connected to memory through a common datapath.
- Multistage interconnection: butterfly, Omega network, perfect shuffle, etc.
  [diagram: butterfly network]

Static Interconnection
- Completely connected.
- Star connected.
- Linear array.
- Mesh: 2D/3D mesh, 2D/3D torus.
- Tree and fat-tree network.
- Hypercube network.

Characteristics of Static Networks
Diameter: the maximum distance between any two processors in the network.
- D = 1 for a complete connection
- D = N - 1 for a linear array
- D = floor(N/2) for a ring
- D = 2(sqrt(N) - 1) for a 2D mesh
- D = 2 floor(sqrt(N)/2) for a 2D torus
- D = log2(N) for a hypercube

Characteristics of Static Networks (cont.)
- Bisection width: the minimum number of communication links that have to be removed to partition the network into two halves.
- Channel rate: the peak rate at which a single wire can deliver bits.
- Channel bandwidth: the product of channel rate and channel width.
- Bisection bandwidth B: the product of bisection width and channel bandwidth.
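As a quick sanity check of these formulas, the sketch below evaluates each diameter for a given processor count N. It assumes a square 2D mesh/torus and a power-of-two N for the hypercube; N = 64 is an illustrative choice:

```c
/* Evaluates the diameter formulas from the slide for a given N.
 * Assumes N is a perfect square (mesh/torus) and a power of two
 * (hypercube). Compile with: gcc diam.c -lm */
#include <stdio.h>
#include <math.h>

int main(void) {
    int N = 64;                       /* illustrative processor count */
    int side = (int)sqrt((double)N); /* side of a square 2D mesh/torus */
    int d = (int)round(log2((double)N));

    printf("complete connection: D = 1\n");
    printf("linear array:        D = %d\n", N - 1);
    printf("ring:                D = %d\n", N / 2);
    printf("2D mesh:             D = %d\n", 2 * (side - 1));
    printf("2D torus:            D = %d\n", 2 * (side / 2));
    printf("hypercube:           D = %d\n", d);
    return 0;
}
```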

Linear Array, Ring, Mesh, Torus
- Processors are arranged as a d-dimensional grid or torus.

Tree, Fat Tree
- Tree network: there is only one path between any pair of processors.
- Fat-tree network: increases the number of communication links close to the root.

Hypercube
[diagram: 1-D, 2-D, and 3-D hypercubes]

Binary Reflected Gray Code
- G(i, d) denotes the i-th entry in a sequence of Gray codes of d bits.
- G(i, d+1) is derived from G(i, d) by reflecting the table and prefixing the reflected entries with 1 and the original entries with 0.
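A minimal sketch of G(i, d). Rather than building the reflected table explicitly, it uses the standard closed form i XOR (i >> 1), which produces exactly the reflect-and-prefix sequence described above:

```c
/* Binary reflected Gray code: G(i) = i ^ (i >> 1).
 * Prints the d-bit sequence; successive entries differ in one bit. */
#include <stdio.h>

unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void) {
    int d = 3;   /* number of bits (illustrative) */
    for (unsigned i = 0; i < (1u << d); i++) {
        printf("G(%u,%d) = ", i, d);
        for (int b = d - 1; b >= 0; b--)   /* print d bits, MSB first */
            putchar(((gray(i) >> b) & 1) ? '1' : '0');
        putchar('\n');
    }
    return 0;
}
```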

Example of BRG Code

1-bit   2-bit   3-bit   8p ring   8p hypercube
0       00      000     0         0
1       01      001     1         1
        11      011     2         3
        10      010     3         2
                110     4         6
                111     5         7
                101     6         5
                100     7         4

Topology Embedding
- Mapping a linear array onto a hypercube: a linear array (or ring) of 2^d processors can be embedded into a d-dimensional hypercube by mapping processor i onto processor G(i, d) of the hypercube.
- Mapping a 2^r x 2^s mesh onto a hypercube: processor (i, j) maps to G(i, r) || G(j, s), where || denotes concatenation.
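A sketch of both embeddings, reusing G(i) = i XOR (i >> 1): ring position i lands on hypercube node G(i, d), and mesh node (i, j) on the concatenation of G(i, r) and G(j, s), implemented here as a shift plus OR. The sizes d, r, s are illustrative:

```c
/* Embedding a ring and a 2^r x 2^s mesh into a hypercube via Gray codes. */
#include <stdio.h>

unsigned gray(unsigned i) { return i ^ (i >> 1); }

/* Mesh node (i, j) -> hypercube node G(i,r) || G(j,s):
 * the row code occupies the high r bits, the column code the low s bits. */
unsigned mesh_to_cube(unsigned i, unsigned j, int r, int s) {
    (void)r;   /* r fixes the width of the high field; not needed numerically */
    return (gray(i) << s) | gray(j);
}

int main(void) {
    int d = 3, r = 2, s = 1;   /* 8-node ring; 4 x 2 mesh (illustrative) */

    for (unsigned i = 0; i < (1u << d); i++)
        printf("ring %u -> cube node %u\n", i, gray(i));

    for (unsigned i = 0; i < (1u << r); i++)
        for (unsigned j = 0; j < (1u << s); j++)
            printf("mesh (%u,%u) -> cube node %u\n", i, j,
                   mesh_to_cube(i, j, r, s));
    return 0;
}
```

Because consecutive Gray codes differ in one bit, neighbors in the ring or mesh map to hypercube nodes that are directly connected.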

Trade-off Among Different Networks

Network               Min latency   Max BW per proc   Wires        Switches     Example
Completely connected  constant      constant          O(p^2)       -            -
Crossbar              constant      constant          O(p)         O(p^2)       Cray
Bus                   constant      O(1/p)            O(p)         O(p)         SGI Challenge
Mesh                  O(sqrt p)     constant          O(p)         -            Intel ASCI Red
Hypercube             O(log p)      constant          O(p log p)   -            SGI Origin
Switched              O(log p)      constant          O(p log p)   O(p log p)   IBM SP-2

Beowulf
- Cluster built with commodity hardware components: PC hardware (x86, Alpha, PowerPC).
- Commercial high-speed interconnect (100Base-T, Gigabit Ethernet, Myrinet, SCI).
- Linux or FreeBSD operating system.
- http://www.beowulf.org

Clusters of SMP
- The next generation of supercomputers will have thousands of SMP nodes connected.
- Increase the computational power of the single node; keep the number of nodes low.
- A new programming approach is needed: MPI + threads (OpenMP, Pthreads, ...).
- Examples: ASCI White, Compaq SC, IBM SP3.
- http://www.llnl.gov/asci
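As a rough illustration of the MPI + threads style mentioned above, here is a minimal hybrid sketch: MPI carries data between SMP nodes while OpenMP threads share memory within a node. Process and thread counts are whatever the runtime provides; nothing here is specific to the machines named on the slide:

```c
/* Hybrid MPI + OpenMP sketch: one MPI rank per SMP node,
 * OpenMP threads inside the node.
 * Compile with: mpicc -fopenmp hybrid.c */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* Request an MPI library that tolerates threads
     * (only the main thread makes MPI calls here). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Threads share the node's memory; ranks communicate by messages. */
    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```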