EE/CSCI 451: Parallel and Distributed Computation


EE/CSCI 451: Parallel and Distributed Computation
Lecture #8, 2/7/2017
Xuehai Qian
Xuehai.qian@usc.edu
http://alchem.usc.edu/portal/xuehaiq.html
University of Southern California

Outline
From last class:
- A simple model of shared memory parallel computation
- Example shared memory programs
- OpenMP
Today (Chapters 2.5, 2.6):
- Communication cost in parallel machines
- Message passing machines
- Routing mechanisms
- Cut through routing
- LogP model
- Routing in interconnection networks

Models
- Parallel programming model
- Parallel computation model

Communication Cost in Message Passing Machines (1)
Message from source (s) to destination (d).
t_s (startup time): time spent handling the message at the source and the destination; incurred once per message transfer.
- Prepare the message (packet: header, data, error correction information)
- Hand the message to the router
- Execute the routing algorithm (at the source node)
- Processing at the receiving end
[Figure: a message traveling through the network from source node s to destination node d.]

Communication Cost in Message Passing Machines (2)
t_h (per-hop time): time for the header of the message to travel from one node to the next node along the path.
- Routing algorithm execution
- Buffer management
t_w (per-word transfer time): incurred by every data word transferred between adjacent nodes. If the channel bandwidth is r words per second, then t_w = 1/r.

Routing Mechanisms (1)
1. Store and Forward Routing
Message length = m words; number of hops = l.
- Store: each intermediate node receives the entire message from its predecessor.
- Forward: it then forwards the message to the next node.
Total time for communication: t_comm = t_s + (t_h + m·t_w)·l
[Figure: time-space diagram of a message traversing P0 -> P1 -> P2 -> P3 (l = 3); each hop takes t_h + m·t_w.]
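As a quick numeric check of the store-and-forward formula, here is a minimal Python sketch; the parameter values are illustrative, not from the slides:

```python
def store_and_forward_time(t_s, t_h, t_w, m, l):
    """t_comm = t_s + (t_h + m*t_w) * l: the entire m-word message
    is received and retransmitted at each of the l hops."""
    return t_s + (t_h + m * t_w) * l

# Illustrative values (microseconds): t_s = 50, t_h = 1,
# t_w = 0.1 per word, m = 1000 words, l = 3 hops.
print(store_and_forward_time(50, 1, 0.1, 1000, 3))  # 353.0
```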

Routing Mechanisms (2)
2. Cut Through Routing
Objective: reduce the overall time for communication.
Some ideas to improve on store and forward routing:
- Reduce the packet header.
- Use fixed-size messages.
- Have all packets follow the same route (reduces routing complexity).
- Use simple error detection/correction (elaborate error handling is needed mainly in large-scale distributed systems).

Pipelined Message Routing
[Figure: three time-space diagrams over nodes P0-P3 comparing (a) a single message sent store-and-forward, (b) the same message split into two parts sent cut-through, and (c) the same message split into four parts sent cut-through; splitting the message into more parts increases pipelining overlap.]

Cut Through Routing (1)
- The message is divided into fixed-size units called flow control digits (flits).
- A connection is established between source and destination.
- Data transfer is pipelined: each intermediate node receives a flit (not the entire message) and forwards it along the same path.
- No need for large buffer space and buffer management at each node.
- No overhead of (variable-size) packet headers, since flits are small and fixed-size.

Cut Through Routing (2)
What should the flit size f be?
For pipelined data delivery we need t_1 <= t_2, where
- t_1 = node processing time per flit
- t_2 = communication link time per flit (note: t_2 is proportional to f)
f should not be too small: t_1 cannot be made very small (it would require high-speed control circuitry).
f should not be too large: large flits require large buffers.
Typical choice: f = 4 to 32 bytes.
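The flit-size trade-off can be made concrete with a small pipeline calculation. This sketch views cut-through delivery as a pipeline with l + (number of flits) - 1 stages, each taking max(t_1, t_2); the formula and all numbers are illustrative assumptions, not the slides' model:

```python
from math import ceil

def pipelined_transfer_time(m_bytes, f, l, link_bw, t1):
    """Cut-through delivery time as a function of flit size f,
    modeled as a (l + nflits - 1)-stage pipeline whose per-stage
    time is the slower of node processing (t1) and per-flit link
    time (t2 = f / link_bw)."""
    nflits = ceil(m_bytes / f)
    t2 = f / link_bw
    return (l + nflits - 1) * max(t1, t2)

# Illustrative: 1 KB message, l = 3 hops, 1 byte/ns link, t1 = 8 ns.
for f in (4, 8, 16, 32, 64):
    print(f, pipelined_transfer_time(1024, f, 3, 1.0, 8.0))
# f = 4 is node-bound (t1 > t2); past f = 8, larger flits only
# lengthen each stage, so f near the t1 = t2 point wins.
```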

Cut Through Routing (3)
Pipelined data communication over links 1, 2, ..., l:
t_comm = t_s + l·t_h + t_w·m
where l·t_h is the pipeline (per-hop) delay and t_w·m is the communication time on a link.
Note: there is a small overhead for processing a flit at each intermediate node, but it is smaller than processing a whole packet, so t_h is smaller than in the store and forward scenario.
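For comparison, a minimal sketch of the cut-through cost model next to store-and-forward, reusing the illustrative values from the earlier sketch:

```python
def cut_through_time(t_s, t_h, t_w, m, l):
    """t_comm = t_s + l*t_h + t_w*m: the per-hop cost is paid
    only by the header; the m*t_w transmission term is paid once,
    not once per hop, because flits are pipelined."""
    return t_s + l * t_h + t_w * m

t_s, t_h, t_w, m, l = 50, 1, 0.1, 1000, 3
saf = t_s + (t_h + m * t_w) * l             # 353.0
ct = cut_through_time(t_s, t_h, t_w, m, l)  # 153.0
print(saf, ct)
```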

Cut Through Routing (4)
Note 1: typical magnitudes of t_s, t_h, t_w.
- t_s is determined by the hardware and software layers (OS, buffer management) and the message semantics; t_s ~ 10s to 100s of microseconds.
- t_h ~ 10s of microseconds.
- t_w ~ 10^-10 sec/byte (for a 10 GB/sec link).

Cut Through Routing (5)
Note 2: latency vs. throughput.
- For short messages (e.g., control messages on each link), latency is important.
- For long messages (data communication), effective throughput is important.
Virtual channels:
- In cut-through routing, a long message may hold up links, delaying (short) messages behind it.
- Idea: share a physical channel among several messages by multiplexing it into virtual channels.
- Deadlock may occur.

Cut Through Routing (6)
Communication cost model: t_comm = t_s + l·t_h + t_w·m
Cut-through routing is widely used.
Some optimizations the application developer may apply:
- Communicate in bulk: aggregate short messages into one long message to amortize the startup latency t_s.
- Reduce the total data volume communicated.
- Minimize the distance between source and destination (the number of hops l).

Cut Through Routing (7)
Minimizing l (the number of hops) is hard:
- A program has little control over the program-to-processor mapping.
- Machines (platforms) use various routing techniques to minimize congestion (e.g., randomized routing).
- The per-hop time is usually small compared with t_s and t_w·m for large m.
This leads to a simple communication cost model:
t_comm = t_s + t_w·m
- It charges the same time to communicate between any two nodes (as if fully connected).
- Use this cost model to design and optimize algorithms independently of the specific parallel architecture (mesh, hypercube), rather than writing architecture-specific algorithms.
- The simplified cost model does not take congestion into account.
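A minimal sketch of the simplified model, which also makes the previous slide's "communicate in bulk" advice concrete; the message count k and all values are illustrative assumptions:

```python
def simple_comm_time(t_s, t_w, m):
    """Simplified cost model: t_comm = t_s + t_w * m
    (hop count and congestion ignored)."""
    return t_s + t_w * m

# Sending k = 16 separate 64-word messages vs. one 1024-word
# message: aggregation pays the startup cost t_s only once.
t_s, t_w, k, m = 50.0, 0.1, 16, 64
print(k * simple_comm_time(t_s, t_w, m))  # 902.4
print(simple_comm_time(t_s, t_w, k * m))  # 152.4
```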

Cut Through Routing (8)
Effective bandwidth (at the application layer) depends on:
- The parallel architecture
- The communication pattern
- The routing algorithm
- The communication schedule
- The resulting congestion

LogP Model (1)
Parallel architecture = processor + memory + interconnection network.
This organization characterizes most massively parallel processors (MPPs). (From the LogP paper.)

LogP Model (2)
Paper: David E. Culler, Richard Karp, David A. Patterson, et al., "LogP: Towards a Realistic Model of Parallel Computation," Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '93), New York, NY, USA, 1993, pp. 1-12. EECS Department, University of California, Berkeley.
Link: http://dl.acm.org/citation.cfm?id=155333

LogP Model (3)
[Figure: P processor/memory modules (each a processor P with memory M, with processor-memory latency and throughput) connected by a communication network; annotated with o (overhead) at each processor, g (gap) between injections, and L (latency) across the network.]

LogP Model (4)
- L (latency): an upper bound on the latency, or delay, incurred in communicating a message containing a word from its source module to its target module.
- o (overhead): the length of time that a processor is engaged in the transmission or reception of each message.
- g (gap): the minimum time interval between consecutive message transmissions or receptions at a processor.
- P: the number of processor/memory modules.

LogP Model (5)
Unit of time: 1 unit = 1 processor cycle; L, o, and g are all expressed as multiples of this unit.
For example:
- Processor cycle = 0.5 nsec (= 1 unit)
- L = 1000 cycles (depends on the communication type: blocking, buffering)
- o = 100 cycles
- g = 1 cycle
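With these parameters, basic LogP calculations look like the following sketch. The rule for n pipelined messages (one injection every max(g, o) cycles) is one common reading of the model, stated here as an assumption:

```python
def logp_one_message(L, o):
    """End-to-end time for one small message:
    send overhead + network latency + receive overhead."""
    return 2 * o + L

def logp_n_messages(L, o, g, n):
    """Time until the n-th of n pipelined messages is received,
    assuming the sender injects one every max(g, o) cycles."""
    return 2 * o + L + (n - 1) * max(g, o)

# Slide's example parameters (cycles): L = 1000, o = 100, g = 1.
print(logp_one_message(1000, 100))        # 1200
print(logp_n_messages(1000, 100, 1, 50))  # 1200 + 49*100 = 6100
```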

LogP Model (6), Example (1)
Matrix multiplication (C = A·B) on a √p × √p 2-D mesh.
- Matrix size = N × N.
- A, B, C are partitioned into √p × √p blocks, each of size (N/√p) × (N/√p).
- P(i,j) owns A(i,j) and B(i,j) in local memory (1 <= i, j <= √p).
- P(i,j) is responsible for computing C(i,j).
- Computation time is negligible.
[Figure: an N × N matrix divided into blocks (1,1) ... (√p,√p), each of side N/√p; processors P(1,1) ... P(√p,√p), each with local memory M, connected by a communication network.]

Routing in Interconnection Networks (1)
Routing mechanism: determines the path a message takes from its source to its destination.
- Based on the source node, the destination node, and the state of the network (for example, the level of congestion).
- Minimal routing: take a shortest path from source to destination. Note: minimal routing may result in congestion when several messages are routed concurrently (e.g., matrix transpose on a 2-D mesh).
- Adaptive routing: routing based on network state, e.g., to avoid or minimize network congestion.

Routing in Interconnection Networks (2)
Example: dimension-ordered routing, i.e., X-Y routing on a 2-D mesh (without wraparound).
- Each message travels along the x axis to its destination column, then along that column (y axis) to its final destination.
- This is minimal routing.
Example: matrix transpose on an n × n mesh. Assume one communication step along the x axis, then communication along the y axis.
- The i-th row's data all passes through PE(i, i), so PE(i, i) needs Θ(n) storage.
[Figure: a small mesh with messages labeled 1-4 showing transpose traffic funneling through the diagonal PEs.]
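A minimal sketch of X-Y routing as a path computation on a 2-D mesh; the coordinates and helper function are illustrative, not from the slides:

```python
def xy_route(src, dst):
    """Dimension-ordered (X-Y) routing without wraparound:
    move along x to the destination column, then along y to
    the destination row. The resulting path is always minimal."""
    (sx, sy), (dx, dy) = src, dst
    x, y = sx, sy
    path = [(x, y)]
    while x != dx:                      # x phase
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # y phase
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 3)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```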

Communication Cost in Shared Address Space Machines (1)
- Cache-coherent multiprocessors.
- Hardware mechanisms support access to variables across processors.
- The user does not explicitly manage data placement or communication.
- Compilation and run-time support provide optimization.
[Figure: a node with processor p, cache c, local memory m, and access latency L.]

Communication Cost in Shared Address Space Machines (2)
Communication cost is hard to model:
- The data layout is determined by the system (compiler, run-time system), and access to local data is much cheaper than access to remote data.
- Cache behavior: the data used by a process can be much larger than the cache size.
- The cache coherence protocol.
- Compiler optimizations, e.g., prefetching.
- Run-time optimizations, e.g., process migration and data migration.

Summary
- Communication cost in message passing machines: store and forward, cut through.
- Cost models: t_comm = t_s + l·t_h + t_w·m and the simplified t_comm = t_s + t_w·m.
- LogP parallel machine model.
- Routing in interconnection networks.

Backup Slides


LogP Model (6), Example (2) (setup as in Example (1) above)
- Each processor needs to receive 2(√p − 1) · N²/p messages from other processors.
- Each processor needs to send 2(√p − 1) · N²/p messages to other processors.
- Total execution time = 2o + L + 4(√p − 1) · (N²/p) · max(o, g)
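A sketch that evaluates the total execution time formula; the parameter values are illustrative assumptions:

```python
from math import sqrt

def matmul_comm_time(L, o, g, N, p):
    """Total communication time for the block matrix-multiply
    example: each processor sends and receives
    2*(sqrt(p) - 1) * N^2 / p one-word messages, giving
    T = 2o + L + 4*(sqrt(p) - 1)*(N^2 / p)*max(o, g)."""
    n_msgs = 2 * (sqrt(p) - 1) * N * N / p
    return 2 * o + L + 2 * n_msgs * max(o, g)

# Illustrative parameters: L = 1000, o = 100, g = 1 (cycles),
# N = 1024, p = 16 processors.
print(matmul_comm_time(1000, 100, 1, 1024, 16))  # ~7.9e7 cycles
```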