High Performance Computing: Programming Paradigms and Scalability. Part 2: High-Performance Networks


Overview
  some definitions
  static network topologies
  dynamic network topologies
  examples

PD Dr. rer. nat. habil. Ralf-Peter Mundani
Computation in Engineering (CiE), Scientific Computing (SCCS)
Summer Term

"640k is enough for anyone, and by the way, what's a network?"
William Gates III, chairman Microsoft Corp., 1980s

Some definitions

reminder: protocols
  3-component model: application, communication system, network
  ISO/OSI model: application layer, presentation layer, session layer, transport layer, network layer, data link layer (logical link control, medium access control), physical layer
  internet protocols (examples): data transfer, email; TCP, UDP; IP, ICMP, IGMP; network adaptation

degree (node degree)
  number of connections (incoming and outgoing) between this node and other nodes

degree of a network
  max. degree of all nodes in the network
  higher degrees lead to
    more parallelism and bandwidth for the communication
    more costs (due to a higher number of connections)
  objective: keep degree and, thus, costs small

[figure: two example networks annotated with their degree]

Some definitions

diameter
  distance of a pair of nodes: length of the shortest path between the pair, i.e. the number of nodes a message has to pass on its way from the sender to the receiver

diameter of a network
  max. distance of all pairs of nodes in the network
  higher diameters (between two nodes) lead to
    longer communications
    less fault tolerance (due to the higher number of nodes that have to work properly)
  objective: small diameter

connectivity
  min. number of edges (cables) that have to be removed to disconnect the network, i.e. the network falls apart into two loose sub-networks
  higher connectivity leads to
    more independent paths between two nodes
    better fault tolerance (due to more routing possibilities)
    faster communication (due to the avoidance of congestion in the network)
  objective: high connectivity

[figure: example network annotated with its connectivity and diameter]

bisection width
  min. number of edges (cables) that have to be removed to separate the network into two equal parts (bisection width $\geq$ connectivity, see below)
  important for determining the number of messages that can be transmitted in parallel from one half of the nodes to the other half without the repeated usage of any connection
  extreme case: Ethernet with bisection width 1
  objective: high bisection width (ideal: amount of nodes)

[figure: example network annotated with its bisection width and connectivity]

blocking
  a desired connection between two nodes cannot be established due to already existing connections between other pairs of nodes
  objective: non-blocking networks

fault tolerance (redundancy)
  connections between (arbitrary) nodes can still be established even under the breakdown of single components
  a fault-tolerant network has to provide at least one redundant path between all arbitrary pairs of nodes
  graceful degradation: the ability of a system to stay functional (maybe with less performance) even under the breakdown of single components

Some definitions

bandwidth
  max. transmission performance of a network for a certain amount of time
  bandwidth B is in general measured as megabits or megabytes per second (Mbps or MBps, resp.), nowadays more often as gigabits or gigabytes per second (Gbps or GBps, resp.)

bisection bandwidth
  max. transmission performance of a network over the bisection line, i.e. the sum of the single bandwidths of all edges (cables) that are cut when bisecting the network
  thus, bisection bandwidth is a measure of bottleneck bandwidth
  units are the same as for bandwidth

Static network topologies

to be distinguished
  static networks
    fixed connections between pairs of nodes
    control functions are done by the nodes or by special connection hardware
  dynamic networks
    no fixed connections between pairs of nodes
    all nodes are connected via inputs and outputs to a so-called switching component
    control functions are concentrated in the switching component
    various routes can be switched

chain (linear array)
  one-dimensional network
  N nodes and $N-1$ edges
  degree = 2
  diameter = $N-1$
  bisection width = 1
  drawback: too slow for large N

ring
  two-dimensional network
  N nodes and N edges
  degree = 2
  diameter = $\lfloor N/2 \rfloor$
  bisection width = 2
  drawback: too slow for large N
  how about fault tolerance?

chordal ring
  two-dimensional network
  N nodes and $3N/2, 4N/2, 5N/2, \ldots$ edges
  degree = 3, 4, 5, ...
  higher degrees lead to
    smaller diameters
    higher fault tolerance (due to redundant connections)
  drawback: higher costs

[figure: chordal rings of two different degrees (left and right)]

completely connected
  two-dimensional network
  N nodes and $N(N-1)/2$ edges
  degree = $N-1$
  diameter = 1
  bisection width = $\lfloor N/2 \rfloor \cdot \lceil N/2 \rceil$
  very high fault tolerance
  drawback: too expensive for large N

star
  two-dimensional network
  N nodes and $N-1$ edges
  degree = $N-1$
  diameter = 2
  bisection width = $\lfloor N/2 \rfloor$
  drawback: bottleneck in central node
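The figures quoted for chain, ring, star and completely connected networks can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper names (`chain`, `ring`, `star`, `complete`, `degree`, `edge_count`, `diameter`) are hypothetical, networks are modelled as adjacency sets, and the diameter is counted in hops using breadth-first search.

```python
from collections import deque

# Each builder returns {node: set of neighbours}; names are illustrative only.
def chain(n):
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}

def ring(n):  # assumes n >= 3
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

def star(n):  # node 0 is the central node
    adj = {i: {0} for i in range(1, n)}
    adj[0] = set(range(1, n))
    return adj

def complete(n):
    return {i: set(range(n)) - {i} for i in range(n)}

def degree(adj):
    """Degree of the network: maximum node degree."""
    return max(len(nb) for nb in adj.values())

def edge_count(adj):
    """Every undirected edge appears in two adjacency sets."""
    return sum(len(nb) for nb in adj.values()) // 2

def diameter(adj):
    """Largest shortest-path length (in hops) over all node pairs, via BFS."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

if __name__ == "__main__":
    for name, build in [("chain", chain), ("ring", ring),
                        ("star", star), ("complete", complete)]:
        net = build(8)
        print(name, edge_count(net), degree(net), diameter(net))
```

For N = 8 this reproduces the closed forms: the chain has 7 edges and diameter 7, the ring 8 edges and diameter 4, the star diameter 2, and the completely connected network 28 edges and diameter 1.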

binary tree
  two-dimensional network
  N nodes and $N-1$ edges (tree height $h = \lfloor \mathrm{ld}\, N \rfloor$, with ld the logarithm to base 2)
  degree = 3
  diameter = $2h$
  bisection width = 1
  drawback: bottleneck in direction of root (blocking)

binary tree (cont'd)
  addressing
    label on level m consists of m bits; root has label 1
    suffix 0 is added for the left son, suffix 1 is added for the right son
  routing
    find common parent node P of nodes S and D
    ascend from S to P
    descend from P to D

[figure: binary tree with source S, destination D and their common parent P]

binary tree (cont'd)
  solution to overcome the bottleneck: fat tree
  edges on level m get higher priority than edges on level m+1
  capacity is doubled on each higher level
  now, the bisection width grows exponentially with the tree height h
  frequently used: HLRB II, e.g.

mesh
  k-dimensional network
  N nodes and $k\,(N - r^{k-1})$ edges ($r = \sqrt[k]{N}$)
  degree = $2k$
  diameter = $k\,(r-1)$
  bisection width = $r^{k-1}$
  high fault tolerance
  drawback
    large diameter
    too expensive for large k
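The binary-tree routing rule above (find the common parent P, ascend from S to P, descend from P to D) maps directly onto the bit labels: P is the longest common prefix of the two labels. The sketch below is illustrative only; the function name `tree_route` is hypothetical, and it assumes labels are bit strings in which the root is "1" and a son's label is its parent's label plus one suffix bit.

```python
def tree_route(src: str, dst: str) -> list[str]:
    """Return the node labels visited on the way from src to dst.

    Assumption: labels are bit strings, the root is "1", and every son
    extends its parent's label by one suffix bit.
    """
    # The common parent P is the longest common prefix of both labels.
    p = 0
    while p < min(len(src), len(dst)) and src[p] == dst[p]:
        p += 1
    parent = src[:p]
    # Ascend: strip one suffix bit per hop until P is reached.
    up = [src[:i] for i in range(len(src), p, -1)]
    # Descend: add the destination's bits back one hop at a time.
    down = [dst[:i] for i in range(p + 1, len(dst) + 1)]
    return up + [parent] + down


if __name__ == "__main__":
    print(tree_route("100", "111"))  # leaf to leaf via the root
```

The number of hops is the length of the returned list minus one, so two leaves in different halves of the tree are separated by twice the tree height, and every such route crosses the root.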

torus
  k-dimensional mesh with cyclic connections in each dimension
  N nodes and $k \cdot N$ edges ($r = \sqrt[k]{N}$)
  diameter = $k \lfloor r/2 \rfloor$
  bisection width = $2\,r^{k-1}$
  frequently used: BlueGene/L, e.g.
  drawback: too expensive for large k

ILLIAC mesh
  two-dimensional network
  N nodes and $2N$ edges ($r \times r$ mesh, $r = \sqrt{N}$)
  degree = 4
  diameter = $r-1$
  bisection width = $2r$
  conforms to a chordal ring of degree 4

hypercube
  k-dimensional network
  $2^k$ nodes and $k \cdot 2^{k-1}$ edges
  degree = k
  diameter = k
  bisection width = $2^{k-1}$
  drawback: scalability (only doubling of nodes allowed)

hypercube (cont'd)
  principle design
    construction of a k-dimensional hypercube via connection of the corresponding nodes of two (k-1)-dimensional hypercubes
    inherent labelling via adding prefix 0 to one sub-cube and prefix 1 to the other sub-cube

[figure: hypercubes of increasing dimension]

hypercube (cont'd)
  nodes are directly connected for a HAMMING distance of 1 only
  routing
    compute $S \oplus D$ (xor) for possible ways between nodes S and D
    route frames in increasing or decreasing bit order until the final destination is reached

[figure: example route from S to D, once in decreasing and once in increasing order]

Dynamic network topologies

bus
  simple and cheap single stage network
  shared usage from all connected nodes, thus, just one frame transfer at any point in time
  frame transfer in one step (i.e. diameter = 1)
  good extensibility, but bad scalability
  fault tolerance only for multiple bus systems
  example: Ethernet

[figure: single bus and multiple bus (here dual)]

crossbar
  completely connected network with all possible permutations of N inputs and N outputs (in general $N \times M$ inputs/outputs)
  switch elements allow simultaneous communication between all possible disjoint pairs of inputs and outputs without blocking
  very fast (diameter = 1), but expensive due to $N^2$ switch elements
  used for processor/processor and processor/memory coupling
  example: The Earth Simulator

[figure: crossbar with inputs, outputs and switch elements]
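The hypercube routing rule described earlier in this part (compute S xor D, then correct the differing address bits one dimension at a time) can be sketched as follows. This is an illustration under assumptions, not the slides' own code: the function name `hypercube_route` is hypothetical, node labels are integers, and "increasing" is taken to mean starting with the lowest-order bit.

```python
def hypercube_route(src: int, dst: int, k: int, increasing: bool = True) -> list[int]:
    """Nodes visited when routing from src to dst in a k-dimensional hypercube."""
    diff = src ^ dst  # set bits mark the dimensions that still have to be crossed
    order = range(k) if increasing else range(k - 1, -1, -1)
    node = src
    path = [src]
    for bit in order:
        if diff & (1 << bit):
            node ^= 1 << bit  # one hop: neighbours differ in exactly one bit
            path.append(node)
    return path


if __name__ == "__main__":
    s, d = 0b010, 0b101
    print([format(n, "03b") for n in hypercube_route(s, d, 3, increasing=True)])
    print([format(n, "03b") for n in hypercube_route(s, d, 3, increasing=False)])
```

Both orders need exactly as many hops as the HAMMING distance between S and D, which is why the diameter of a k-dimensional hypercube is k.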

permutation networks
  tradeoff between low performance of buses and high hardware costs of crossbars
  often a $2 \times 2$ crossbar as basic element
  N inputs can simultaneously be switched to N outputs: permutation of inputs (to outputs)
  single stage: consists of one column of $2 \times 2$ switch elements
  multistage: consists of several of those columns
  switch states: straight, crossed, upper broadcast, lower broadcast

permutation networks (cont'd)
  permutations: unique (bijective) mapping of inputs to outputs
  addressing
    label inputs from 0 to $2N-1$ (in case of N switch elements)
    write labels in binary representation $(a_K, a_{K-1}, \ldots, a_2, a_1)$
  permutations can now be expressed as simple bit manipulation
  typical permutations
    perfect shuffle
    butterfly
    exchange

permutation networks (cont'd)
  perfect shuffle permutation
    cyclic left shift
    $P(a_K, a_{K-1}, \ldots, a_2, a_1) = (a_{K-1}, \ldots, a_2, a_1, a_K)$

permutation networks (cont'd)
  butterfly permutation
    exchange of first (highest) and last (lowest) bit
    $B(a_K, a_{K-1}, \ldots, a_2, a_1) = (a_1, a_{K-1}, \ldots, a_2, a_K)$

permutation networks (cont'd)
  exchange permutation
    negation of last (lowest) bit
    $E(a_K, a_{K-1}, \ldots, a_2, a_1) = (a_K, a_{K-1}, \ldots, a_2, \bar{a}_1)$

permutation networks (cont'd)
  example: perfect shuffle connection pattern
  problem: not all destinations are accessible from a source

permutation networks (cont'd)
  adding additional exchange permutations (shuffle-exchange)
  all destinations are now accessible from any source

omega
  based on the shuffle-exchange connection pattern
  exchange permutations replaced by $2 \times 2$ switch elements
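The perfect shuffle, butterfly and exchange permutations above are plain bit manipulations on a K-bit label, which the following sketch makes concrete. The function names are hypothetical, and labels are held as Python integers whose highest-order bit plays the role of the first address bit. The helper `reachable` illustrates why the shuffle alone does not reach every destination, while shuffle plus exchange does.

```python
def perfect_shuffle(a: int, K: int) -> int:
    """Cyclic left shift of the K-bit label a."""
    msb = (a >> (K - 1)) & 1
    return ((a << 1) & ((1 << K) - 1)) | msb

def butterfly(a: int, K: int) -> int:
    """Swap the highest and the lowest bit of the K-bit label a."""
    hi = (a >> (K - 1)) & 1
    lo = a & 1
    if hi != lo:
        a ^= (1 << (K - 1)) | 1  # flipping both bits swaps them
    return a

def exchange(a: int) -> int:
    """Negate the lowest bit."""
    return a ^ 1

def reachable(start: int, moves) -> set[int]:
    """All labels reachable from start by repeatedly applying the given moves."""
    seen = {start}
    frontier = [start]
    while frontier:
        a = frontier.pop()
        for move in moves:
            b = move(a)
            if b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen


if __name__ == "__main__":
    K = 3
    print([perfect_shuffle(a, K) for a in range(8)])
    print(sorted(reachable(1, [lambda a: perfect_shuffle(a, K)])))
    print(sorted(reachable(1, [lambda a: perfect_shuffle(a, K), exchange])))
```

With K = 3, repeated shuffling from label 1 only cycles through 1, 2 and 4; adding the exchange permutation makes all eight labels reachable.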

omega (cont'd)
  multistage network with N nodes and $E = (N/2)\,\mathrm{ld}\, N$ switch elements
  diameter = $\mathrm{ld}\, N$ (all stages have to be passed)
  $N!$ permutations possible, but only $2^E$ different switch states (self configuring)
  routing
    compare addresses from S and D bitwise from left to right, i.e. stage i evaluates address bits $s_i$ and $d_i$
    if equal switch straight (0), otherwise switch crossed (1)

[figure: example route from S to D with the resulting switch states]

omega (cont'd)
  omega is a bidelta network: operates also backwards
  drawback: blocking possible

banyan / butterfly
  idea: unrolling of a static hypercube
  bitwise processing of address bits $a_i$ from left to right
  dynamic hypercube a.k.a. butterfly (known from FFT flow diagram)

banyan / butterfly (cont'd)
  replace crossed connections by $2 \times 2$ switch elements
  introduced by GOKE and LIPOVSKI in 1973; blocking still possible

[figure: banyan tree]
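The omega routing rule given earlier in this part (stage i compares address bits s_i and d_i; straight if equal, crossed otherwise) can be simulated by tracking the position of a frame through the shuffle-exchange stages. The sketch below is illustrative: `omega_route` is a hypothetical name, and it assumes each stage consists of a perfect shuffle followed by one column of 2x2 switches, where a crossed switch flips the lowest position bit.

```python
def omega_route(src: int, dst: int, n_bits: int) -> tuple[list[int], int]:
    """Return (switch states per stage, final output position).

    State 0 means straight, state 1 means crossed.
    """
    mask = (1 << n_bits) - 1
    pos = src
    states = []
    for i in range(n_bits):
        # Perfect shuffle wiring in front of stage i (cyclic left shift).
        pos = ((pos << 1) & mask) | (pos >> (n_bits - 1))
        # After the shuffle, the lowest bit is the i-th source bit from the left.
        s_bit = pos & 1
        d_bit = (dst >> (n_bits - 1 - i)) & 1
        state = s_bit ^ d_bit  # equal bits: straight, different bits: crossed
        states.append(state)
        pos ^= state  # a crossed switch moves the frame to the other output
    return states, pos


if __name__ == "__main__":
    states, out = omega_route(0b010, 0b110, 3)
    print(states, format(out, "03b"))
```

The list of switch states is simply the bit pattern of S xor D read from left to right, and the frame always arrives at D, which is why the network is called self configuring.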

BENEŠ
  multistage network with N nodes and $N\,(\mathrm{ld}\, N) - N/2$ switch elements
  butterfly merged at the last column with its copied mirror
  diameter = $2\,(\mathrm{ld}\, N) - 1$
  $N!$ permutations possible, all can be switched
  key property: for any permutation of inputs to outputs there is a contention-free routing

BENEŠ (cont'd)
  [figure: two example connections S to D that block each other in a butterfly]

BENEŠ (cont'd)
  [figure: the same two connections S to D, no blocking for BENEŠ]

CLOS
  proposed by CLOS in 1953 for telephone switching systems
  objective: overcome the costs of crossbars ($N^2$ switch elements)
  idea: replace the entire crossbar with three stages of smaller ones
    ingress stage: R crossbars with $N \times M$ inputs/outputs
    middle stage: M crossbars with $R \times R$ inputs/outputs
    egress stage: R crossbars with $M \times N$ inputs/outputs
  thus much fewer switch elements than for the entire system
  any incoming frame is routed from the input via one of the middle stage crossbars to the respective output
  a middle stage crossbar is available if both links to the ingress and egress stage are free

CLOS (cont'd)
  $R \cdot N$ inputs can be assigned to $R \cdot N$ outputs

[figure: three-stage CLOS network with ingress, middle and egress crossbars]

CLOS (cont'd)
  relative values of M and N define the blocking characteristics
  $M \geq N$: rearrangeable non-blocking
    a free input can always be connected to a free output
    existing connections might be assigned to different middle stage crossbars (rearrangement)
  $M \geq 2N-1$: strict-sense non-blocking
    a free input can always be connected to a free output
    no re-assignment necessary

reminder: bipartite graph
  definition: a graph whose vertices can be divided into two disjoint sets U and V such that every edge connects a vertex in U to one in V; that is, U and V are each independent sets
  division of vertices into U and V, i.e. there are no edges within U and V, only between U and V

reminder: perfect matching
  definition: a perfect matching (a.k.a. 1-factor) is a matching that matches all vertices of a graph, i.e. every vertex is incident to exactly one edge of the matching
  problem: perfect matching for bipartite graph to be found

[figure: bipartite graph with U = {Alice, Bob, Carol} and V = {nurse, pilot, lawyer}]

CLOS (cont'd)
  proof for $M \geq N$ via HALL's Marriage Theorem (1)
  Let $G = (V_{IN}, V_{OUT}, E)$ be a bipartite graph. A perfect matching for G is an injective function $f : V_{IN} \to V_{OUT}$ so that for every $x \in V_{IN}$ there is an edge in E whose endpoints are x and f(x).
  One would expect a perfect matching to exist if G contains enough edges, i.e. if for every subset $A \subseteq V_{IN}$ the image set $\partial A \subseteq V_{OUT}$ (the vertices adjacent to A) is sufficiently large.
  Theorem: G has a perfect matching if and only if for every subset $A \subseteq V_{IN}$ the inequality $|A| \leq |\partial A|$ holds.
  Often explained as follows: imagine two groups of N men and N women. If any subset of S boys (where $0 \leq S \leq N$) knows S or more girls, each boy can be married to a girl he knows.

CLOS (cont'd)
  proof for $M \geq N$ via HALL's Marriage Theorem (2)
  boy = ingress stage crossbar
  girl = egress stage crossbar
  a boy knows a girl if there exists a (direct) connection between them
  assume there is one free input and one free output left
  1) for $0 \leq S \leq R$ boys there are $S \cdot N$ connections, hence at least S girls
  2) thus, HALL's theorem states there exists a perfect matching
  3) R connections can be handled by one middle stage crossbar
  4) bundle these connections and delete the middle stage crossbar
  5) repeat from step 1) until M = 1
  6) the new connection can be handled, maybe rearrangement is necessary

CLOS (cont'd)
  proof for $M \geq N$ via HALL's Marriage Theorem (3)
  example with M = N
    initial situation: two connections cannot be established
    bundle connections on one middle stage crossbar and delete it afterwards
    maybe rearrangements are necessary
    repeat steps until M = 1, then all connections should be possible

CLOS (cont'd)
  proof for $M \geq 2N-1$ via worst case scenario
    an ingress crossbar with $N-1$ busy inputs and an egress crossbar with $N-1$ busy outputs, all connected to different middle stage crossbars ($2N-2$ in total)
    one further connection needs one more middle stage crossbar, hence $M \geq 2N-1$
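The CLOS construction and its blocking conditions can be turned into a small calculator. The sketch below is not from the slides: the function names are hypothetical, a "switch element" is counted as one crosspoint of a crossbar, and the stage sizes follow the description of R ingress crossbars (N x M), M middle crossbars (R x R) and R egress crossbars (M x N).

```python
def clos_crosspoints(n: int, m: int, r: int) -> int:
    """Crosspoints of a three-stage CLOS network with parameters N, M, R."""
    ingress = r * (n * m)  # R crossbars of size N x M
    middle = m * (r * r)   # M crossbars of size R x R
    egress = r * (m * n)   # R crossbars of size M x N
    return ingress + middle + egress

def crossbar_crosspoints(n: int, r: int) -> int:
    """Crosspoints of one full crossbar for the same R*N inputs and outputs."""
    return (r * n) ** 2

def clos_blocking_class(n: int, m: int) -> str:
    """Classify the network by the relative values of M and N."""
    if m >= 2 * n - 1:
        return "strict-sense non-blocking"
    if m >= n:
        return "rearrangeable non-blocking"
    return "blocking"


if __name__ == "__main__":
    n, r = 8, 8  # 64 inputs and 64 outputs (illustrative values)
    for m in (8, 15):
        print(m, clos_blocking_class(n, m),
              clos_crosspoints(n, m, r), crossbar_crosspoints(n, r))
```

For 64 inputs and outputs (N = R = 8), the rearrangeable variant with M = 8 needs 1536 crosspoints and the strict-sense variant with M = 15 needs 2880, compared with 4096 for a single crossbar.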

constant bisection bandwidth (CBB)
  more general concept of CLOS and fat tree networks
  construction of a non-blocking network connecting M nodes using multiple levels of basic $N \times N$ switch elements ($M > N$)
  for any given level, downstream BW (in direction to nodes) is identical to upstream BW (in direction to interconnection)
  key for non-blocking: always preserve identical bandwidth (upstream and downstream) between any two levels
  observation: for two-stage constant bisection bandwidth networks connecting M nodes, always $3M$ ports (i.e. sum of inputs and outputs) are necessary
  CBB frequently used for high-speed interconnects (InfiniBand, e.g.)

constant bisection bandwidth (cont'd)
  [figure: example CBB network on two levels, with the total number of ports and switch elements required]

Examples

in the past years, different (proprietary) high-performance networks have been established on the market
typically, these consist of
  a static and/or dynamic network topology
  sophisticated network interface cards (NIC)
popular networks
  Myrinet
  InfiniBand
  Scalable Coherent Interface (SCI)

Examples: Myrinet
  developed by Myricom in the 1990s for clusters
  particularly efficient due to
    usage of onboard (NIC) processors for protocol offload and low-latency, kernel-bypass operations (ParaStation, e.g.)
    highly scalable, cut-through switching
  switching
    rearrangeable non-blocking CLOS (128 nodes)
    spine of CLOS network consists of eight crossbars
    nodes are connected via line-cards with an $8 \times 8$ crossbar each

Examples: Myrinet (cont'd)
  programming model

[figure: an application uses either TCP/UDP over IP in the OS kernel (via Ethernet or Myrinet) or low-level message passing via mmap with a proprietary protocol (ParaStation, e.g.) or the Myrinet GM API]

Examples: InfiniBand
  unification of two competing efforts in 1999
    Future I/O initiative (Compaq, IBM, HP)
    Next-Generation I/O initiative (Dell, Intel, SUN et al.)
  idea: introduction of a future I/O standard as successor for PCI
    overcome the bottleneck of limited I/O bandwidth
    connection of hosts (via host channel adapters (HCA)) and devices (via target channel adapters (TCA)) to the I/O fabric
  switched point-to-point bidirectional links
  bonding of links for bandwidth improvements: 1x, 4x, 8x, and 12x link widths
  nowadays only used for cluster connection

Examples: InfiniBand (cont'd)
  particularly efficient (among others) due to
    protocol offload and reduced CPU utilisation
    Remote Direct Memory Access (RDMA), i.e. direct R/W access via HCA to local/remote memory without CPU usage/interrupts
  switching: constant bisection bandwidth (up to 288 nodes)

[figure: node with CPUs, memory controller, memory and HCA, connected by links via a switch to a TCA and another HCA]

Examples: Scalable Coherent Interface (SCI)
  originated as an offshoot from the IEEE Futurebus project in 1988
  became an IEEE standard in 1992
  SCI is a high performance interconnect technology that
    connects up to 64,000 nodes (both hosts and devices)
    supports remote memory access for read/write (NUMA)
    uses packet switching point-to-point communication

Examples: Scalable Coherent Interface (cont'd)
  shared memory: SCI uses a 64-bit fixed addressing scheme
    upper 16 bits: node on which the physical storage is located
    lower 48 bits: local physical address within memory
  hence, any physical memory location of the entire memory space can be mapped into a node's local memory

[figure: virtual address spaces of processes on node A and node B, mapped via mmap to the physical address spaces and imported/exported through the SCI address space]

  SCI controller monitors I/O transactions (memory) to assure cache coherence of all attached nodes, i.e. all write accesses that invalidate cache entries of other SCI modules are detected
  performance: bandwidths in the GBps range with latencies in the microsecond range
  different topologies such as ring or torus possible
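The SCI addressing scheme splits a global address into a node part and a local part, which can be illustrated with two small helpers. This is a sketch under assumptions: the names `sci_pack` and `sci_unpack` are hypothetical and not part of any SCI driver API, and the field widths assume a 64-bit address with a 16-bit node identifier in the upper bits and a 48-bit local physical address in the lower bits.

```python
NODE_BITS = 16    # assumed width of the node identifier (upper bits)
OFFSET_BITS = 48  # assumed width of the local physical address (lower bits)

def sci_pack(node_id: int, offset: int) -> int:
    """Combine node identifier and local address into one global address."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id does not fit into the node field")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit into the local address field")
    return (node_id << OFFSET_BITS) | offset

def sci_unpack(address: int) -> tuple[int, int]:
    """Split a global address into (node identifier, local address)."""
    return address >> OFFSET_BITS, address & ((1 << OFFSET_BITS) - 1)


if __name__ == "__main__":
    addr = sci_pack(3, 0x1000)
    print(hex(addr), sci_unpack(addr))
```

Under these assumptions a 16-bit node field distinguishes 65,536 nodes, which fits the order of magnitude quoted for the maximum number of attached nodes.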