High Performance Computing: Programming Paradigms and Scalability
Part 2: High-Performance Networks

PD Dr. rer. nat. habil. Ralf-Peter Mundani
Computation in Engineering (CiE), Scientific Computing (SCCS)
Summer Term

"640k is enough for anyone, and by the way, what's a network?"
(William Gates III, chairman Microsoft Corp., 1980s)

Overview
- some definitions
- static network topologies
- dynamic network topologies
- examples

Some definitions

reminder: protocols

  component model        ISO/OSI model           internet protocols (examples)
  application            application layer       data transfer, email
                         presentation layer
                         session layer
  communication system   transport layer         TCP, UDP
                         network layer           IP, ICMP, IGMP
  network                data link layer         network adaptation
                           (logical link control,
                            medium access control)
                         physical layer

degree (node degree)
- number of connections (incoming and outgoing) between this node and other nodes
- degree of a network: max. degree of all nodes in the network
- higher degrees lead to
  - more parallelism and bandwidth for the communication
  - more costs (due to a higher amount of connections)
- objective: keep the degree and, thus, the costs small

Some definitions (cont'd)

diameter
- distance of a pair of nodes: length of the shortest path between them, i.e. the amount of nodes a message has to pass on its way from the sender to the receiver
- diameter of a network: max. distance of all pairs of nodes in the network
- higher diameters (between two nodes) lead to
  - longer communications
  - less fault tolerance (due to the higher amount of nodes that have to work properly)
- objective: small diameter

connectivity
- min. amount of edges (cables) that have to be removed to disconnect the network, i.e. the network falls apart into two loose sub-networks
- higher connectivity leads to
  - more independent paths between two nodes
  - better fault tolerance (due to more routing possibilities)
  - faster communication (due to the avoidance of congestion in the network)
- objective: high connectivity

bisection width
- min. amount of edges (cables) that have to be removed to separate the network into two equal parts (bisection width $\geq$ connectivity)
- important for determining the amount of messages that can be transmitted in parallel from one half of the nodes to the other half without the repeated usage of any connection
- extreme case: Ethernet with bisection width 1
- objective: high bisection width (ideal: amount of nodes)

blocking
- a desired connection between two nodes cannot be established due to already existing connections between other pairs of nodes
- objective: non-blocking networks

fault tolerance (redundancy)
- connections between (arbitrary) nodes can still be established even under the breakdown of single components
- a fault-tolerant network has to provide at least one redundant path between all arbitrary pairs of nodes
- graceful degradation: the ability of a system to stay functional (maybe with less performance) even under the breakdown of single components

Some definitions (cont'd)

bandwidth
- max. transmission performance of a network for a certain amount of time
- bandwidth B is in general measured as megabits or megabytes per second (Mbps or MBps, resp.), nowadays more often as gigabits or gigabytes per second (Gbps or GBps, resp.)

bisection bandwidth
- max. transmission performance of a network over the bisection line, i.e. the sum of the single bandwidths of all edges (cables) that are cut when bisecting the network
- thus bisection bandwidth is a measure of bottleneck bandwidth
- units are the same as for bandwidth

Static network topologies

to be distinguished
- static networks
  - fixed connections between pairs of nodes
  - control functions are done by the nodes or by special connection hardware
- dynamic networks
  - no fixed connections between pairs of nodes
  - all nodes are connected via inputs and outputs to a so-called switching component
  - control functions are concentrated in the switching component
  - various routes can be switched

chain (linear array)
- one-dimensional network
- $N$ nodes and $N-1$ edges
- degree $= 2$
- diameter $= N-1$
- bisection width $= 1$
- drawback: too slow for large $N$

ring
- two-dimensional network
- $N$ nodes and $N$ edges
- degree $= 2$
- diameter $= \lfloor N/2 \rfloor$
- bisection width $= 2$
- drawback: too slow for large $N$
- how about fault tolerance?

chordal ring
- two-dimensional network
- $N$ nodes and $3N/2, 4N/2, 5N/2, \dots$ edges
- degree $= 3, 4, 5, \dots$
- higher degrees lead to
  - smaller diameters
  - higher fault tolerance (due to redundant connections)
- drawback: higher costs
- figure: ring with degree 3 (left) and degree 4 (right)

completely connected
- two-dimensional network
- $N$ nodes and $N(N-1)/2$ edges
- degree $= N-1$
- diameter $= 1$
- bisection width $= \lfloor N/2 \rfloor \cdot \lceil N/2 \rceil$
- very high fault tolerance
- drawback: too expensive for large $N$

star
- two-dimensional network
- $N$ nodes and $N-1$ edges
- degree $= N-1$
- diameter $= 2$
- bisection width $= \lfloor N/2 \rfloor$
- drawback: bottleneck in the central node
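The figures quoted for chain, ring, star and completely connected network can be cross-checked by brute force on small instances. The following is a minimal sketch, not part of the lecture: the helper names (`chain`, `ring`, `star`, `complete`, `degree`, `diameter`, `bisection_width`) are made up for illustration, distances are counted in hops, and the bisection search enumerates all balanced splits, so it is only usable for a handful of nodes.

```python
from collections import deque
from itertools import combinations

# Edge lists of the static topologies; nodes are numbered 0..n-1.
def chain(n):
    return [(i, i + 1) for i in range(n - 1)]

def ring(n):
    return [(i, (i + 1) % n) for i in range(n)]

def star(n):
    return [(0, i) for i in range(1, n)]          # node 0 is the centre

def complete(n):
    return list(combinations(range(n), 2))

def adjacency(n, edges):
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def degree(n, edges):
    """Degree of the network: maximum node degree."""
    return max(len(nb) for nb in adjacency(n, edges).values())

def diameter(n, edges):
    """Maximum over all pairs of the shortest-path length (in hops), via BFS."""
    adj = adjacency(n, edges)
    best = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best

def bisection_width(n, edges):
    """Minimum number of edges cut by any split into halves of size n//2 and n - n//2."""
    best = len(edges)
    for half in combinations(range(n), n // 2):
        half = set(half)
        cut = sum((a in half) != (b in half) for a, b in edges)
        best = min(best, cut)
    return best

if __name__ == "__main__":
    n = 8
    for name, build in [("chain", chain), ("ring", ring),
                        ("star", star), ("complete", complete)]:
        e = build(n)
        print(name, degree(n, e), diameter(n, e), bisection_width(n, e))
```

For `n = 8` the brute-force values agree with the closed forms, e.g. diameter `n - 1` for the chain and `n // 2` for the ring.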

binary tree
- two-dimensional network
- $N$ nodes and $N-1$ edges (tree height $h = \lfloor \mathrm{ld}\, N \rfloor$)
- degree $= 3$
- diameter $= 2h$
- bisection width $= 1$
- drawback: bottleneck in direction of the root (blocking)

binary tree (cont'd)
- addressing
  - a label on level $m$ consists of $m$ bits; the root has label 1
  - suffix 0 is added for the left son, suffix 1 for the right son
- routing
  - find the common parent node P of nodes S and D
  - ascend from S to P
  - descend from P to D

binary tree (cont'd)
- solution to overcome the bottleneck: fat tree
- edges on level $m$ get higher priority than edges on level $m+1$
- capacity is doubled on each higher level
- now, bisection width $= 2^{h-1}$
- frequently used: HLRB II, e.g.

mesh / torus
- $k$-dimensional network
- $N$ nodes and $k \cdot (N - r^{k-1})$ edges ($r = \sqrt[k]{N}$)
- degree $= 2k$
- diameter $= k \cdot (r-1)$
- bisection width $= r^{k-1}$
- high fault tolerance
- drawbacks
  - large diameter
  - too expensive for larger $k$
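The label-based routing in the binary tree (root labelled 1, one bit appended per level, ascend to the common parent P, then descend) can be sketched as below. This is an illustrative sketch only: labels are represented as bit strings, and the function name `tree_route` is an assumption, not something taken from the lecture.

```python
def tree_route(src, dst):
    """Return the labels visited on the way from src to dst (both inclusive).

    Labels are bit strings starting with the root label "1"; a left son
    appends "0", a right son appends "1" (representation assumed here).
    """
    if not (src.startswith("1") and dst.startswith("1")):
        raise ValueError("labels must start with the root label '1'")

    # The common parent P is the longest common prefix of both labels.
    common = 0
    for a, b in zip(src, dst):
        if a != b:
            break
        common += 1

    # Ascend: drop one trailing bit per step until just below P.
    up = [src[:i] for i in range(len(src), common, -1)]
    # Descend: start at P and append the bits of dst one by one.
    down = [dst[:i] for i in range(common, len(dst) + 1)]
    return up + down


if __name__ == "__main__":
    print(tree_route("100", "101"))   # siblings: one step up, one step down
    print(tree_route("100", "11"))    # path crosses the root
```

The number of hops is `len(path) - 1`; every route between the two halves of the tree passes the root, which is the bottleneck the fat tree addresses.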

mesh / torus (cont'd): torus
- $k$-dimensional mesh with cyclic connections in each dimension
- $N$ nodes and $k \cdot N$ edges ($r = \sqrt[k]{N}$)
- diameter $= k \cdot \lfloor r/2 \rfloor$
- bisection width $= 2 r^{k-1}$
- frequently used: BlueGene/L, e.g.
- drawback: too expensive for larger $k$

ILLIAC mesh
- two-dimensional network
- $N$ nodes and $2N$ edges ($r \times r$ mesh, $r = \sqrt{N}$)
- degree $= 4$
- diameter $= r-1$
- bisection width $= 2r$
- conforms to a chordal ring of degree 4

hypercube
- $k$-dimensional network
- $2^k$ nodes and $k \cdot 2^{k-1}$ edges
- degree $= k$
- diameter $= k$
- bisection width $= 2^{k-1}$
- drawback: scalability (only doubling of nodes allowed)

hypercube (cont'd)
- principle design
  - construction of a $k$-dimensional hypercube via connection of the corresponding nodes of two $(k-1)$-dimensional hypercubes
  - inherent labelling via adding prefix 0 to one sub-cube and prefix 1 to the other sub-cube
- figure: hypercubes of dimension 0, 1, 2 and 3
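The recursive construction of the hypercube (two copies of the next smaller cube, one prefixed with 0 and one with 1, with corresponding nodes connected) can be written down directly. The sketch below is illustrative; the function name `hypercube` and the bit-string labels are assumptions made for the example.

```python
def hypercube(k):
    """Return (nodes, edges) of a k-dimensional hypercube.

    Nodes are bit strings of length k. The cube is built from two
    (k-1)-dimensional cubes: one gets prefix "0", the other prefix "1",
    and corresponding nodes of the two sub-cubes are connected.
    """
    if k == 0:
        return [""], []
    nodes, edges = hypercube(k - 1)
    lo = ["0" + v for v in nodes]
    hi = ["1" + v for v in nodes]
    new_edges = [("0" + a, "0" + b) for a, b in edges]   # copy 1
    new_edges += [("1" + a, "1" + b) for a, b in edges]  # copy 2
    new_edges += list(zip(lo, hi))                       # links between copies
    return lo + hi, new_edges


def hamming(a, b):
    """Number of positions in which two equally long labels differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    k = 4
    nodes, edges = hypercube(k)
    print(len(nodes), len(edges))                  # 2**k nodes, k * 2**(k-1) edges
    print(all(hamming(a, b) == 1 for a, b in edges))
```

Each recursion step doubles the node count, which is exactly the scalability drawback: a hypercube can only grow by doubling.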

hypercube (cont'd)
- nodes are directly connected for a HAMMING distance of 1 only
- routing
  - compute $S \oplus D$ (xor) for the possible ways between nodes S and D
  - route frames in increasing or decreasing bit order until the final destination is reached
- example (figure): for a given S and D, the set bits of $S \oplus D$ are traversed once in decreasing and once in increasing order, giving two different paths from S to D

Dynamic network topologies

bus
- simple and cheap single stage network
- shared usage from all connected nodes, thus just one frame transfer at any point in time
- frame transfer in one step (i.e. diameter $= 1$)
- good extensibility, but bad scalability
- fault tolerance only for multiple bus systems
- example: Ethernet
- figure: single bus, multiple bus (here dual)

crossbar
- completely connected network with all possible permutations of $N$ inputs and $N$ outputs (in general $N \times M$ inputs / outputs)
- switch elements allow simultaneous communication between all possible disjoint pairs of inputs and outputs without blocking
- very fast (diameter $= 1$), but expensive due to $N^2$ switch elements
- used for processor-processor and processor-memory coupling
- example: The Earth Simulator
- figure: inputs, outputs, switch elements
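The hypercube routing rule (compute S xor D, then correct the differing bits one dimension at a time) is sketched below. Node labels are plain integers here, and "increasing" is taken to mean lowest bit first; both choices, as well as the name `hypercube_route`, are assumptions made for this example.

```python
def hypercube_route(src, dst, k, increasing=True):
    """Route from src to dst in a k-dimensional hypercube.

    Every set bit of src ^ dst marks a dimension in which the frame still
    has to move; flipping one such bit is one hop to a direct neighbour
    (HAMMING distance 1). The bits are processed lowest-first or
    highest-first, depending on `increasing`.
    """
    diff = src ^ dst
    order = range(k) if increasing else range(k - 1, -1, -1)
    path = [src]
    cur = src
    for bit in order:
        if diff & (1 << bit):
            cur ^= 1 << bit
            path.append(cur)
    return path


if __name__ == "__main__":
    s, d, k = 0b010, 0b111, 3
    for inc in (True, False):
        print([format(v, "03b") for v in hypercube_route(s, d, k, inc)])
```

Both orders need the same number of hops, namely the number of set bits in `src ^ dst`; they only differ in the intermediate nodes visited.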

permutation networks
- tradeoff between the low performance of buses and the high hardware costs of crossbars
- often a $2 \times 2$ crossbar as basic element
- $N$ inputs can simultaneously be switched to $N$ outputs: permutation of inputs (to outputs)
- single stage: consists of one column of $2 \times 2$ switch elements
- multistage: consists of several of those columns
- switch states (figure): straight, crossed, upper broadcast, lower broadcast

permutation networks (cont'd)
- permutations: unique (bijective) mapping of inputs to outputs
- addressing
  - label inputs from $0$ to $2N-1$ (in case of $N$ switch elements)
  - write labels in binary representation $(a_K, a_{K-1}, \dots, a_2, a_1)$
- permutations can now be expressed as simple bit manipulation
- typical permutations
  - perfect shuffle
  - butterfly
  - exchange

permutation networks (cont'd)
- perfect shuffle permutation: cyclic left shift
  $P(a_K, a_{K-1}, \dots, a_2, a_1) = (a_{K-1}, \dots, a_2, a_1, a_K)$

permutation networks (cont'd)
- butterfly permutation: exchange of first (highest) and last (lowest) bit
  $B(a_K, a_{K-1}, \dots, a_2, a_1) = (a_1, a_{K-1}, \dots, a_2, a_K)$

permutation networks (cont'd)
- exchange permutation: negation of last (lowest) bit
  $E(a_K, a_{K-1}, \dots, a_2, a_1) = (a_K, a_{K-1}, \dots, a_2, \bar{a}_1)$

permutation networks (cont'd)
- example: perfect shuffle connection pattern
- problem: not all destinations are accessible from a source

permutation networks (cont'd)
- adding additional exchange permutations (shuffle-exchange)
- all destinations are now accessible from any source

omega
- based on the shuffle-exchange connection pattern
- exchange permutations replaced by $2 \times 2$ switch elements
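Since the three typical permutations are plain bit manipulations on the binary input label, they fit in a few lines. The sketch below treats a label as an integer with `k` bits; the function names and the `k` parameter are choices made for this illustration.

```python
def perfect_shuffle(a, k):
    """Cyclic left shift of the k-bit label a."""
    msb = (a >> (k - 1)) & 1
    return ((a << 1) & ((1 << k) - 1)) | msb

def butterfly(a, k):
    """Exchange the highest and the lowest bit of the k-bit label a."""
    hi = (a >> (k - 1)) & 1
    lo = a & 1
    if hi != lo:                      # swapping equal bits changes nothing
        a ^= (1 << (k - 1)) | 1
    return a

def exchange(a, k):
    """Negate the lowest bit of the label a (k is unused, kept for symmetry)."""
    return a ^ 1

def reachable(start, k, steps):
    """Labels reachable from `start` by repeatedly applying the given permutations."""
    seen = {start}
    frontier = [start]
    while frontier:
        a = frontier.pop()
        for step in steps:
            b = step(a, k)
            if b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

if __name__ == "__main__":
    k = 3
    print([perfect_shuffle(a, k) for a in range(2 ** k)])
    print(sorted(reachable(1, k, [perfect_shuffle])))            # shuffle only
    print(sorted(reachable(1, k, [perfect_shuffle, exchange])))  # shuffle-exchange
```

With the shuffle alone, a source only reaches the rotations of its own label; adding the exchange step makes every label reachable, which is the motivation for the shuffle-exchange pattern.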

omega (cont'd)
- multistage network with $N$ nodes and $E = \frac{N}{2} \cdot \mathrm{ld}\, N$ switch elements
- diameter $= \mathrm{ld}\, N$ (all stages have to be passed)
- $N!$ permutations possible, but only $2^E$ different switch states
- (self-configuring) routing
  - compare the addresses of S and D bitwise from left to right, i.e. stage $i$ evaluates address bits $s_i$ and $d_i$
  - if equal, switch straight (0), otherwise switch crossed (1)
- example (figure): the switch states along the path from S to D are the bits of $S \oplus D$

omega (cont'd)
- omega is a bidelta network: it also operates backwards
- drawback: blocking possible

banyan / butterfly
- idea: unrolling of a static hypercube
- bitwise processing of address bits $a_i$ from left to right
- dynamic hypercube a.k.a. butterfly (known from the FFT flow diagram)

banyan / butterfly (cont'd)
- replace crossed connections by $2 \times 2$ switch elements
- introduced by GOKE and LIPOVSKI in 1973; blocking still possible
- figure: banyan tree
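The self-configuring routing in the omega network can be simulated stage by stage: a perfect shuffle brings the next address bit into the lowest position, and the switch either leaves it (straight) or negates it (crossed). The sketch below assumes integer labels with `k = ld N` bits; the function name `omega_route` is made up for illustration.

```python
def omega_route(src, dst, k):
    """Simulate routing through a k-stage omega network.

    Returns (switch_states, output), where switch_states[i] is 0 for
    "straight" and 1 for "crossed" in stage i, and output is the label
    at which the frame leaves the last stage.
    """
    mask = (1 << k) - 1
    cur = src
    states = []
    for i in range(k):
        # Perfect shuffle (cyclic left shift) in front of stage i.
        cur = ((cur << 1) & mask) | (cur >> (k - 1))
        # Stage i compares the i-th address bits, counted from the left.
        s_bit = (src >> (k - 1 - i)) & 1
        d_bit = (dst >> (k - 1 - i)) & 1
        crossed = s_bit ^ d_bit
        states.append(crossed)
        # A crossed switch is an exchange permutation: negate the lowest bit.
        cur ^= crossed
    return states, cur


if __name__ == "__main__":
    states, out = omega_route(0b010, 0b111, 3)
    print(states, format(out, "03b"))
```

The simulation only follows a single frame; it does not detect the blocking that occurs when two frames need different states of the same switch element.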

BENEŠ
- multistage network with $N$ nodes and $N (\mathrm{ld}\, N) - N/2$ switch elements
- butterfly merged at the last column with its copied mirror
- diameter $= 2 (\mathrm{ld}\, N) - 1$
- $N!$ permutations possible, all can be switched
- key property: for any permutation of inputs to outputs there is a contention-free routing

BENEŠ (cont'd)
- example (figure): two simultaneous connections from different sources to different destinations lead to blocking in the butterfly

BENEŠ (cont'd)
- example (figure): the same two connections can be switched without blocking in the BENEŠ network

CLOS
- proposed by CLOS in 1953 for telephone switching systems
- objective: overcome the costs of crossbars ($N^2$ switch elements)
- idea: replace the entire crossbar with three stages of smaller ones
  - ingress stage: $R$ crossbars with $N \times M$ inputs / outputs
  - middle stage: $M$ crossbars with $R \times R$ inputs / outputs
  - egress stage: $R$ crossbars with $M \times N$ inputs / outputs
- thus much fewer switch elements than for the entire system
- any incoming frame is routed from the input via one of the middle stage crossbars to the respective output
- a middle stage crossbar is available if both links to the ingress and egress stage are free
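The saving in switch elements can be made concrete by counting crosspoints for the three CLOS stages and comparing them with a single crossbar serving the same R·N inputs and outputs. This is a rough sketch under the assumption that an a × b crossbar costs a·b switch elements; the function names are chosen for the example.

```python
def crossbar_elements(inputs, outputs):
    """Switch elements of a single crossbar (one per input/output pair)."""
    return inputs * outputs

def clos_elements(n, m, r):
    """Switch elements of a three-stage CLOS network.

    n: inputs per ingress crossbar (and outputs per egress crossbar)
    m: number of middle stage crossbars
    r: number of ingress (and egress) crossbars
    """
    ingress = r * crossbar_elements(n, m)
    middle = m * crossbar_elements(r, r)
    egress = r * crossbar_elements(m, n)
    return ingress + middle + egress

if __name__ == "__main__":
    n, r = 8, 8                      # 64 inputs and 64 outputs in total
    full = crossbar_elements(n * r, n * r)
    for m in (n, 2 * n - 1):
        print(m, clos_elements(n, m, r), full)
```

For small systems the three-stage design can even be more expensive than one crossbar; the advantage shows up as the total number of inputs grows.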

CLOS (cont'd)
- $R \cdot N$ inputs can be assigned to $R \cdot N$ outputs
- figure: three-stage CLOS network with ingress, middle and egress stage crossbars

CLOS (cont'd)
- the relative values of $M$ and $N$ define the blocking characteristics
- $M \geq N$: rearrangeable non-blocking
  - a free input can always be connected to a free output
  - existing connections might be assigned to different middle stage crossbars (rearrangement)
- $M \geq 2N - 1$: strict-sense non-blocking
  - a free input can always be connected to a free output
  - no re-assignment necessary

reminder: bipartite graph
- definition: a graph whose vertices can be divided into two disjoint sets $U$ and $V$ such that every edge connects a vertex in $U$ to one in $V$; that is, $U$ and $V$ are each independent sets
- figure: Alice, Bob, Carol (set $U$) and nurse, pilot, lawyer (set $V$); there are no edges within $U$ or within $V$, only between $U$ and $V$

reminder: perfect matching
- definition: a perfect matching (a.k.a. 1-factor) is a matching that matches all vertices of a graph, i.e. every vertex is incident to exactly one edge of the matching
- problem: a perfect matching for a bipartite graph is to be found
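A perfect matching in a small bipartite graph can be found with the classic augmenting-path method. The sketch below reuses the names from the reminder (Alice, Bob, Carol and nurse, pilot, lawyer), but the edge set is hypothetical, since the edges of the original figure are not recoverable; the function name `perfect_matching` is likewise an assumption.

```python
def perfect_matching(edges, left, right):
    """Return a dict left -> right describing a perfect matching, or None.

    Uses augmenting paths: a left vertex takes a free right vertex, or
    pushes the current owner of a right vertex on to another partner.
    """
    adj = {u: [v for (a, v) in edges if a == u] for u in left}
    match = {}                                   # right vertex -> left vertex

    def try_assign(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match or try_assign(match[v], seen):
                match[v] = u
                return True
        return False

    for u in left:
        if not try_assign(u, set()):
            return None                          # u cannot be matched
    if len(match) != len(right):
        return None                              # some right vertex stays free
    return {u: v for v, u in match.items()}


if __name__ == "__main__":
    U = ["Alice", "Bob", "Carol"]
    V = ["nurse", "pilot", "lawyer"]
    # Hypothetical edges (who could take which job), for illustration only.
    E = [("Alice", "nurse"), ("Alice", "pilot"), ("Bob", "nurse"),
         ("Carol", "pilot"), ("Carol", "lawyer")]
    print(perfect_matching(E, U, V))
```

In this example Alice first takes "nurse" and is later moved to "pilot" so that Bob can be matched, which mirrors the rearrangement of existing connections in a rearrangeable non-blocking CLOS network.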

CLOS (cont'd): proof for $M \geq N$ via HALL's Marriage Theorem (1)
- Let $G = (V_{IN}, V_{OUT}, E)$ be a bipartite graph.
- A perfect matching for $G$ is an injective function $f : V_{IN} \to V_{OUT}$ so that for every $x \in V_{IN}$ there is an edge in $E$ whose endpoints are $x$ and $f(x)$.
- One would expect a perfect matching to exist if $G$ contains enough edges, i.e. if for every subset $A \subseteq V_{IN}$ the image set $\Gamma(A) \subseteq V_{OUT}$ (the vertices adjacent to $A$) is sufficiently large.
- Theorem: $G$ has a perfect matching if and only if for every subset $A \subseteq V_{IN}$ the inequality $|A| \leq |\Gamma(A)|$ holds.
- Often explained as follows: imagine two groups of $N$ men and $N$ women. If any subset of $S$ boys (where $0 \leq S \leq N$) knows $S$ or more girls, each boy can be married to a girl he knows.

CLOS (cont'd): proof for $M \geq N$ via HALL's Marriage Theorem (2)
- boy = ingress stage crossbar, girl = egress stage crossbar
- a boy knows a girl if there exists a (direct) connection between them
- assume there is one free input and one free output left
  1) for $S \leq R$ boys there are $S \cdot N$ connections, hence at least $S$ girls
  2) thus, HALL's theorem states that there exists a perfect matching
  3) $R$ connections can be handled by one middle stage crossbar
  4) bundle these connections and delete the middle stage crossbar
  5) repeat from step 1) until $M = 1$
  6) the new connection can be handled, maybe a rearrangement is necessary

CLOS (cont'd): proof for $M \geq N$ via HALL's Marriage Theorem (3)
- example with $M = N$
- initial situation: two connections cannot be established
- bundle connections on one middle stage crossbar and delete it afterwards
- maybe rearrangements are necessary
- repeat the steps until $M = 1$, then all connections should be possible

CLOS (cont'd): proof for $M \geq 2N - 1$ via worst case scenario
- an ingress crossbar with $N - 1$ busy inputs and an egress crossbar with $N - 1$ busy outputs, all connected to different middle stage crossbars
- one further connection between these two crossbars needs one more middle stage crossbar, hence $M \geq (N-1) + (N-1) + 1 = 2N - 1$
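HALL's condition can be checked literally by enumerating all subsets of one side and comparing each subset with the set of its neighbours. The sketch below does this by brute force, so it is only meant for tiny graphs; the function name `hall_condition` and the example edges are assumptions made for illustration.

```python
from itertools import combinations

def hall_condition(edges, left):
    """Check |A| <= |neighbours(A)| for every non-empty subset A of `left`.

    Returns (True, None) if the condition holds, otherwise (False, A)
    with a violating subset A.
    """
    neighbours = {u: {v for (a, v) in edges if a == u} for u in left}
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            image = set().union(*(neighbours[u] for u in subset))
            if len(image) < len(subset):
                return False, subset
    return True, None


if __name__ == "__main__":
    boys = ["b1", "b2", "b3"]
    # Hypothetical "knows" relations between boys and girls.
    knows = [("b1", "g1"), ("b1", "g2"), ("b2", "g1"),
             ("b3", "g2"), ("b3", "g3")]
    print(hall_condition(knows, boys))

    # b1 and b2 both know only g1: the subset {b1, b2} violates the condition.
    too_few = [("b1", "g1"), ("b2", "g1"), ("b3", "g2"), ("b3", "g3")]
    print(hall_condition(too_few, boys))
```

In the CLOS proof the boys and girls are the ingress and egress crossbars, and the regular structure of the connections guarantees that the condition holds, so no such enumeration is needed there.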

constant bisection bandwidth (CBB)
- more general concept of CLOS and fat tree networks
- construction of a non-blocking network connecting $M$ nodes using multiple levels of basic $N \times N$ switch elements ($M > N$)
- for any given level, the downstream BW (in direction to the nodes) is identical to the upstream BW (in direction to the interconnection)
- key for non-blocking: always preserve identical bandwidth (upstream and downstream) between any two levels
- observation: for two-stage constant bisection bandwidth networks connecting $M$ nodes, always $3M$ ports (i.e. sum of inputs and outputs) are necessary
- CBB frequently used for high-speed interconnects (InfiniBand, e.g.)

constant bisection bandwidth (cont'd)
- example (figure): two-level CBB network (level 1, level 2) built from identical switch elements; the total number of ports follows the $3M$ rule

Examples

- in the past years, different (proprietary) high-performance networks have become established on the market
- typically, these consist of
  - a static and / or dynamic network topology
  - sophisticated network interface cards (NIC)
- popular networks
  - Myrinet
  - InfiniBand
  - Scalable Coherent Interface (SCI)
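The port count of a two-level constant bisection bandwidth network can be reproduced with a little arithmetic. The sketch below makes assumptions that are not stated in the slides: every switch element has `p` bidirectional ports, a leaf switch uses half of them towards the nodes and half towards the second level, and the function name `two_level_cbb` is made up.

```python
def two_level_cbb(nodes, ports_per_switch):
    """Return (leaf_switches, spine_switches, total_ports) for a two-level CBB network.

    Assumptions (illustrative): identical switches with an even number of
    bidirectional ports; leaves split their ports evenly between downstream
    and upstream, so both directions carry the same bandwidth.
    """
    p = ports_per_switch
    if p % 2:
        raise ValueError("switch needs an even number of ports")
    if nodes % p:
        raise ValueError("node count must be a multiple of the port count")
    if nodes > p * p // 2:
        raise ValueError("too many nodes for two levels of p-port switches")

    leaves = nodes // (p // 2)     # p/2 nodes hang below each leaf switch
    spines = nodes // p            # spines terminate the same number of uplinks
    total_ports = (leaves + spines) * p
    return leaves, spines, total_ports


if __name__ == "__main__":
    for m in (16, 32):
        leaves, spines, ports = two_level_cbb(m, 8)
        print(m, leaves, spines, ports, ports == 3 * m)
```

Under these assumptions the total is always three ports per node: one towards the node, one upstream on the leaf, and one on the second-level switch.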

Examples: Myrinet
- developed by Myricom in the 1990s for clusters
- particularly efficient due to
  - usage of onboard (NIC) processors for protocol offload and low-latency, kernel-bypass operations (ParaStation, e.g.)
  - highly scalable, cut-through switching
- switching
  - rearrangeable non-blocking CLOS (128 nodes)
  - the spine of the CLOS network consists of eight crossbars
  - nodes are connected via line-cards with an $8 \times 8$ crossbar each

Examples: Myrinet (cont'd)
- programming model (figure labels): Application; TCP, UDP; IP; Ethernet, Myrinet (OS kernel); low level message passing; mmap; proprietary protocol (ParaStation, e.g.); Myrinet GM API

Examples: InfiniBand
- unification of two competing efforts in 1999
  - Future I/O initiative (Compaq, IBM, HP)
  - Next-Generation I/O initiative (Dell, Intel, SUN et al.)
- idea: introduction of a future I/O standard as successor for PCI
  - overcome the bottleneck of limited I/O bandwidth
  - connection of hosts (via host channel adapters (HCA)) and devices (via target channel adapters (TCA)) to the I/O fabric
- switched point-to-point bidirectional links
- bonding of links for bandwidth improvements: link widths 1×, 4×, 8×, and 12×
- nowadays only used for cluster connection

Examples: InfiniBand (cont'd)
- particularly efficient (among others) due to
  - protocol offload and reduced CPU utilisation
  - Remote Direct Memory Access (RDMA), i.e. direct R/W access via HCA to local/remote memory without CPU usage/interrupts
- switching: constant bisection bandwidth
- figure: node with CPUs, memory controller, memory and HCA, connected via links and a switch to a TCA and to the HCA of another node

Examples: Scalable Coherent Interface (SCI)
- originated as an offshoot from the IEEE Futurebus project in 1988
- became an IEEE standard in 1992
- SCI is a high performance interconnect technology that
  - connects up to 64,000 nodes (both hosts and devices)
  - supports remote memory access for read/write (NUMA)
  - uses packet switching point-to-point communication

Examples: Scalable Coherent Interface (cont'd)
- shared memory: SCI uses a 64-bit fixed addressing scheme
  - upper 16 bits: node on which the physical storage is located
  - lower 48 bits: local physical address within memory
- hence, any physical memory location of the entire memory space can be mapped into a node's local memory
- SCI controller monitors I/O transactions (memory) to assure cache coherence of all attached nodes, i.e. all write accesses that invalidate cache entries of other SCI modules are detected
- performance: bandwidths in the GBps range with very low latencies
- different topologies such as ring or torus possible
- figure: node A exports a region of its physical address space into the SCI address space, node B imports it; on both nodes the region is mapped (mmap) into the virtual address space of a process P
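The fixed 64-bit SCI addressing scheme (16 bits for the node, 48 bits for the local physical address) amounts to simple bit packing. The following sketch only illustrates the address arithmetic; the function names `sci_pack` and `sci_unpack` are hypothetical and do not correspond to any real SCI driver API.

```python
NODE_BITS = 16      # upper bits: node that holds the physical storage
LOCAL_BITS = 48     # lower bits: local physical address on that node

def sci_pack(node_id, local_addr):
    """Combine node id and local physical address into one 64-bit SCI address."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node id does not fit into 16 bits")
    if not 0 <= local_addr < (1 << LOCAL_BITS):
        raise ValueError("local address does not fit into 48 bits")
    return (node_id << LOCAL_BITS) | local_addr

def sci_unpack(addr):
    """Split a 64-bit SCI address into (node id, local physical address)."""
    return addr >> LOCAL_BITS, addr & ((1 << LOCAL_BITS) - 1)

if __name__ == "__main__":
    addr = sci_pack(0x0042, 0x0000_1000_2000)
    print(hex(addr), sci_unpack(addr))
    print(1 << NODE_BITS)       # number of addressable nodes
```

A 16-bit node field addresses 65,536 nodes, which matches the order of magnitude quoted for the maximum system size.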