Multiprocessors
Prof. Robert van Engelen
Overview
The PMS model
Shared memory multiprocessors
Basic shared memory systems
SMP, Multicore, and COMA
Distributed memory multicomputers
MPP systems
Network topologies for message-passing multicomputers
Distributed shared memory
Pipeline and vector processors
Comparison
Taxonomies
PMS Architecture Model
A simple PMS model
Processor (P): a device that performs operations on data
Memory (M): a device that stores data
Switch (S): a device that facilitates transfer of data between devices
Arcs denote connectivity
An example computer system with CPU and peripherals
Each component has different performance characteristics
Shared Memory Multiprocessor
Processors access shared memory via a common switch, e.g. a bus
Problem: a single bus results in a bottleneck
Shared memory has a single address space
Architecture sometimes referred to as a dance hall
Shared Memory: the Bus Contention Problem
Each processor competes for access to shared memory
Fetching instructions
Loading and storing data
Performance of a single bus S: bus contention
Access to memory is restricted to one processor at a time
This limits the speedup and scalability with respect to the number of processors
Assume that each instruction requires 0 < m < 1 memory operations (the average fraction of loads or stores per instruction), F instructions are performed per unit of time, and a maximum of W words can be moved over the bus per unit of time; then the speedup S_P < W / (m F) regardless of the number of processors P
In other words, the parallel efficiency is limited unless P < W / (m F)
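A worked example with hypothetical numbers (assumed for illustration, not from the slides): if the bus can move W = 10^9 words per unit time, each processor executes F = 10^9 instructions per unit time, and m = 0.5, then S_P < 10^9 / (0.5 * 10^9) = 2, so no number of processors on this bus can achieve more than a twofold speedup.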
Shared Memory: Work-Memory Ratio
for (i=0; i<1000; i++) x = x + i;
2 distinct memory locations and one float add per iteration (1000 adds total): FP:M = 1000/2 = 500
for (i=0; i<N; i++) x = x + a[i]*b[i];
2N+1 distinct memory locations and 2N FP operations: FP:M = 1 when N is large
Work-memory ratio (FP:M ratio): ratio of the number of floating point operations to the number of distinct memory locations referenced in the innermost loop
The same location is counted just once in the innermost loop
Assumes effective use of registers (and cache) in the innermost loop
Assumes no reuse across outer loops (register/cache use saturated in the inner loop)
Note that FP:M = 1/m - 1, so efficient utilization of shared memory multiprocessors requires P < (FP:M + 1) W / F
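Continuing the hypothetical W/F = 1 from the previous example: the first loop (FP:M = 500) allows P < 501 processors to be used efficiently, while the dot product (FP:M = 1) allows only P < 2, so it is bus-bound almost immediately.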
Shared Memory Multiprocessor with Local Cache
Add a local cache to improve performance when W / F is small
With today's systems we have W / F << 1
Problem: how to ensure cache coherence?
Shared Memory: Cache Coherence
A cache coherence protocol ensures that processors obtain newly altered data when shared data is modified by another processor
Because caches operate on cache lines, more data than the shared object alone can be affected, which may lead to false sharing
[Figure: thread 1 modifies shared data; thread 0 then reads it, and the cache coherence protocol ensures that thread 0 obtains the newly altered data]
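A minimal C sketch of false sharing (illustrative, not from the slides): two threads update adjacent counters; without the padding both counters share one cache line and every write invalidates the other thread's copy. The 64-byte line size and the use of pthreads are assumptions.

#include <pthread.h>
#include <stdio.h>

#define LINE 64  /* assumed cache line size in bytes */

struct counters {
    long a;
    char pad[LINE - sizeof(long)];  /* padding puts b on its own cache line;
                                       remove it to observe false sharing */
    long b;
};

static struct counters c;

static void *worker_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) c.a++;  /* thread 0's counter */
    return NULL;
}

static void *worker_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) c.b++;  /* thread 1's counter */
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, worker_a, NULL);
    pthread_create(&tb, NULL, worker_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("a=%ld b=%ld\n", c.a, c.b);
    return 0;
}

Compile with -pthread; with the padding removed, the coherence traffic on the shared line typically slows both threads down even though they never touch the same variable.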
COMA
Cache-only memory architecture (COMA)
Large cache per processor to replace shared memory
A data item is either in one cache (non-shared) or in multiple caches (shared)
Switch includes an engine that provides a single global address space and ensures cache coherence
Distributed Memory Multicomputer
Massively parallel processor (MPP) systems with P > 1000
Communication via message passing
Nonuniform memory access (NUMA)
Network topologies
Mesh
Hypercube
Cross-bar switch
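The slides do not name a library, but MPI is the standard message-passing interface for such machines; a minimal sketch in C (ranks and payload are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this node's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of nodes P */
    if (rank == 0) {
        value = 42;  /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* to rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

Run with e.g. mpirun -np 2 ./a.out; all data exchange is explicit, in contrast to the shared memory systems above.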
Distributed Shared Memory
Distributed shared memory (DSM) systems use physically distributed memory modules and a global address space that gives the illusion of shared virtual memory, which is usually NUMA
Hardware is used to automatically translate a memory address into a local address or a remote memory address (via message passing)
Software approaches add a programming layer to simplify access to shared objects (hiding the message passing communications)
Computation-Communication Ratio
The computation-communication ratio: t_comp / t_comm
Usually assessed analytically and verified empirically
High communication overhead decreases speedup, so the ratio should be as high as possible
For example: for a given data size and number of processors P, with t_comp / t_comm = 1000 / 10^2 the speedup is S_P = t_s / t_P = 1000 / (1000/P + 10^2)
[Figure: speedup curves for P = 1, 2, 4, 8]
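Evaluating the slide's formula for the plotted processor counts: S_1 = 1000/1100 ≈ 0.91, S_2 = 1000/600 ≈ 1.67, S_4 = 1000/350 ≈ 2.86, S_8 = 1000/225 ≈ 4.44; as P grows the speedup saturates at t_s / t_comm = 1000/100 = 10, the computation-communication ratio itself.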
Mesh Topology
A network of P nodes has mesh size √P × √P
Diameter 2 (√P - 1)
A torus network wraps the ends
Diameter √P - 1
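For example, P = 16 gives a 4 × 4 mesh with diameter 2 (4 - 1) = 6: a message between opposite corner nodes must make 3 hops in each of the two dimensions.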
Hypercube Topology
A d-dimensional hypercube has P = 2^d nodes
Diameter is d = log2 P
Node addressing is simple: the node number of a nearest-neighbor node differs in exactly one bit
Routing algorithm flips bits to determine possible paths, e.g. from node 001 to 111 there are two shortest paths: 001 -> 011 -> 111 and 001 -> 101 -> 111
[Figure: hypercubes for d = 2, 3, 4]
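A small C sketch of the bit-flip routing idea (a minimal illustration; the usual e-cube order of fixing bits from least to most significant is an assumption):

#include <stdio.h>

/* Print one shortest path from node src to node dst in a d-cube by
   flipping each differing address bit in turn (e-cube routing). */
static void route(unsigned src, unsigned dst, int d) {
    unsigned node = src, diff = src ^ dst;  /* bits that must be flipped */
    printf("%u", node);
    for (int b = 0; b < d; b++)
        if (diff & (1u << b)) {
            node ^= 1u << b;       /* hop to the neighbor differing in bit b */
            printf(" -> %u", node);
        }
    printf("\n");
}

int main(void) {
    route(1, 7, 3);  /* prints 1 -> 3 -> 7, i.e. 001 -> 011 -> 111,
                        one of the slide's two shortest paths */
    return 0;
}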
Cross-bar Switches
Processors and memories are connected by a set of switches
Enables simultaneous (contention-free) communication between processor i and memory s(i), where s is an arbitrary permutation of 1..P
[Figure: cross-bar switch realizing s(1)=2, s(2)=1, s(3)=3]
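A small C check (illustrative, not from the slides) that a requested switch setting s is a permutation, which is exactly the condition for the crossbar to serve all requests at once:

#include <stdbool.h>
#include <stdio.h>

/* s[i] is the memory module requested by processor i (0-based here).
   The crossbar is contention-free iff no module is requested twice. */
static bool contention_free(const int s[], int P) {
    bool used[64] = { false };  /* assumes P <= 64 for this sketch */
    for (int i = 0; i < P; i++) {
        if (s[i] < 0 || s[i] >= P || used[s[i]]) return false;
        used[s[i]] = true;
    }
    return true;
}

int main(void) {
    int s[] = { 1, 0, 2 };  /* the slide's s(1)=2, s(2)=1, s(3)=3, 0-based */
    printf("%s\n", contention_free(s, 3) ? "contention free" : "contention");
    return 0;
}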
Multistage Interconnect Network
[Figures: a 4x4 two-stage interconnect and an 8x8 three-stage interconnect]
Each switch has an upper output (0) and a lower output (1)
A message travels through a switch based on the destination address
Each bit in the destination address is used to control one switch on the way from start to destination
For example, from 001 to 100:
First switch selects the lower output (1)
Second switch selects the upper output (0)
Third switch selects the upper output (0)
Contention can occur when two messages are routed through the same switch
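A C sketch of destination-based routing through the stages (a minimal model: stage k consumes one destination bit, 0 = upper output, 1 = lower output; consuming bits from most to least significant is an assumption):

#include <stdio.h>

/* Trace the switch outputs taken by a message in an n-stage network:
   at each stage the next destination bit selects the upper (0) or
   lower (1) output of the current switch. */
static void trace(unsigned dst, int stages) {
    for (int k = stages - 1; k >= 0; k--) {
        int bit = (dst >> k) & 1;
        printf("stage %d: %s output (%d)\n",
               stages - k, bit ? "lower" : "upper", bit);
    }
}

int main(void) {
    trace(4, 3);  /* destination 100: lower (1), upper (0), upper (0),
                     matching the slide's 001 -> 100 example */
    return 0;
}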
Pipeline and Vector Processors
Original loop:
DO i = 0,9999
  z(i) = x(i) + y(i)
ENDDO
Strip-mined loop:
DO j = 0,9216,512
  DO i = 0,511
    z(j+i) = x(j+i) + y(j+i)
  ENDDO
ENDDO
DO i = 0,271
  z(9728+i) = x(9728+i) + y(9728+i)
ENDDO
Vectorized form:
DO j = 0,9216,512
  z(j:j+511) = x(j:j+511) + y(j:j+511)
ENDDO
z(9728:9999) = x(9728:9999) + y(9728:9999)
Vector processors run operations on multiple data elements simultaneously
A vector processor has a maximum vector length, e.g. 512
Strip mining the loop results in an outer loop with stride 512 to enable vectorization of longer vector operations
Pipelined vector architectures dispatch multiple vector operations per clock cycle
Vector chaining allows the result of a previous vector operation to be directly fed into the next operation in the pipeline
Comparison: Bandwidth, Latency and Capacity
Further Reading
[PP2] pages 13-26
[SPC] pages 71-95
[] pages 25-28