EE/CSCI 451: Parallel and Distributed Computation

1 EE/CSCI 451: Parallel and Distributed Computation. Lecture #4, 1/24/2018. Xuehai Qian, University of Southern California.

2 Announcements: PA #1 is due on Jan 26th (Friday). HW #1 is out today and due on Feb 7th.

3 Outline. From last class: memory systems (latency, bandwidth; impact of cache on program performance). Today: control structure; programming models (shared memory, message passing); interconnection networks (Chap ).

4 Control structure of parallel platforms (1): SIMD (single instruction, multiple data).
    for i = 0 to n - 1 do
        C(i) ← A(i) + B(i)
    end
A control unit issues each instruction to all processing elements (PE 0, PE 1, ..., PE n-1), which reach memory through an interconnection network. Lockstep operation: all PEs execute instruction j (on all data), then instruction j + 1. Low synchronization overhead. Example: GPU execution model.
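The lockstep loop on this slide can be sketched in plain Python. This is only a model of the control structure, not real SIMD hardware: the function name `simd_add` and the two-phase split (load, then add) are assumptions made for illustration.

```python
def simd_add(a, b):
    """Model n processing elements executing the same instruction stream in lockstep."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    n = len(a)
    # Instruction j: every PE i loads its own operands (same instruction, different data).
    operands = [(a[i], b[i]) for i in range(n)]
    # Instruction j + 1: every PE i adds its operands and stores C(i).
    c = [x + y for (x, y) in operands]
    return c


if __name__ == "__main__":
    print(simd_add([1, 2, 3], [10, 20, 30]))
```

No PE starts instruction j + 1 before all PEs have finished instruction j, which is why no explicit synchronization appears in the code.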

5 Control structure of parallel platforms (2): MIMD (multiple instruction, multiple data). Each task Ti is serial code. Example mapping: Ti → Pi, with tasks T0 to T3 on processors P0 to P3, each processor with its own memory M on the interconnection network. Processors need not operate in lockstep; each processor can have its own clock. Synchronization overhead can be high.

6 Data exchange (communication) model (1): Shared address space. Variables are shared across the programs running on the various processors. Communication between processors uses shared variables. P0 to P3 all access one logical shared address space.

7 Data exchange (communication) model (2): Example. Read 1000 values and output the sum.
Host: read A(i), 0 ≤ i < 1000; Flag ← 0
P0:
    x ← 0
    for i = 0 to 499
        x ← x + A(i)
    end for
    if Flag = 0
        wait
    output x + y
P1:
    y ← 0
    for i = 500 to 999
        y ← y + A(i)
    end for
    Flag ← 1
A is shared; Flag is shared; x, y are shared. P0 and P1 reach the shared memory M through the interconnection network. Example programming models: Pthreads, OpenMP.
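A minimal sketch of the shared-address-space example, using Python threads in place of Pthreads or OpenMP. The function name `shared_memory_sum`, the `shared` dictionary, and the use of `threading.Event` to play the role of Flag are assumptions for illustration; the slide itself only gives the pseudocode.

```python
import threading


def shared_memory_sum(values):
    """Two threads sum the halves of a shared array; a flag signals that y is ready."""
    A = list(values)             # shared array, filled by the "host"
    half = len(A) // 2
    shared = {"x": 0, "y": 0}    # x and y live in the shared address space
    flag = threading.Event()     # stands in for Flag, initially 0 (not set)
    result = []

    def p0():
        x = 0
        for i in range(0, half):
            x += A[i]
        shared["x"] = x
        flag.wait()              # "if Flag = 0, wait"
        result.append(shared["x"] + shared["y"])

    def p1():
        y = 0
        for i in range(half, len(A)):
            y += A[i]
        shared["y"] = y          # write y before raising the flag
        flag.set()               # "Flag <- 1"

    threads = [threading.Thread(target=p0), threading.Thread(target=p1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    print(shared_memory_sum(range(1000)))
```

The only coordination is the flag: P1 must store y before setting it, and P0 must not read y until it is set.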

8 Data exchange (communication) model (3): Realizing a shared address space. Option 1: global memory + interconnection network, giving Uniform Memory Access. Option 2: distribute the shared address space across the processors, each processor P with its own memory M on the interconnection network, giving NUMA (Non-Uniform Memory Access).

9 Data exchange (communication) model (4): Message passing model. All data is local to each processor (P0, P1, ..., Pn-1 on an interconnection network); each processor holds its own Var x. Interaction is via explicit message passing.

10 Data exchange (communication) model (5): Example. Read 1000 values and output the sum.
P0:
    1. Read A(i), 0 ≤ i < 500
    2. x ← 0
    3. for i = 0 to 499
    4.     x ← x + A(i)
    5. end for
    6. Receive y from P1
    7. Output x + y
P1:
    1. Read A(i), 500 ≤ i < 1000
    2. x ← 0
    3. for i = 500 to 999
    4.     x ← x + A(i)
    5. end for
    6. Send x to P0
The x in P0 and the x in P1 are two different variables. P1 sends its message to P0 over the interconnection network. Data is explicitly partitioned and allocated to programs (processors). Communication and synchronization are explicit (P0 has to wait for P1 in Step 6). Example: MPI (Message Passing Interface).
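The message-passing version can be sketched as follows. This is not MPI: as a loudly stated assumption, two Python threads stand in for the two processors and a `queue.Queue` stands in for the message link from P1 to P0. Each worker touches only its own slice and its own local `x`; the names `message_passing_sum` and `channel` are hypothetical.

```python
import queue
import threading


def message_passing_sum(values):
    """Each worker sums its own partition; P1 sends its partial sum to P0."""
    data = list(values)
    half = len(data) // 2
    channel = queue.Queue()      # stand-in for the P1 -> P0 message link
    output = []

    def p0(local_a):
        x = 0                    # P0's private x
        for v in local_a:
            x += v
        y = channel.get()        # Step 6: blocks until P1's message arrives
        output.append(x + y)     # Step 7

    def p1(local_a):
        x = 0                    # a different variable from P0's x
        for v in local_a:
            x += v
        channel.put(x)           # Step 6: send x to P0

    workers = [
        threading.Thread(target=p0, args=(data[:half],)),
        threading.Thread(target=p1, args=(data[half:],)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return output[0]


if __name__ == "__main__":
    print(message_passing_sum(range(1000)))
```

Unlike the shared-variable version, there is no flag: the blocking receive provides both the data transfer and the synchronization.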

11 Interconnection Networks: data communication among processors and memory. Static (direct) networks: direct (fixed) connections among the processors, e.g. P0 to P3. Dynamic (indirect) networks: processors P0 to P3 connect to memories m0 to m3 through 0/1 switches.

12 Network Topologies. Examples: a static network of four processing elements (nodes); a dynamic network of four nodes connected via a network of switches to other nodes.

13 Bus-based Network: processors P0, P1, ..., Pn-1 and memory share one bus. Distance: O(1). Low cost. Scalable? Only one transaction at any time. Bandwidth = bus width (bits) × clock rate (independent of n). Traffic on the bus can be reduced by using a cache in each node.
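The bandwidth formula can be made concrete with a small calculation. The even split of bandwidth among n nodes is an assumption added here to illustrate the "Scalable?" question; the slide only states that total bandwidth is independent of n. Function names and the sample numbers are hypothetical.

```python
def bus_bandwidth_bits_per_s(bus_width_bits, clock_hz):
    """Total bus bandwidth = bus width (bits) x clock rate; does not depend on n."""
    return bus_width_bits * clock_hz


def per_node_share_bits_per_s(bus_width_bits, clock_hz, n):
    """Assumed even split: one transaction at a time, so n nodes share the total."""
    if n <= 0:
        raise ValueError("n must be positive")
    return bus_bandwidth_bits_per_s(bus_width_bits, clock_hz) / n


if __name__ == "__main__":
    # Hypothetical 64-bit bus clocked at 100 MHz, shared by 8 nodes.
    print(bus_bandwidth_bits_per_s(64, 100_000_000))
    print(per_node_share_bits_per_s(64, 100_000_000, 8))
```

Under that assumption the share per node shrinks as 1/n, which is why per-node caches that keep traffic off the bus matter.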

14 Crossbar Network: a $p \times p$ crossbar connects processors P0, P1, ..., Pi, ... to memories M0, M1, ..., Mi, ... with a switching element at each crosspoint. Cost ~ $O(p^2)$ (number of switches). To make a connection from Pi to Mj (a permutation on p items): Pi broadcasts j, and switch (i, j) closes the connection.
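As a sketch of the connection rule, the function below takes a permutation and returns which of the p × p switches are closed. The name `crossbar_switch_states` and the boolean-matrix representation are assumptions for illustration.

```python
def crossbar_switch_states(perm):
    """perm[i] = j means processor P_i connects to memory M_j.

    Returns a p x p matrix where entry [i][j] is True if switch (i, j) is closed.
    """
    p = len(perm)
    if sorted(perm) != list(range(p)):
        raise ValueError("perm must be a permutation of 0..p-1")
    closed = [[False] * p for _ in range(p)]   # p * p switching elements in total
    for i, j in enumerate(perm):
        closed[i][j] = True                    # P_i broadcasts j; switch (i, j) closes
    return closed


if __name__ == "__main__":
    for row in crossbar_switch_states([2, 0, 1]):
        print(row)
```

Exactly one switch closes per row and per column, so any permutation is realized without conflicts, at the price of p² switches.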

15 Shuffle Network: perfect shuffle (PS) connection. A link exists between input i and output j if
$$ j = \begin{cases} 2i, & 0 \le i < p/2 \\ 2i + 1 - p, & p/2 \le i < p \end{cases} $$
Equivalently, j is the left rotation (circular left shift) of the binary representation of i, where p is a power of 2. For p = 8:
    000 = left_rotate(000)
    001 = left_rotate(100)
    010 = left_rotate(001)
    011 = left_rotate(101)
    100 = left_rotate(010)
    101 = left_rotate(110)
    110 = left_rotate(011)
    111 = left_rotate(111)
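The two descriptions of the perfect shuffle (the arithmetic rule and the bit rotation) can be checked against each other with a short sketch. The function names are chosen here for illustration.

```python
def perfect_shuffle(i, p):
    """Output j linked to input i: 2i for i < p/2, else 2i + 1 - p."""
    if p < 2 or p & (p - 1):
        raise ValueError("p must be a power of 2")
    if not 0 <= i < p:
        raise ValueError("i must satisfy 0 <= i < p")
    return 2 * i if i < p // 2 else 2 * i + 1 - p


def left_rotate(i, p):
    """Circular left shift of the log2(p)-bit binary representation of i."""
    k = p.bit_length() - 1                  # number of index bits
    return ((i << 1) | (i >> (k - 1))) & (p - 1)


if __name__ == "__main__":
    for i in range(8):
        print(format(i, "03b"), "->", format(perfect_shuffle(i, 8), "03b"))
```

For p = 8 both functions map inputs 0 to 7 onto outputs 0, 2, 4, 6, 1, 3, 5, 7.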

16 Shuffle Exchange Network (shuffle shown for n = 8).
Shuffle: $i \to 2i \bmod n$ for $0 \le i \le n/2 - 1$, and $i \to (2i + 1) \bmod n$ for $n/2 \le i \le n - 1$.
Exchange: $2i \leftrightarrow 2i + 1$.

17 Example: n = 8, 3-bit index. Shuffle connection: circular left shift (e.g. 4 → 1). Exchange connection: $2i \leftrightarrow 2i + 1$ (complement the LSB). Diameter (discussed later): $O(\log n)$, where $n = 2^k$.

18 Routing in Shuffle Exchange Network (1)
Source $x = x_{k-1} \dots x_0$, destination $d = d_{k-1} \dots d_0$.
    y ← x {current location}
    i ← 1
    while i ≤ k
        Shuffle y {rotate left}
        Compare LSB of y with bit (k - i) of destination d
        If the bits are the same, do not Exchange; else Exchange {complement y_0}
        i ← i + 1
    end
Total # of hops ≤ 2k (= $2\log_2 n$). The route runs from source x through intermediate nodes to destination d.
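The routing loop translates almost line for line into Python. This is a sketch: the function name `route_shuffle_exchange` and the choice to return the full list of visited nodes are assumptions for illustration.

```python
def route_shuffle_exchange(x, d, k):
    """Route from node x to node d in a shuffle-exchange network with n = 2**k nodes.

    Returns the list of nodes visited, starting at x and ending at d.
    """
    n = 1 << k
    if not (0 <= x < n and 0 <= d < n):
        raise ValueError("x and d must be k-bit node indices")
    y = x                                            # current location
    path = [y]
    for i in range(1, k + 1):
        y = ((y << 1) | (y >> (k - 1))) & (n - 1)    # shuffle: rotate left
        path.append(y)
        if (y & 1) != ((d >> (k - i)) & 1):          # compare LSB with bit (k - i) of d
            y ^= 1                                   # exchange: complement y_0
            path.append(y)
    return path


if __name__ == "__main__":
    print([format(v, "03b") for v in route_shuffle_exchange(0b000, 0b110, 3)])
```

Each iteration contributes one shuffle hop and at most one exchange hop, so the path has at most 2k hops. For source 000 and destination 110 the visited nodes are 000, 000, 001, 010, 011, 110.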

19 Routing in Shuffle Exchange Network (2): Example. Source $x_2 x_1 x_0$ = 000, destination $d_2 d_1 d_0$ = 110, k = 3.
For i = 1: Shuffle, then compare the LSB of y with bit 2 of the destination: is $y_0 = d_2$? This is the same as asking whether $x_2 = d_2$.
    i = 1: S 000 → 000, E → 001
    i = 2: S 001 → 010, E → 011
    i = 3: S 011 → 110, no E
Position at the end of the first iteration: 001. At the end of the i-th iteration: $y = x_{k-1-i} \dots x_0 \, d_{k-1} \dots d_{k-i}$.

20 Routing in Shuffle Exchange Network (3): source $x_2 x_1 x_0$ = 000, destination $d_2 d_1 d_0$ = 110, k = 3.

21 Routing in Shuffle Exchange Network (4). Theorem: in a shuffle exchange network with $n = 2^k$ nodes, data from any source to any destination can be routed in at most $2\log_2 n$ steps.

22 Multistage Network (1): can realize a rich set of connecting patterns (connections) from input to output.

23 Multistage Network (2): Connecting pattern. n inputs / n outputs. For each input i, 0 ≤ i < n, the pattern gives the output j it is to be routed to (or to which its data is to be routed); such a pattern is a permutation. Given n inputs and n outputs, the total # of connection patterns = n!.

24 Multistage Network (3): Cost of a network. Building blocks: a 2 × 1 MUX with one control bit, and a 2 × 2 switch. Cost of a network: total number of 2 × 2 switches. Note: wiring cost is ignored.

25 Multistage Network (4): Crossbar switch. Example implementation using MUXes: input ports 0 to n - 1, output ports 0 to n - 1, one MUX per output port, and $n \log n$ control bits. All n! permutations can be realized. Cost: # of logic gates ∝ # of 2 × 1 MUXes = $O(n \cdot [n - 1]) = O(n^2)$, which is expensive.

26 Routing (1): Example of a multistage network with 3 stages. No. of stages ∝ delay. No. of switches = $\frac{n}{2} \times$ (number of stages). k combinations. All 4! permutations can be realized.

27 Routing (2): k-stage, n-input network. Total number of switches: $\frac{nk}{2}$. Total number of control bits: $\frac{nk}{2}$. Control bits specify a configuration of the network; each configuration gives a permutation from input to output. Total number of permutations that can be realized: $2^{nk/2}$. If we want all n! permutations to be realized: $2^{nk/2} \ge n!$, so k = no. of stages = $\Omega(\log n)$.
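The counting argument can be evaluated numerically. The sketch below finds the smallest k that satisfies $2^{nk/2} \ge n!$; the function name `min_stages` is hypothetical. Note that this is only the necessary condition from the slide: having enough configurations does not by itself guarantee that a particular wiring realizes every permutation.

```python
import math


def min_stages(n):
    """Smallest k with 2**(n*k/2) >= n!, for an even number of inputs n."""
    if n < 2 or n % 2:
        raise ValueError("n must be an even integer >= 2")
    target = math.factorial(n)
    k = 1
    # n is even, so n * k // 2 is the exact number of switches (control bits).
    while 2 ** (n * k // 2) < target:
        k += 1
    return k


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, min_stages(n))
```

For n = 4 the bound gives k = 3, since $2^4 = 16 < 24 \le 2^6 = 64$.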

28 Routing (3): Multistage network with k stages (Stage 0 to Stage k - 1) and inputs 0 to n - 1. Each stage is a permutation followed by $\frac{n}{2}$ switches, with an $\frac{n}{2}$-bit control input. Delay = k.

29 Omega Network (1): p inputs, p outputs. $\log_2 p$ stages, each stage having $\frac{p}{2}$ switches. Each stage consists of a shuffle followed by exchange switches.

30 Omega Network (2): Omega network properties. Multistage network. Cost ~ $\frac{p}{2}\log_2 p$ (number of switches). Note: in actual hardware design, routing cost dominates! An Omega network can do $2^{(p/2)\log_2 p} < p!$ permutations, so not all p! permutations can be realized. There is a unique (only one) path from any input to any output.

31 Omega Network (3): Example of blocking. One of the messages (010 to 111 or 110 to 100) is blocked at link AB.
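The blocking example can be reproduced with a small simulation for p = 8. Two assumptions are made here and are not stated on the slide: messages use destination-tag routing (at stage s the switch output is chosen by bit k - 1 - s of the destination), and a link is identified by the pair (stage, output position). Whether the conflicting link found below is the one labelled AB in the slide's figure is also an assumption. Function names are hypothetical.

```python
def omega_links(src, dst, k):
    """Links used by one message, as (stage, output position) after each stage."""
    n = 1 << k
    y = src
    links = []
    for stage in range(k):
        y = ((y << 1) | (y >> (k - 1))) & (n - 1)   # perfect shuffle into the stage
        bit = (dst >> (k - 1 - stage)) & 1          # destination-tag routing (assumed)
        y = (y & ~1) | bit                          # switch picks upper or lower output
        links.append((stage, y))
    return links


def blocked_links(messages, k):
    """Links requested by more than one of the given (src, dst) messages."""
    owner = {}
    conflicts = []
    for msg in messages:
        for link in omega_links(msg[0], msg[1], k):
            if link in owner and owner[link] != msg:
                conflicts.append(link)
            owner.setdefault(link, msg)
    return conflicts


if __name__ == "__main__":
    print(blocked_links([(0b010, 0b111), (0b110, 0b100)], 3))
```

Both messages need output 101 of the first stage, so only one of them can proceed. This follows from the unique-path property: neither message has an alternative route.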

32 Congestion in a Network (1): given a routing protocol and a data communication pattern (e.g. a permutation), congestion = max {# of paths passing through a node}.

33 Congestion in a Network (2): Interconnection network = graph + routing algorithm. Assume the routing algorithm provides a unique (exactly one) path from i to j for all i, j. For a given permutation: congestion at node k = # of paths that pass through k; congestion in the network = max over nodes k of {# of paths that pass through k}. Worst case: max over all permutations of {congestion in the network}.
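A minimal sketch of the congestion definition, given the unique path chosen by the routing algorithm for each source and destination pair. The function name `network_congestion` is hypothetical, and counting a path's endpoints as nodes it passes through is an assumption; the slide does not say whether endpoints are included.

```python
from collections import Counter


def network_congestion(paths):
    """paths: one node sequence per (source, destination) pair of the permutation.

    Returns the maximum, over all nodes, of the number of paths through that node.
    """
    load = Counter()
    for path in paths:
        load.update(set(path))      # each path counts once per node it visits
    return max(load.values()) if load else 0


if __name__ == "__main__":
    # Hypothetical routes: two of the three paths share node 1.
    print(network_congestion([[0, 1, 2], [3, 1, 4], [5, 6]]))
```

Taking the maximum of this value over all permutations gives the worst-case congestion of the network under that routing algorithm.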

34 Summary: control structure; programming models (shared memory, message passing); interconnection networks (shuffle exchange network, multistage network).
