EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation
Lecture #11, 2/21/2017
Xuehai Qian, Xuehai.qian@usc.edu, http://alchem.usc.edu/portal/xuehaiq.html
University of Southern California

Outline
Midterm 1: 2/26, in class, 2-3:20 pm

Example memory system performance (1)
Memory: latency = 10 ns; (peak) bandwidth = 64 bits at 1 GHz (64 Gbits/s or 8 GB/s, bus frequency)
Processor: 2 GHz, 1 word = 64 bits (8 bytes), unit of data access = 1 word, cycle time = 0.5 ns (processor cycle)
Assume a 2-issue superscalar, single-cycle processor with 2 double-precision multiply-add FPUs (2 multipliers, 2 adders)
(Figure: memory M connected to processor P by a 64-bit bus)
Peak performance of the processor = 2 GHz (clock rate) x 4 FP ops/cycle = 8 GFlops/s (raw compute power)
Peak performance can also be computed as 2 GHz (clock rate) x 2 pipelines = 4 GFlops/s

Example memory system performance (2)
Example: inner product a . b = sum_{i=1..n} a_i b_i (data in external memory)
2 data fetches for each multiply, add
Processor can do 8 mult or add per ns = 4 FP ops/cycle
Note: issue bandwidth limits performance to 2 FP ops/cycle
(Figure: timeline in ns of Read, Read, Mul, Add operations when the data is streamed)
In the best case, i.e., data is streamed (no pipeline stalls):
Sustained performance (possible best case) = 2 ops every 4 processor cycles = 1 GFlops/s
Processor-memory bandwidth determines the possible best-case sustained performance

Example memory system performance (3)
(Figure: M connected to P by a 64-bit bus) Latency = 10 ns; (peak) bandwidth = 64 bits at 1 GHz (64 Gbits/s or 8 GB/s)
Repeat n times:
1. read a_i (10 ns)
2. read b_i (10 ns)
3. multiply (use register data)
4. add (use register data)
Overlap instructions 3, 4 with instructions 1, 2 (pipelining)
Sustained performance (possible worst case) = total # of FP ops / total time = 2n / (20n x 10^-9 s) = 0.1 GFlops/s
Memory access time dominates

Example memory system performance (4)
System: memory with 10 ns latency and a 64-bit, 1 GHz bus; 2 GHz, 2-issue processor with 2 FPUs; workload: inner product
Processor peak = 8 GFlops/s (raw compute power)
Processor peak = 4 GFlops/s (processor organization)
(Best case) sustained = 1 GFlops/s (memory bandwidth)
(Worst case) sustained = 0.1 GFlops/s (memory latency)
Memory-bound computation (a code sketch follows)
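To make the memory-bound behavior concrete, here is a minimal C sketch of the inner product (the array names a and b, the size n, and the demo values in main are assumptions for illustration, not part of the slides): each iteration needs two loads but only two floating-point operations, so sustained performance is limited by how fast memory can stream the operands.

```c
#include <stdio.h>

/* Inner product: each iteration needs two memory reads (a[i], b[i])
 * but only two floating-point operations (one multiply, one add),
 * so throughput is bounded by processor-memory bandwidth, not peak FLOPS. */
double inner_product(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];   /* 2 loads, 1 mul, 1 add per iteration */
    }
    return sum;
}

int main(void) {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b[4] = {4.0, 3.0, 2.0, 1.0};
    printf("a . b = %f\n", inner_product(a, b, 4));  /* prints 20.000000 */
    return 0;
}
```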

Cache (1)
Improving effective memory latency
(Figure: DRAM M, 4 GB, 10 ns, 1 GHz, connected over a 64-bit bus to a cache C and processor P, 2 GHz, 4-way superscalar, execution pipelines)
Why it works: data reuse
Accessing DRAM is expensive (DRAM latency); repeatedly use the data in the cache (fast)
Hit ratio: fraction of the memory references served by the cache
Cache: small (16 MB), fast (2 GHz)

Cache (3): Locality of references
Spatial locality: if location i is referenced at time t, then locations near i are referenced in a small window of time following t
Temporal locality: in a small window of time, repeated references to a small set of data items
(Figure: memory locations i-1, i, i+1 referenced around times t1, t2 near t; a small set of data items referenced repeatedly over time)

Cache (4)
Example: bubble sort
for i from 1 to N
    for j from 0 to N-2
        R1 <- a[j]; R2 <- a[j+1]
        if R1 > R2 then swap(R1, R2)
        a[j] <- R1; a[j+1] <- R2
    end for
end for
Assumptions: direct mapped; first-in-first-out policy; write-through policy; N = 8; cache line size = 1 data element; cache size = 2 data elements
i = 1: Read a[0], a[1]: miss, miss; Read a[1], a[2]: hit, miss; Read a[2], a[3]: hit, miss; ...; Read a[6], a[7]: hit, miss
i = 2: Read a[0], a[1]: miss, miss; Read a[1], a[2]: hit, miss; ...
Cache hit ratio = read hit ratio = # of times data is found in cache / total # of accesses = (N-2) / (2(N-1)) = 6/14 = 0.43
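A runnable C version of the bubble-sort pass analyzed above may help; it mirrors the slide's inner loop, reading a[j] and a[j+1] into temporaries that play the role of R1 and R2 (N = 8 and the demo values are assumptions for illustration; this is a sketch of the access pattern, not a tuned sort).

```c
#include <stdio.h>

#define N 8

/* Bubble sort written to mirror the slide's register-based inner loop:
 * each pass reads the pair a[j], a[j+1]; with a 2-element cache the
 * second element of one pair is re-read as the first element of the next,
 * giving the (N-2)/(2(N-1)) read-hit ratio derived above. */
void bubble_sort(int a[], int n) {
    for (int i = 1; i <= n; i++) {          /* N passes, as in the slide */
        for (int j = 0; j <= n - 2; j++) {
            int r1 = a[j];                  /* R1 <- a[j]   (read)  */
            int r2 = a[j + 1];              /* R2 <- a[j+1] (read)  */
            if (r1 > r2) { int t = r1; r1 = r2; r2 = t; }  /* swap */
            a[j] = r1;                      /* a[j]   <- R1 (write) */
            a[j + 1] = r2;                  /* a[j+1] <- R2 (write) */
        }
    }
}

int main(void) {
    int a[N] = {7, 3, 5, 1, 8, 2, 6, 4};
    bubble_sort(a, N);
    for (int i = 0; i < N; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```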

Data layout and data access pattern (1)
Storing a 2-dimensional array in memory
Row-major order: A(i, j) -> Memory(i*n + j), 0 <= i, j < n
Column-major order: A(i, j) -> Memory(i + n*j), 0 <= i, j < n
(Figure: an n x n array mapped to linear memory in each order)
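Because C uses row-major storage, the loop order determines whether consecutive accesses are adjacent in memory. The sketch below (the array name A and size N are illustrative assumptions) contrasts the two traversal orders of the same row-major array.

```c
#include <stdio.h>

#define N 1024

double A[N][N];   /* C stores A in row-major order: &A[i][j] == &A[0][0] + i*N + j */

/* Row-order traversal: consecutive accesses are adjacent in memory (stride 1),
 * exploiting spatial locality. */
double sum_row_order(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += A[i][j];
    return s;
}

/* Column-order traversal of the same row-major array: consecutive accesses are
 * N elements apart, so each access may touch a different cache line. */
double sum_col_order(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += A[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = 1.0;
    printf("%.0f %.0f\n", sum_row_order(), sum_col_order());  /* both print 1048576 */
    return 0;
}
```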

Shuffle Network
Perfect shuffle (PS) connection: a link exists between input i and output j if
    j = 2i,          0 <= i < p/2
    j = 2i + 1 - p,  p/2 <= i < p
Equivalently, j = left rotation (circular left shift) of the binary representation of i (p = power of 2)
Example for p = 8 (output = left_rotate(input)):
000 = left_rotate(000)    001 = left_rotate(100)    010 = left_rotate(001)    011 = left_rotate(101)
100 = left_rotate(010)    101 = left_rotate(110)    110 = left_rotate(011)    111 = left_rotate(111)
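A small C helper, assumed here only for illustration, computes the perfect-shuffle output by circularly left-shifting the k-bit binary representation of the input index, which is equivalent to the j = 2i / j = 2i + 1 - p rule above.

```c
#include <stdio.h>

/* Perfect shuffle on p = 2^k inputs: output index = circular left shift
 * of the k-bit binary representation of the input index i.
 * Equivalently: j = 2i for i < p/2, and j = 2i + 1 - p for i >= p/2. */
unsigned perfect_shuffle(unsigned i, unsigned k) {
    unsigned p = 1u << k;
    return ((i << 1) | (i >> (k - 1))) & (p - 1);  /* rotate left by 1 within k bits */
}

int main(void) {
    unsigned k = 3;                       /* p = 8, as in the slide */
    for (unsigned i = 0; i < (1u << k); i++)
        printf("input %u -> output %u\n", i, perfect_shuffle(i, k));
    return 0;
}
```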

Shuffle Exchange Network (shuffle for n = 8)
Shuffle: i -> 2i mod n for 0 <= i <= n/2 - 1; i -> (2i + 1) mod n for n/2 <= i <= n - 1
Exchange: 2i <-> 2i + 1 for 0 <= i <= n/2 - 1
(Figure: shuffle connections for n = 8)

Example: n = 8, 3-bit index
Shuffle connection = circular left shift: 100 -> 001 (node 4 -> node 1)
Exchange connection: 2i <-> 2i + 1 (complement the LSB)
Diameter (discussed later): O(log n) for n = 2^k

Routing in Shuffle Exchange Network (1)
Source x = x_{k-1} ... x_0, destination d = d_{k-1} ... d_0
y <- x {current location}; i <- 1
While i <= k:
    Shuffle y {rotate left}
    Compare the LSB of y with bit (k - i) of the destination d
    If the bits are the same, do not Exchange; else Exchange {complement y_0}
    i <- i + 1
End
Total # of hops <= 2k (= 2 log2 n), passing through intermediate nodes from source x to destination d

Routing in Shuffle Exchange Network (2)
Example: source x = x2 x1 x0 = 000, destination d = d2 d1 d0 = 110, k = 3
i = 1: Shuffle 000 -> 000; compare the LSB of y with bit 2 of the destination (y0 = d2? same as x2 = d2?); the bits differ, so Exchange -> 001
Position at the end of the first iteration: 001
i = 2: Shuffle 001 -> 010; y0 differs from d1, so Exchange -> 011
i = 3: Shuffle 011 -> 110; y0 = d0, so no Exchange; destination 110 reached
Route: 000 -S-> 000 -E-> 001 -S-> 010 -E-> 011 -S-> 110 (no E)
At the end of the i-th iteration: y = x_{k-1-i} ... x_0 d_{k-1} ... d_{k-i}

Routing in Shuffle Exchange Network (3)
(Figure: the route from source 000 to destination 110, k = 3, drawn on the shuffle-exchange network)

Routing in Shuffle Exchange Network (4)
Theorem: In a shuffle exchange network with n = 2^k nodes, data from any source to any destination can be routed in at most 2 log2 n steps.
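The routing procedure of the last few slides can be sketched in C as follows (the function name, the printing, and the n = 2^k encoding are assumptions for this demo): each of the k iterations performs one Shuffle hop and, when the new LSB disagrees with destination bit k - i, one Exchange hop, so at most 2k = 2 log2 n hops are used.

```c
#include <stdio.h>

/* Route from src to dst in a shuffle-exchange network with n = 2^k nodes.
 * Each iteration: one Shuffle hop (rotate left), then optionally one
 * Exchange hop (complement the LSB) if the LSB does not yet match the
 * destination bit being fixed. */
void route_shuffle_exchange(unsigned src, unsigned dst, unsigned k) {
    unsigned mask = (1u << k) - 1;
    unsigned y = src;
    printf("start at %u\n", y);
    for (unsigned i = 1; i <= k; i++) {
        y = ((y << 1) | (y >> (k - 1))) & mask;     /* Shuffle: rotate left  */
        printf("  shuffle  -> %u\n", y);
        unsigned dbit = (dst >> (k - i)) & 1u;      /* destination bit k - i */
        if ((y & 1u) != dbit) {
            y ^= 1u;                                /* Exchange: flip LSB    */
            printf("  exchange -> %u\n", y);
        }
    }
    printf("arrived at %u (destination %u)\n", y, dst);
}

int main(void) {
    route_shuffle_exchange(0u /* 000 */, 6u /* 110 */, 3);  /* slide example */
    return 0;
}
```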

Omega Network (1)
p inputs, p outputs; log2 p stages, each stage having p/2 2x2 switches
Each stage consists of a perfect-shuffle interconnection followed by a column of 2x2 switches; each switch either passes its two inputs straight through or exchanges them
(Figure: 8x8 omega network, inputs and outputs labeled 000-111)

Omega Network (2)
Omega network properties:
Multistage network
Cost ~ (p/2) log2 p (number of switches); note: in actual hardware design, routing cost dominates!
The omega network can realize at most 2^((p/2) log2 p) < p! permutations; all p! permutations cannot be realized
Unique (only one) path from any input to any output

Omega Network (3)
Example of blocking: one of the two messages (010 to 111 or 110 to 100) is blocked at link AB
(Figure: 8x8 omega network in which both paths contend for the same link A-B)

Congestion in a Network (1)
Given a routing protocol and a data communication pattern (e.g., a permutation):
Congestion = max over nodes { # of paths passing through the node }
(Figure: example with nodes 0-4 in which two paths pass through one node, so congestion = 2)

Congestion in a Network (2)
Interconnection network = graph + routing algorithm
Assume the routing algorithm provides a unique path (exactly one) from i to j, for all i, j
For a given permutation:
Congestion at node k = # of paths that pass through k
Congestion in the network = max over nodes k { # of paths that pass through k }
Worst-case congestion = max over all permutations { congestion in the network }

CLOS network (2)
Structure of the CLOS network
(Figure: an n x n CLOS network with a control unit that sets the switches)

CLOS network (3)
Stage i to stage i + 1 connections: any box connects to all boxes in the next stage
3-stage network; number of switches (boxes) = 3*sqrt(n), each a sqrt(n) x sqrt(n) crossbar
Cost of an n x n crossbar = O(n^2); total cost of the CLOS network = O(sqrt(n) * n) = O(n^(3/2))
Note: the CLOS network can realize all n! permutations

Butterfly network (1)
n = 2^k for some k
(Figure: 8-input butterfly network, stages 0 through 3)
log2 n + 1 stages
Power-of-2 connections: 2^l, l = 0, 1, ..., log2 n - 1
In stage l, 0 <= l < log2 n, node i connects to:
    node i in stage l + 1
    node (i with bit l complemented) in stage l + 1

Butterfly network (2)
n-input butterfly network
Total number of nodes = n (log2 n + 1)   [n nodes per stage x # of stages]
Total number of edges = n x 2 x log2 n   [2 edges per node x n nodes x log2 n stages]
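A small illustrative C helper (names assumed) enumerates the two stage-(l+1) neighbors of a butterfly node, following the rule on the previous slide: a straight edge to row i and a cross edge to row i with bit l complemented.

```c
#include <stdio.h>

/* In an n-input butterfly (n = 2^k), node (i, l) with 0 <= l < log2(n)
 * connects to two nodes in stage l + 1:
 *   (i, l+1)           -- straight edge
 *   (i XOR 2^l, l+1)   -- cross edge (complement bit l of i)            */
void butterfly_neighbors(unsigned i, unsigned l) {
    printf("node (%u, stage %u) -> (%u, stage %u) and (%u, stage %u)\n",
           i, l, i, l + 1, i ^ (1u << l), l + 1);
}

int main(void) {
    /* 8-input butterfly: edges leaving row 5 at stages 0, 1, 2 */
    for (unsigned l = 0; l < 3; l++)
        butterfly_neighbors(5, l);
    return 0;
}
```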

Mesh-connected Network
1-D mesh (without wraparound): nodes 0, 1, ..., p-1 in a line
1-D torus (with wraparound): a ring of p nodes
2-D mesh (without wraparound): a sqrt(p) x sqrt(p) grid
k-dimensional mesh: p^(1/k) x p^(1/k) x ... x p^(1/k) (k times)
Number of connections per node: 2k

Hypercube Network (1)
Construction of a hypercube from hypercubes of lower dimension (existing edges plus added edges):
0-D hypercube: a single node; 1-D hypercube: nodes 0, 1; 2-D hypercube: nodes 00, 01, 10, 11; 3-D hypercube: nodes 000-111; 4-D hypercube: nodes 0000-1111
(Figure: each higher-dimensional hypercube is formed from two copies of the next lower-dimensional hypercube by adding edges between corresponding nodes)

Hypercube Network (2)
In general, for a k-dimensional hypercube:
p = total number of nodes = 2^k
Node i = i_{k-1} i_{k-2} ... i_0
k connections per node: complement one bit of i
Example: node 000 connects to 001, 010, 100
(Figure: 3-D hypercube with nodes 000-111)
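A C sketch (illustrative only) that lists the k neighbors of a hypercube node by complementing each bit of its label, matching the example above (000 connects to 001, 010, 100).

```c
#include <stdio.h>

/* In a k-dimensional hypercube (p = 2^k nodes), node i is connected to the
 * k nodes obtained by complementing exactly one bit of i. */
void hypercube_neighbors(unsigned i, unsigned k) {
    printf("neighbors of node %u:", i);
    for (unsigned b = 0; b < k; b++)
        printf(" %u", i ^ (1u << b));   /* complement bit b */
    printf("\n");
}

int main(void) {
    hypercube_neighbors(0u, 3);   /* 3-D hypercube: prints 1 2 4 (001, 010, 100) */
    return 0;
}
```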

Tree-based Network (1)
(Figure: static tree network, in which every node is a processing node; dynamic tree network, in which internal nodes are switching nodes and leaves are processing nodes)
Height: log(p + 1) - 1, where p = total number of nodes

Tree-based Network (2) Fat tree (16 processing nodes) 31

Performance metrics (1): Diameter
Diameter: maximum distance between any two processing nodes in the network
diameter = max over (i, j) { distance(i, j) }
distance(i, j) = length of the shortest path between i and j
length of a path = # of edges in the path

Performance metrics (6)
Example: bisection width of a 2-D mesh (no wraparound) = sqrt(p)
(Figure: a 2-partition of the sqrt(p) x sqrt(p) mesh; cutting it in half severs sqrt(p) links)

Performance metrics (8)
Cost of a static network = number of communication links in the network
Examples:
Tree: p - 1
1-D mesh (no wraparound): p - 1
d-dimensional wraparound mesh: dp
Hypercube: (p log2 p) / 2
k-ary d-cube: a d-dimensional array with k elements in each dimension; number of nodes p = k^d; cost: dp

Summary

Network                              Diameter            Bisection width   Cost (# of links)
Completely connected                 1                   p^2/4             p(p-1)/2
Star                                 2                   1                 p-1
Complete binary tree                 2 log((p+1)/2)      1                 p-1
1-D mesh, no wraparound              p-1                 1                 p-1
2-D mesh, no wraparound              2(sqrt(p)-1)        sqrt(p)           2(p-sqrt(p))
2-D wraparound mesh                  2*floor(sqrt(p)/2)  2*sqrt(p)         2p
Hypercube                            log p               p/2               (p log p)/2
Wraparound k-ary d-cube (p = k^d)    d*floor(k/2)        2k^(d-1)          dp

Shared Address Space Programming (1)
All threads have access to the same global, shared memory
Threads can also have their own private data
The programmer is responsible for synchronizing (protecting) access to globally shared data, to ensure correctness of the program
(Figure: several threads, each with private data, attached to a shared memory)

Shared Address Space Programming (5)
Example 1: matrix multiplication with one thread per output element; A, B, C are in shared memory
Thread 1: for k from 1 to n: C(1, 1) = C(1, 1) + A(1, k) * B(k, 1)
Thread i*n + j (the thread responsible for C(i, j)): for k from 1 to n: C(i, j) = C(i, j) + A(i, k) * B(k, j)

Shared Address Space Programming (7)
Example 2: matrix multiplication with 4 threads; C is partitioned into 4 blocks (1: top-left, 2: top-right, 3: bottom-left, 4: bottom-right); A, B, C are in shared memory
Thread 1: for i from 1 to n/2, for j from 1 to n/2:         C(i, j) = A(i, :) * B(:, j)
Thread 2: for i from 1 to n/2, for j from n/2 + 1 to n:     C(i, j) = A(i, :) * B(:, j)
Thread 3: for i from n/2 + 1 to n, for j from 1 to n/2:     C(i, j) = A(i, :) * B(:, j)
Thread 4: for i from n/2 + 1 to n, for j from n/2 + 1 to n: C(i, j) = A(i, :) * B(:, j)
Input data is shared; no interaction among threads (see the code sketch below)
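A minimal Pthreads sketch of Example 2, under the assumption of a small matrix size N and helper names chosen for this demo: A, B, and C live in shared memory, and each of the 4 threads writes only its own block of C, so no locking is needed. Compile with something like gcc -pthread.

```c
#include <pthread.h>
#include <stdio.h>

#define N 4                       /* small n, divisible by 2, for illustration */

double A[N][N], B[N][N], C[N][N]; /* shared among all threads */

/* Each thread computes one N/2 x N/2 block of C = A * B. */
typedef struct { int row0, col0; } block_t;

void *mm_block(void *arg) {
    block_t *blk = (block_t *)arg;
    for (int i = blk->row0; i < blk->row0 + N / 2; i++)
        for (int j = blk->col0; j < blk->col0 + N / 2; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i][k] * B[k][j];   /* reads shared A, B; writes only its own C(i, j) */
            C[i][j] = s;
        }
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }

    pthread_t t[4];
    block_t blk[4] = { {0, 0}, {0, N / 2}, {N / 2, 0}, {N / 2, N / 2} };
    for (int b = 0; b < 4; b++) pthread_create(&t[b], NULL, mm_block, &blk[b]);
    for (int b = 0; b < 4; b++) pthread_join(t[b], NULL);

    printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], (double)N);
    return 0;
}
```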

Synchronization method: Barrier (1)
Barrier objects can be created at certain places in the program
Any thread which reaches the barrier stops until all the threads have reached the barrier
(Figure: threads T1, T2, T3, T4 each run until they reach the barrier and wait; all proceed together afterwards)

Shared Variable Access (1)
Threads acquire a lock to modify a shared variable and release the lock when done
Only 1 thread can acquire a lock at any time
(Figure, example execution sequence: T1 acquires the lock successfully and accesses the shared variable while T2 and T3 attempt the lock, fail, and wait; after T1 releases the lock, a waiting thread acquires it, accesses the shared variable, and releases it)

Shared Variable Access (2)
Example: find the max between two threads; each thread has a local value (i in thread 1, j in thread 2)
Initialize Max to 0 (Max is a shared variable)
Thread 1: Acquire_lock(Max); if (i > Max) Max = i; Release_lock(Max)
Thread 2: Acquire_lock(Max); if (j > Max) Max = j; Release_lock(Max)
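A Pthreads sketch of the lock-protected max update (the local values i = 17 and j = 42 and the two-thread setup are assumptions for the demo): pthread_mutex_lock and pthread_mutex_unlock play the roles of Acquire_lock(Max) and Release_lock(Max).

```c
#include <pthread.h>
#include <stdio.h>

int Max = 0;                                        /* shared variable     */
pthread_mutex_t max_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread holds a local value and updates the shared Max under a lock,
 * mirroring Acquire_lock(Max) / Release_lock(Max) on the slide. */
void *update_max(void *arg) {
    int local = *(int *)arg;
    pthread_mutex_lock(&max_lock);                  /* Acquire_lock(Max)   */
    if (local > Max)
        Max = local;
    pthread_mutex_unlock(&max_lock);                /* Release_lock(Max)   */
    return NULL;
}

int main(void) {
    int i = 17, j = 42;                             /* local values of the two threads */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, update_max, &i);
    pthread_create(&t2, NULL, update_max, &j);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Max = %d\n", Max);                      /* prints 42           */
    return 0;
}
```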

Correct Parallel Program
(Figure: data and a parallel program are fed to a parallel platform, which produces output)
For all data inputs and for all execution sequences, the correct output is produced

A Simple Model of Shared Address Space Parallel Machine (PRAM) (1)
1 unit of time = local access = shared memory access
Synchronous model; parallel time = total number of cycles
Pthreads programming model? Asynchronous shared memory?
(Figure: p processors 0 ... p-1 connected to a shared memory)

PRAM (2): Random Access Machine (RAM)
Random access: access to any memory location (direct access)
Time: access to memory = 1 unit of time; arithmetic/logic operation = 1 unit of time
Serial time complexity T_s(n); example: merge sort, T_s(n) = O(n log n)
(Figure: a processor connected to a memory)

PRAM (5)
PRAM is a synchronous model: a single clock drives processors 0 ... p-1 and the shared memory
Each processor is a RAM running a local program acting on data in the shared memory; 1 unit of time per instruction
Synchronous execution: for all i, the i-th instruction in the execution sequence is executed in the i-th cycle by all the processors

Adding on PRAM (1)
Simple shared-memory algorithm for adding n numbers
Output = sum_{i=0..n-1} A(i), left in A(0)
(Figure: array A(0) ... A(n-1) in shared memory)

Adding on PRAM (2)
Key idea, for n = 8 with initial values A = (10, 11, 12, 13, 14, 15, 16, 17):
End of iteration i = 0: A = (21, 11, 25, 13, 29, 15, 33, 17); active processors = {0, 2, 4, 6}
End of iteration i = 1: A(0) = 46, A(4) = 62; active processors = {0, 4}
End of iteration i = 2: A(0) = 108; active processors = {0}
(Equivalently, the active processors can be indexed as {0, 1, 2, 3}, {0, 1}, {0} with strides 2, 4, 8.)

Adding on PRAM (3)
Example: n = 8
(Figure: processors j = 0, 2, 4, 6 active in iteration i = 0, fewer in iterations 1 and 2, shown over time)
Total # of additions performed = n/2 + n/4 + ... + 1 = n - 1
= total # of additions performed by a serial program

Adding on PRAM (4)
Algorithm: program in processor j, 0 <= j <= n - 1
1. Do i = 0 to log2 n - 1
2.     If j = k * 2^(i+1) for some k in N, then A(j) <- A(j) + A(j + 2^i)
3. End
Note: A (= A(0) ... A(n-1)) is shared among all the processors
Synchronous operation (for example, all the processors execute instruction 2 during the same cycle); log2 n iterations
N = set of natural numbers = {0, 1, ...}
Parallel time = O(log n)

Addition on PRAM, p < n processors
1. Each processor adds within its block of size n/p
2. Apply Algorithm 1 to the p partial sums
T_p = parallel time = O(n/p + log p)
T_s = serial time = O(n)

Synchronous and Asynchronous
Synchronous (e.g., PRAM), two processors driven by a common clock:
    Processor 0: cycle 0: y <- 0; cycles 1-500: Do i = 0 to 499: y <- y + A(i); End; cycle 501: y <- x + y; cycle 502: OUTPUT y; cycle 503: STOP
    Processor 1: cycle 0: x <- 0; cycles 1-500: Do i = 500 to 999: x <- x + A(i); End; cycle 501: STOP
Asynchronous (e.g., Pthreads):
    Thread 0: y <- 0; Do i = 0 to 499: y <- y + A(i); End; Barrier; y <- x + y
    Thread 1: x <- 0; Do i = 500 to 999: x <- x + A(i); End; Barrier

Addition: Pthreads model?
Instruction execution is NOT synchronized
Thread j:
1. Do i = 0 to log2 n - 1
2.     If j = k * 2^(i+1) for some k in N, then A(j) <- A(j) + A(j + 2^i)
3.     BARRIER
4. End
Correct output? How to measure time?
Number of active threads in iteration i = n / 2^(i+1)
The barrier completes each iteration before proceeding to the next iteration; within each iteration, threads may execute asynchronously
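A Pthreads sketch of this barrier-synchronized addition, assuming one thread per array element and the n = 8 values from the earlier example: every thread runs the same loop, and pthread_barrier_wait ensures each iteration completes before the next begins, even though threads within an iteration run asynchronously.

```c
#include <pthread.h>
#include <stdio.h>

#define N 8                         /* n = 2^k elements, one thread per element */
#define LOGN 3

double A[N] = {10, 11, 12, 13, 14, 15, 16, 17};   /* shared array (slide example) */
pthread_barrier_t bar;

void *tree_sum(void *arg) {
    long j = (long)arg;                            /* thread / element index */
    for (int i = 0; i < LOGN; i++) {
        long stride = 1L << (i + 1);
        if (j % stride == 0)                       /* j = k * 2^(i+1)        */
            A[j] = A[j] + A[j + (1L << i)];        /* A(j) += A(j + 2^i)     */
        pthread_barrier_wait(&bar);                /* finish iteration i     */
    }
    return NULL;
}

int main(void) {
    pthread_t t[N];
    pthread_barrier_init(&bar, NULL, N);
    for (long j = 0; j < N; j++) pthread_create(&t[j], NULL, tree_sum, (void *)j);
    for (long j = 0; j < N; j++) pthread_join(t[j], NULL);
    pthread_barrier_destroy(&bar);
    printf("sum = %.0f\n", A[0]);                  /* prints 108             */
    return 0;
}
```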

OpenMP Programming Model (2)
Directive-based parallel programming:
#pragma omp directive [clause list]
/* structured block */
Structured block = a section of code which is grouped together
OpenMP directives are not part of a programming language; they can be included in various programming languages (e.g., C, C++, Fortran)

OpenMP Programming Model (3)
Fork-join model:
Fork: the master thread creates a team of parallel threads
Join: when the team threads complete the statements in the parallel region, they synchronize and terminate, leaving only the master thread
(Figure: alternating fork and join phases)

OpenMP Directives (1)
OpenMP executes serially until a parallel directive (parallel region construct)
The same code will be executed by multiple threads; this is the fundamental OpenMP parallel construct
Example:
#pragma omp parallel [clause list]
{
    task();
}
The clause list specifies parameters of parallel execution: # of threads, private variables, shared variables
There is an implied barrier at the end of the parallel region
(Figure: the master thread forks a team of threads, each executing task())

OpenMP Directives (2)
Work-sharing constructs:
Divide the execution of the enclosed code region among the members of the team
Implied barrier at the end of a work-sharing construct
Types of work-sharing constructs (see the sketch below):
for: shares iterations of a loop across the team; represents a type of "data parallelism"
sections: breaks work into separate, discrete sections, each executed by a thread; represents a type of "functional parallelism"
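A short OpenMP C sketch (array names and sizes are illustrative assumptions) showing both constructs: a for construct that distributes loop iterations across the team (data parallelism) and a sections construct that gives each thread a different piece of work (functional parallelism); both end with an implied barrier. Compile with a flag such as gcc -fopenmp.

```c
#include <omp.h>
#include <stdio.h>

#define N 16

int main(void) {
    double a[N], b[N];

    #pragma omp parallel          /* fork a team of threads */
    {
        /* for: iterations of the loop are shared across the team */
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;
        /* implied barrier here */

        /* sections: each section is executed by one thread */
        #pragma omp sections
        {
            #pragma omp section
            { for (int i = 0; i < N; i++) b[i] = a[i] + 1.0; }

            #pragma omp section
            { printf("hello from a section on thread %d\n", omp_get_thread_num()); }
        }
        /* implied barrier here */
    }                             /* join: implied barrier at end of parallel region */

    printf("a[3] = %.1f, b[3] = %.1f\n", a[3], b[3]);
    return 0;
}
```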

Routing Mechanisms (1)
1. Store-and-forward routing
Message length = m words; number of hops = l
Store: each intermediate node receives the entire message from its predecessor; Forward: it then forwards the message to the next node
(Figure: example with l = 3, source P0 to destination P3 via P1 and P2; each hop takes t_h + m*t_w after the startup time t_s)
Total time for communication = t_s + (t_h + m*t_w) * l

Cut-Through Routing (3)
Pipelined data communication over l hops (1, 2, ..., l):
t_comm = t_s + l*t_h + t_w*m
(l*t_h: pipeline delay; t_w*m: communication time on a link)
Note: there is a small overhead in processing a FLIT at each intermediate node, smaller than processing a packet, so t_h is smaller than in the store-and-forward scenario.

Cut-Through Routing (7)
Minimizing l (the number of hops) is hard: the program has little control over the program-to-processor mapping, and machines (platforms) use various routing techniques to minimize congestion (e.g., randomized routing)
The per-hop time is usually relatively small compared with t_s and t_w*m for large m
Simple communication cost model: t_comm = t_s + t_w*m
Same amount of time to communicate between any two nodes (as if fully connected)
Use this cost model to design and optimize algorithms, instead of designing parallel-architecture-specific (mesh, hypercube) algorithms
The simplified cost model does not take congestion into account (see the cost sketch below)
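The three cost expressions above can be compared side by side; the C sketch below uses made-up parameter values purely for illustration.

```c
#include <stdio.h>

/* Store-and-forward: the whole m-word message is received and retransmitted
 * at every one of the l hops.                                              */
double t_store_forward(double ts, double th, double tw, double m, double l) {
    return ts + (th + m * tw) * l;
}

/* Cut-through: the message is pipelined in flits, so the per-hop term and
 * the per-word term add instead of multiplying.                            */
double t_cut_through(double ts, double th, double tw, double m, double l) {
    return ts + l * th + tw * m;
}

/* Simplified model used for algorithm design: hop count ignored entirely.  */
double t_simple(double ts, double tw, double m) {
    return ts + tw * m;
}

int main(void) {
    /* illustrative parameters: ts = 100, th = 2, tw = 1 (time units), m = 1000 words, l = 4 hops */
    double ts = 100, th = 2, tw = 1, m = 1000, l = 4;
    printf("store-and-forward: %.0f\n", t_store_forward(ts, th, tw, m, l)); /* 100 + 1002*4 = 4108 */
    printf("cut-through:       %.0f\n", t_cut_through(ts, th, tw, m, l));   /* 100 + 8 + 1000 = 1108 */
    printf("simplified:        %.0f\n", t_simple(ts, tw, m));               /* 1100 */
    return 0;
}
```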

LogP Model (3)
(Figure: P processor-memory pairs connected by a communication network; parameters: L = latency, o = overhead, g = gap, P = number of processors; each node has its own processor-memory latency and throughput)

LogP Model (4)
L: an upper bound on the latency, or delay, incurred in communicating a message containing a word from its source module to its target module
o: the overhead, defined as the length of time that a processor is engaged in the transmission or reception of each message
g: the gap, defined as the minimum time interval between consecutive message transmissions or receptions at a processor
P: the number of processors/memory modules

Program and Data Mapping (1)
Parallel program = collection of processes + interaction among processes
Each process = computation + data access
Classic problem: given a parallel program, embed (map) the processes and data onto a given parallel architecture such that the communication cost is minimized and the overall execution time is minimized

Program and Data Mapping (2)
A simple abstraction: map the parallel program graph G(V, E) (node = process, edge = communication) onto the parallel architecture graph G'(V', E') (node = processor, edge = interconnection)
This is the graph embedding problem

Program and Data Mapping (3)
Graph embedding problem: given a parallel program G(V, E) and a parallel architecture G'(V', E'), both undirected:
Function f: V -> V' maps each program vertex to an architecture vertex (e.g., a -> f(a), b -> f(b))
Function g: E -> paths in G'; for every edge (a, b) in E, g specifies a path in G' whose end points are f(a) and f(b)
Note:
1. V' can be larger than V
2. A vertex in V' may correspond to more than one vertex in V

Program and Data Mapping (5)
Metrics:
Congestion: max number of edges in E mapped onto a single edge in E' (measured on edges)
Dilation: an edge in E maps to a path in E'; dilation = max path length in E'
Expansion: |V'| / |V|; it is also possible that |V'| < |V|, for example with virtualization

Scalability (2)
Speedup = serial time (on a uniprocessor system) / parallel time using p processors
If speedup = O(p), then it is a scalable solution.

Amdahl's Law (1)
Amdahl's Law: a limit on the speedup achievable when a program is run on a parallel machine
Given an input program with serial portion S and parallelizable portion P, normalized so that the total execution time S + P = 1
Time on a uniprocessor machine: 1 = S + P
Time on a parallel machine: S + P/f, where f = speedup factor of the parallelizable portion

Scaled Speedup (Gustafson's Law) (2)
Amdahl's Law: fixed amount of computation
    Serial time = 1 = S + (1 - S); parallel time = S + (1 - S)/p, where p = number of processors
Gustafson's Law: increase p and the amount of computation
    Parallel time = 1 = S + P = S + (1 - S); if the parallelism scales linearly with p, serial time = S + (1 - S)*p
    Scaled speedup = serial time / parallel time = (S + (1 - S)*p) / (S + P) = (S + (1 - S)*p) / 1 = S + (1 - S)*p
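A small C sketch comparing the two laws numerically (the serial fraction S = 0.1 and the processor counts are made-up illustration values): Amdahl's speedup saturates at 1/S, while Gustafson's scaled speedup keeps growing linearly with p.

```c
#include <stdio.h>

/* Amdahl: fixed problem size; serial fraction S, parallel part sped up by p. */
double amdahl_speedup(double S, double p)    { return 1.0 / (S + (1.0 - S) / p); }

/* Gustafson: problem size grows with p; scaled speedup = S + (1 - S) * p.    */
double gustafson_speedup(double S, double p) { return S + (1.0 - S) * p; }

int main(void) {
    double S = 0.1;   /* 10% serial portion (illustrative) */
    for (double p = 2; p <= 1024; p *= 4)
        printf("p = %4.0f: Amdahl = %6.2f, Gustafson (scaled) = %7.2f\n",
               p, amdahl_speedup(S, p), gustafson_speedup(S, p));
    /* Amdahl approaches 1/S = 10 as p grows; Gustafson keeps growing with p. */
    return 0;
}
```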

Performance (1): Efficiency
Question: if we use p processors, is the speedup = p?
Efficiency = fraction of time a processor is usefully employed during the computation
(Figure: typical execution of a program on a parallel machine; with p = 2, processors P1 and P2 each split their time between useful work and idling (overhead), compared with the serial computation time at p = 1)
E = speedup / # of processors used
E is the average efficiency over all the processors; the efficiency of each processor can be different from the average value

Performance (3)
Cost = total amount of work done by a parallel system = parallel execution time x number of processors = T_p * p
Cost is also called the processor-time product
COST OPTIMAL (or WORK OPTIMAL) parallel algorithm: total work done = serial complexity of the problem

Performance Analysis (1)
Asymptotic analysis: Big-O notation (order notation)
Worst-case execution time of an algorithm; an upper bound on the growth rate of the execution time
Example: n x n matrix multiplication
1. Do i = 1 to n
2.     Do j = 1 to n
3.         C(i, j) <- 0
4.         Do k = 1 to n
5.             C(i, j) <- C(i, j) + A(i, k) * B(k, j)
6.         End
7.     End
8. End
T(n) = time complexity function = n^2 + n^3 + n^3 (line 3 contributes n^2; line 5 contributes n^3 multiplications and n^3 additions) = O(n^3)