Cache Coherency and Interconnection Networks



1 Cache Coherency and Interconnection Networks. Cluster and Grid Computing, Autumn Semester ( ), 7th August 2006. Umang Jain, Kumar Puspesh, Pankaj Jajoo, Amar Kumar Dani. 03CS CS CS CS304

2 CACHE COHERENCY

3 The Cache Coherence Problem
- Caches improve performance by storing frequently used data in faster memory.
- Since all processors share the same address space, more than one cache may hold a copy of the same block of data.
- If one processor updates a data item without informing the others, the copies become inconsistent and execution may be incorrect.

4 The Cache Coherence Problem
For correct execution, coherence must be enforced between the caches. The primary design issues are:
- Coherence detection strategy (or incoherence detection, shall we say!)
- Coherence enforcement strategy

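5 Enforcement Strategies: Write Invalidate Strategy. After a write to data block X, all other caches effectively no longer contain the block. (Figure: the written copy of X is updated; the copies of X in the other caches change to I, invalid.)

6 Enforcement Strategies: Write Update Strategy. After a write to data block X, all other caches are updated. (Figure: every cached copy of X is replaced by the new value.)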

7 Cache Coherence Protocols
To invalidate or update, consistency commands have to be issued to the various processor caches. Two options:
- Broadcast the commands for all caches to listen to
- Multicast the commands only to those caches holding a copy of the data block in question

8 Snoopy Cache Protocol. All caches snoop on the bus for all consistency commands. (Figure: MEMORY and three processors P, each with its own cache, attached to a shared bus.)

9 Snoopy Cache Protocol
- Snooping protocols rely on a shared bus between the processors for coherence.
- On a processor write, the write is passed through the cache to main memory over the bus, and invalidate or update commands are broadcast on the bus.
- The cache controller of any processor caching that address updates or invalidates its cache entry as appropriate.

10 Directory Based Schemes. Consistency commands are issued only to those caches holding a copy of the data block (book-keeping required). (Figure: three memory modules M, each with a directory DIR, and three processors P, each with its own cache.)

11 Directory Based Schemes
- A central directory may contain copies of the local cache directories, each holding the state information for the different blocks.
- A presence flag vector associated with each memory block has one bit for each cache, which eliminates the need for an exhaustive search.
- The flag vector and state information, along with the identity of the current owner, may be stored locally as well. This reduces the directory contention problem.
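The presence flag vector can be pictured as a bit mask per memory block. The sketch below is only an illustration: the class `DirectoryEntry`, its fields and its method names are hypothetical and do not come from the slides, and the handling of reads to a dirty block is deliberately left out.

```python
class DirectoryEntry:
    """Directory state for one memory block (hypothetical layout)."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.presence = 0    # bit i set => cache i holds a copy
        self.dirty = False   # True once a single cache has modified the block
        self.owner = None    # identity of the current owner, if any

    def sharers(self):
        """Read the sharers straight off the flag vector, no search of the caches."""
        return [i for i in range(self.num_caches) if (self.presence >> i) & 1]

    def record_read(self, cache_id):
        """Cache cache_id fetched a copy: set its presence bit."""
        self.presence |= 1 << cache_id

    def record_write(self, cache_id):
        """Return the caches that must be sent an invalidate, then make
        cache_id the only holder and the owner of the block."""
        targets = [i for i in self.sharers() if i != cache_id]
        self.presence = 1 << cache_id
        self.dirty = True
        self.owner = cache_id
        return targets
```

With four caches, a read by caches 0 and 2 followed by a write from cache 2 multicasts a single invalidate to cache 0 instead of broadcasting to all four.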

12 Write Invalidate: An Example
We can associate a state with each block of data in a cache:
- Invalid: inconsistent
- Valid: has not been updated locally; consistent
- Reserved: updated locally once; consistent only with the memory copy
- Dirty: consistent with no other copy; it is the only updated copy

13 Write Invalidate: An Example
Each cache controller executes a simple FSM, switching states on receiving messages.
Commands: mem_rd, mem_wr, p_rd, p_wr, wr_inv, rd_inv
Cases:
- Read miss: read from memory (state becomes valid)
- Write hit: update the block; all other copies are invalidated
- Write miss: read the block and update it; all other copies are invalidated
- Replacement: if the state is dirty, write the block back
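The four states and the command names resemble the classic write-once scheme, so the FSM can be sketched as a transition table. This is a minimal sketch under that assumption: the table `LOCAL`, the functions `snoop` and `access`, and the exact transitions are illustrative choices rather than the slides' own definition, and mem_wr (write-back on replacement) is not modelled.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    VALID = "valid"
    RESERVED = "reserved"
    DIRTY = "dirty"


# (current state, local command) -> (next state, command put on the bus or None)
# Assumed write-once transitions; not taken verbatim from the slides.
LOCAL = {
    (State.INVALID, "p_rd"): (State.VALID, "mem_rd"),    # read miss
    (State.INVALID, "p_wr"): (State.DIRTY, "rd_inv"),    # write miss
    (State.VALID, "p_rd"): (State.VALID, None),
    (State.VALID, "p_wr"): (State.RESERVED, "wr_inv"),   # first write goes through
    (State.RESERVED, "p_rd"): (State.RESERVED, None),
    (State.RESERVED, "p_wr"): (State.DIRTY, None),       # later writes stay local
    (State.DIRTY, "p_rd"): (State.DIRTY, None),
    (State.DIRTY, "p_wr"): (State.DIRTY, None),
}


def snoop(state, bus_cmd):
    """State change in a cache that observes bus_cmd issued by another cache."""
    if state is State.INVALID:
        return state
    if bus_cmd in ("wr_inv", "rd_inv"):
        return State.INVALID
    if bus_cmd == "mem_rd":
        return State.VALID   # a reserved or dirty copy is shared again
    return state


def access(states, cpu, cmd):
    """Apply one processor command and return the new list of cache states."""
    next_state, bus_cmd = LOCAL[(states[cpu], cmd)]
    result = [s if bus_cmd is None else snoop(s, bus_cmd) for s in states]
    result[cpu] = next_state
    return result
```

Starting from three invalid caches, `access(states, 0, "p_rd")` followed by `access(states, 0, "p_wr")` takes cache 0 from invalid to valid to reserved, and a second write makes it dirty.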

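14 Write Invalidate: An Example (Figure: all caches hold X in state VALID. P1 writes Y and issues wr_inv; its copy becomes RESERVED and the other copies become INVALID.)

15 Write Invalidate: An Example (Figure: P1 writes again (p_wr), changing Y to Z, and its copy becomes DIRTY while the others stay INVALID. P2 then reads (mem_rd), after which P1 and P2 hold Z in state VALID and the third cache stays INVALID.)

16 Write Invalidate: An Example (Figure: one cache holds Y in state RESERVED and the others are INVALID. P3 has a write miss: rd_inv brings Y into P3 (I -> Y) and invalidates the RESERVED copy, then p_wr changes P3's copy from Y to Z in state DIRTY.)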

17 Snoopy vs Directory Schemes
- Snoopy protocols are not suited to general topologies.
- As the number of processors increases, bus traffic begins to pose serious problems.
- Directory based schemes are much better for a large number of processors: they are more scalable and generate much less bus traffic.
- Directory based schemes sacrifice ease of implementation because of the increased hardware complexity.

18 Cache Coherent Network Architectures: Hierarchical Bus/Cache Architecture. (Figure: three memory modules M, three caches, and nine processor/cache nodes P/C arranged in a bus hierarchy.)

19 INTERCONNECTION NETWORKS

20 Interconnection Networks: Tree, Mesh, Hypercube, Tree of Meshes, Mesh of Trees, Fat-Trees, 2D-Torus

21 TREE TOPOLOGY

22 Trees
A general-purpose topology.
Advantages:
1. Easy to implement.
2. For any irregular topology, it is easy to define a tree that spans the whole graph.
Disadvantage:
1. The root and the nodes close to it become a bottleneck.

23 Trees (contd.) Binary Tree Networks (a) Static (b) Dynamic

24 Trees (contd.)
- Diameter: $d = 2\log_2((p+1)/2)$ for static trees and $d = 2\log_2 p$ for dynamic trees, where p = total number of nodes.
- Bisection width: clearly, it is equal to 1.
- Degree: 1, 2 or 3 (for binary trees).
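The static-tree diameter can be checked with a short derivation, assuming a complete binary tree of height h in which every node is a processor (the shape is an assumption; the slide does not state it). The longest path runs from one leaf up to the root and down to another leaf:

```latex
p = 2^{h+1} - 1
\;\Longrightarrow\;
h = \log_2\frac{p+1}{2},
\qquad
d = 2h = 2\log_2\frac{p+1}{2}
```

For example, p = 7 gives h = 2 and d = 4.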

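25 Trees (contd.)
Parallel Algorithm for Matrix-Vector Multiplication
$$\begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{pmatrix} = \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix}$$
Or, we can write it as
$$v_i = \sum_{j=1}^{n} a_{ij}\, u_j, \qquad 1 \le i \le m$$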

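26 Trees (contd.) (Figure: a binary tree of seven processors. The leaves P1 to P4 hold u_1 to u_4 and receive the columns of A row by row; P5 and P6 are the intermediate nodes; the root P7 outputs v_1, v_2, v_3.)

27 Trees (contd.) Example (Figure: a sample matrix A and vector U; the values are not recoverable.)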

28 Trees (contd.)
procedure TREE_MULTIPLICATION (A, U, V)
do steps 1 and 2 in parallel
(1) for i = 1 to n do in parallel   // for leaf nodes
      for j = 1 to m do
        (1.1) compute u_i x a_ji
        (1.2) send result to parent
      end for
    end for
(2) for i = n+1 to 2n-1 do in parallel   // for intermediate nodes
      while P_i receives two inputs do
        (2.1) compute sum of both inputs
        (2.2) if i < 2n-1 then send result to parent
              else produce result as output
              end if
      end while
    end for

29 Trees (contd.)
Analysis:
1. It takes log n steps for v_1 to emerge from the root after the first row of A has entered at the leaves.
2. After m-1 more steps, v_m emerges from the root.
Hence, TREE_MULTIPLICATION takes (m + log n) steps. Cost is $O(n^2)$ when m < n.
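The pipelined behaviour behind the step count can be reproduced with a small sequential simulation. This is a sketch, not the slides' own code: the function name `tree_multiplication`, the per-level register lists and the restriction to n being a power of two are assumptions, and the leaf multiplication is counted as one step.

```python
import math


def tree_multiplication(a, u):
    """Simulate the tree algorithm for v = A u, one synchronous step at a time.

    a is an m x n matrix (list of rows), u has length n.
    Returns (v, steps) where steps is the number of steps until v_m emerges.
    """
    m, n = len(a), len(u)
    depth = int(math.log2(n))
    assert 2 ** depth == n, "sketch assumes n is a power of two"
    # levels[h][i] is the value node i at height h produced in the previous step.
    levels = [[None] * (n >> h) for h in range(depth + 1)]
    v, steps = [], 0
    while len(v) < m:
        new = [None] * (depth + 1)
        # Leaves: multiply u_i by the element of the row entering at this step.
        if steps < m:
            new[0] = [u[i] * a[steps][i] for i in range(n)]
        else:
            new[0] = [None] * n
        # Internal nodes: add the two values their children produced last step.
        for h in range(1, depth + 1):
            below = levels[h - 1]
            new[h] = [
                None if below[2 * i] is None else below[2 * i] + below[2 * i + 1]
                for i in range(n >> h)
            ]
        levels = new
        steps += 1
        if levels[depth][0] is not None:
            v.append(levels[depth][0])   # the root emits the next v_i
    return v, steps
```

For a 3 x 4 matrix the simulation finishes in 3 + log2(4) = 5 steps, matching the (m + log n) count.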

30 Trees (contd.) Due to heavy traffic through the root node, the links near the root become a bottleneck. So, some modifications are made to the standard tree networks: Fat-Tree

31 MESH TOPOLOGY

32 Mesh: Array (1-D Mesh), Ring (1-D Torus), 2-D Mesh, 2-D Torus

33 Mesh (Contd.)
Properties:
- No. of processors = k*k
- Distance: 1 to 2*k - 2
- Diameter: 2*k - 2
- Degree: 2 to 4
- Bisection width: k if k is even, k + 1 if k is odd
(Figure: a two-dimensional mesh.)

34 Torus
Properties:
- No. of processors = k*k
- Distance: 1 up to the diameter
- Diameter: k if k is even, k - 1 if k is odd
- Degree: 4
- Bisection width: 2*k or 2*k + 2
(Figure: a two-dimensional torus.)
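Diameter figures for small grids can be checked by brute force. The sketch below runs a breadth-first search from every node of a k x k grid; the function name `grid_diameter` and its `wraparound` flag are hypothetical names chosen for this illustration.

```python
from collections import deque
from itertools import product


def grid_diameter(k, wraparound):
    """Diameter of a k x k mesh (wraparound=False) or torus (wraparound=True)."""

    def neighbours(r, c):
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if wraparound:
                yield nr % k, nc % k          # torus links wrap around the edges
            elif 0 <= nr < k and 0 <= nc < k:
                yield nr, nc                  # mesh has no links past the border

    def eccentricity(start):
        # Plain BFS: distance from start to the farthest node.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours(*node):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return max(dist.values())

    return max(eccentricity(s) for s in product(range(k), repeat=2))
```

For the mesh the search returns 2k - 2; for the torus it returns twice the integer part of k/2, because wraparound links halve the worst-case distance in each dimension.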

35 Mesh (Contd.) Matrix Multiplication :

36-44 Mesh (Contd.) (Figures: a worked example with matrices A, B and the product A x B, followed by step-by-step snapshots of the multiplication on the mesh; the values are not recoverable.)

45 Mesh (Contd.)
Matrix Multiplication
Procedure matrix_multiplication
for each processor P_ij in parallel do
  c := 0
end for
repeat 3*N^(1/2) - 2 times
  for each processor P_ij in parallel do
    receive a from top
    receive b from left
    c := c + a*b
    send a down
    send b to right
  end for
end repeat

46 Mesh (Contd.)
Analysis: For an m*m matrix, the total computation time required is 3m - 2 steps. Each step takes constant time for the multiplication and transfer operations. Thus the time complexity is O(m). Cost = $O(m^3)$.
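The 3m - 2 step count can be seen in a sequential simulation of the mesh. This is a sketch under stated assumptions: since the slide figures did not survive, which matrix enters from which edge and how the inputs are staggered are choices made here (rows of A enter from the left, columns of B from the top, each delayed by its row or column index), and the function name `mesh_matrix_multiplication` is hypothetical.

```python
def mesh_matrix_multiplication(a, b):
    """Simulate an m x m mesh computing C = A B in 3m - 2 synchronous steps."""
    m = len(a)
    c = [[0] * m for _ in range(m)]
    # Values each processor received in the previous step.
    from_left = [[None] * m for _ in range(m)]
    from_top = [[None] * m for _ in range(m)]
    steps = 3 * m - 2
    for t in range(steps):
        new_left = [[None] * m for _ in range(m)]
        new_top = [[None] * m for _ in range(m)]
        for i in range(m):
            for j in range(m):
                if j == 0:
                    k = t - i   # row i of A enters i steps late (assumed skew)
                    new_left[i][j] = a[i][k] if 0 <= k < m else None
                else:
                    new_left[i][j] = from_left[i][j - 1]
                if i == 0:
                    k = t - j   # column j of B enters j steps late (assumed skew)
                    new_top[i][j] = b[k][j] if 0 <= k < m else None
                else:
                    new_top[i][j] = from_top[i - 1][j]
                x, y = new_left[i][j], new_top[i][j]
                if x is not None and y is not None:
                    c[i][j] += x * y   # the c := c + a*b step of the procedure
        from_left, from_top = new_left, new_top
    return c, steps
```

With this staggering, processor P_ij sees a_ik and b_kj together at step i + j + k, so the last product is formed at step 3(m - 1) and the whole run takes 3m - 2 steps.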

47 Mesh (Contd.)
Advantages:
1. There are multiple paths between any two nodes, so the network tolerates the failure of a specific node.
2. The topology supports many simultaneous messages because of the multiplicity of paths.
3. The growth complexity is $2N^{1/2} + 1$ (the nodes added when a k x k mesh grows by one row and one column), and the hardware of existing nodes does not need to change.
4. Advantageous for problems involving calculations in n dimensions, for example image processing, finite element analysis, etc.

48 Mesh (Contd.)
Disadvantages:
1. Large diameter, $2N^{1/2} - 2$.
2. It is a non-uniform topology, i.e. there is a range of node degrees, so the complexity of the routing algorithm increases.

49 Combination: Mesh of Trees, Tree of Meshes (quite similar to Fat-Trees)

50 HYPERCUBE TOPOLOGY

51 Construction (Figure: hypercubes of increasing dimension, from 0-D up to 4-D.)

52 Routing Algorithm
- Each node is given a node ID.
- An n-dimensional cube has n-bit node IDs.
- Sending a message from node A to node B can be done in at most n cycles.
- On cycle i, the node holding the message compares bit i of its own ID with bit i of the destination ID.
- If the bits match, the node holds the message for that cycle.
- If the bits do not match, it forwards the message along dimension i.
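The bit-by-bit rule can be written down directly. The sketch below assumes node IDs are plain integers and that bit 0 is examined first; the function name `hypercube_route` is hypothetical.

```python
def hypercube_route(src, dst, n):
    """Return the list of node IDs visited when routing from src to dst
    in an n-dimensional hypercube, correcting one differing bit per cycle."""
    path = [src]
    node = src
    for i in range(n):
        if ((node ^ dst) >> i) & 1:   # bit i differs from the destination
            node ^= 1 << i            # forward along dimension i
            path.append(node)
        # otherwise the node simply holds the message for this cycle
    return path
```

Routing from 000 to 101 in a 3-cube visits 000, 001, 101: one hop for each bit in which the two IDs differ, so never more than n hops.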

53 Properties
Advantages:
1. For a hypercube with $2^d$ nodes, the number of steps needed to send a message to any node is at most d.
2. The hypercube topology is highly scalable and node symmetric.
Disadvantages:
1. Difficult to implement.
2. Cannot be scaled up to include an arbitrary number of computers.

54 Metrics ($N = 2^n$ nodes, n = dimension)
- Diameter = $\log_2 N = n$
- Bisection width = N/2
- Cost = no. of links = $(N/2)\log_2 N$
- Degree = $\log_2 N = n$

55 Topological Properties Useful for Parallel Algorithms
- The recursive structure of hypercubes makes them ideal for recursive and divide-and-conquer type problems.
- Multiple node-disjoint and edge-disjoint paths exist between many pairs of nodes in a hypercube.

56 Algorithm for Hypercube Topology
Matrix multiplication:
- To multiply two m*m matrices, $p = m^3 = 2^q$ processors are required.
- Each processor has 3 registers: R(a), R(b) and R(c).
- Each processor is labeled by 3 indices i, j, k, where each index is a q/3-bit binary number.
- Initially, processor (0, j, k) holds A(j, k) and B(j, k) in its R(a) and R(b) registers.
- At the end of the computation, register R(c) of processor (0, j, k) holds element C(j, k) of the product matrix C.

57-59 Example (Figures: matrices A and B with the expected product C = A*B, then snapshots of the register pairs held by the processors of the cube, such as 1,5 2,6 3,7 4,8, ending with the step R(C) = R(A)*R(B); the full layout is not recoverable.)

60-61 Algorithm
Processor x = ijk, $0 \le i, j, k < m$; N_d(x) is the neighbour of x along dimension d.
1. for l = q/3 - 1 downto 0, every processor x in parallel {
     if bit l of i is 1 {
       R_A[x] := R_A[N_{l+2q/3}(x)]
       R_B[x] := R_B[N_{l+2q/3}(x)]
     }
   }
2. for l = q/3 - 1 downto 0, every processor x in parallel {
     if bit l of i and bit l of k are different {
       R_A[x] := R_A[N_l(x)]
     }
   }
3. for l = q/3 - 1 downto 0, every processor x in parallel {
     if bit l of i and bit l of j are different {
       R_B[x] := R_B[N_{l+q/3}(x)]
     }
   }
4. Processors x, $0 \le x < p$, do R_C[x] := R_A[x] * R_B[x]   { $p = m^3$ parallel multiplications in one step }
5. for l = q/3 - 1 downto 0, every processor x in parallel {
     if bit l of i is 0 {
       R_C[x] := R_C[N_{l+2q/3}(x)] + R_C[x]
     }
   }
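The five steps can be traced with a sequential simulation of the cube. This is a sketch under stated assumptions: the function name `cube_matrix_multiplication` is hypothetical, m must be a power of two, processor x packs its label as i (high bits), j, k (low bits), the loops run over bit positions q/3 - 1 down to 0, and step 5 sums along the i dimensions, which is what leaves C(j, k) on processor (0, j, k).

```python
def cube_matrix_multiplication(a, b):
    """Simulate the hypercube algorithm on p = m**3 processors for m x m inputs."""
    m = len(a)
    r = m.bit_length() - 1            # bits per index, q/3 in the slides
    assert 1 << r == m, "sketch assumes m is a power of two"
    p = m ** 3

    def label(x):                     # x = ijk with i in the high bits
        return x >> (2 * r), (x >> r) & (m - 1), x & (m - 1)

    def bit(value, pos):
        return (value >> pos) & 1

    ra, rb = [None] * p, [None] * p
    for j in range(m):                # initial layout on the plane i = 0
        for k in range(m):
            ra[(j << r) | k], rb[(j << r) | k] = a[j][k], b[j][k]

    for l in reversed(range(r)):      # step 1: copy the i = 0 plane to every i
        old_a, old_b = ra[:], rb[:]
        for x in range(p):
            if bit(label(x)[0], l) == 1:
                ra[x] = old_a[x ^ (1 << (l + 2 * r))]
                rb[x] = old_b[x ^ (1 << (l + 2 * r))]

    for l in reversed(range(r)):      # step 2: (i, j, k) ends with A(j, i)
        old_a = ra[:]
        for x in range(p):
            i, j, k = label(x)
            if bit(i, l) != bit(k, l):
                ra[x] = old_a[x ^ (1 << l)]

    for l in reversed(range(r)):      # step 3: (i, j, k) ends with B(i, k)
        old_b = rb[:]
        for x in range(p):
            i, j, k = label(x)
            if bit(i, l) != bit(j, l):
                rb[x] = old_b[x ^ (1 << (l + r))]

    rc = [ra[x] * rb[x] for x in range(p)]   # step 4: one parallel multiply

    for l in reversed(range(r)):      # step 5: sum over i into the plane i = 0
        old_c = rc[:]
        for x in range(p):
            if bit(label(x)[0], l) == 0:
                rc[x] = old_c[x] + old_c[x ^ (1 << (l + 2 * r))]

    return [[rc[(j << r) | k] for k in range(m)] for j in range(m)]
```

After steps 2 and 3 processor (i, j, k) holds A(j, i) and B(i, k), so the products summed over i in step 5 give C(j, k) = sum over i of A(j, i) * B(i, k).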

62 Algorithm Complexity
Analysis: For an m*m matrix, the total computation time required is 4(q/3) steps, where $q = \log_2 p = 3\log_2 m$. Each step takes constant time for the multiplication and transfer operations. Thus the time complexity is O(q), i.e. $O(\log m)$. Cost = $O(m^3 \log m)$.

63 Comparison of Properties of Tree, Mesh and Hypercube Topologies (p = number of nodes, log = log base 2)

Property          Tree                         Mesh                                  Hypercube
Diameter          2log((p+1)/2) or 2log(p)     2*p^(1/2) - 2                         log(p)
Bisection Width   1                            p^(1/2) (p even), p^(1/2)+1 (p odd)   p/2
Degree            1, 2 or 3                    2, 3 or 4                             log(p)

64 Comparison of Algorithm Complexity for Matrix Multiplication (m*m matrices)

                       Mesh      Hypercube
Number of Processors   m^2       m^3
Time Complexity        O(m)      O(log m)
Cost                   O(m^3)    O(m^3 log m)

65 Thank You!
