Part II. Shared-Memory Parallelism. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 1


1 Part II: Shared-Memory Parallelism. Shared-memory algorithms, implementations, and models: Chapter 5, PRAM and Basic Algorithms; Chapter 6A, More Shared-Memory Algorithms; Chapter 6B, Implementation of Shared Memory; Chapter 6C, Shared-Memory Abstractions. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 1

2 About This Presentation. This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. Behrooz Parhami. First edition released Spring 2005; revised Spring 2006, Fall 2008, and several times since, most recently Winter 2016. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 2

3 II Shared-Memory Parallelism. Shared memory is the most intuitive parallel user interface: abstract SM (PRAM) ignores implementation issues; implementation without worsening the memory bottleneck; shared-memory models and their performance impact. Topics in This Part: Chapter 5, PRAM and Basic Algorithms; Chapter 6A, More Shared-Memory Algorithms; Chapter 6B, Implementation of Shared Memory; Chapter 6C, Shared-Memory Abstractions. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 3

4 5 PRAM and Basic Algorithms. PRAM, a natural extension of RAM (random-access machine): present definitions of the model and its various submodels; develop algorithms for key building-block computations. Topics in This Chapter: 5.1 PRAM Submodels and Assumptions; 5.2 Data Broadcasting; 5.3 Semigroup or Fan-in Computation; 5.4 Parallel Prefix Computation; 5.5 Ranking the Elements of a Linked List; 5.6 Matrix Multiplication. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 4

5 Why Start with Shared Memory? Study one extreme of parallel computation models: abstract SM (PRAM), which ignores implementation issues. This abstract model is either realized or emulated; in the latter case, the benefits are similar to those of HLLs. At the other extreme of the spectrum of models lies the concrete circuit model, which incorporates hardware details, allows explicit latency/area/energy trade-offs, and facilitates theoretical studies of speed-up limits. Everything else falls between these two extremes. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 5

6 5.1 PRAM Submodels and Assumptions. Fig. 4.6: Conceptual view of a parallel random-access machine (PRAM): processors 0 to p-1 connected to a shared memory of m locations. Processor i can do the following in the three phases of one cycle: fetch a value from address s_i in shared memory; perform computations on data held in local registers; store a value into address d_i in shared memory. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 6

7 Types of PRAM. Classified by reads from the same location (Exclusive or Concurrent) and writes to the same location (Exclusive or Concurrent): EREW, least powerful, most realistic; ERCW, not useful; CREW, default; CRCW, most powerful, further subdivided. Fig. 5.1: Submodels of the PRAM model. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 7

8 Types of CRCW PRAM. CRCW submodels are distinguished by the way they treat multiple writes. Undefined: the value written is undefined (CRCW-U). Detecting: a special code for detected collision is written (CRCW-D). Common: allowed only if they all store the same value (CRCW-C) [this is sometimes called the consistent-write submodel]. Random: the value is randomly chosen from those offered (CRCW-R). Priority: the processor with the lowest index succeeds (CRCW-P). Max/Min: the largest/smallest of the values is written (CRCW-M). Reduction: the arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or another combination of values is written. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 8

9 Power of CRCW PRAM Submodels. Model U is more powerful than model V if T_U(n) = o(T_V(n)) for some problem. EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P. Theorem 5.1: A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p). Intuitive justification for concurrent read emulation (write is similar): write the p memory addresses in a list; sort the list of addresses in ascending order; remove all duplicate addresses; access data at the desired addresses; replicate data via parallel prefix computation. Each of these steps requires constant or O(log p) time. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 9

10 Implications of the CRCW Hierarchy of Submodels. EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P. A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p). Thus our most powerful CRCW PRAM submodel can be emulated by the least powerful submodel with logarithmic slowdown. Efficient parallel algorithms have polylogarithmic running times, and the running time is still polylogarithmic after the slowdown due to emulation, so we need not be too concerned with the CRCW submodel used: simply use whichever submodel is most natural or convenient. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 10

11 Some Elementary PRAM Computations. Initializing an n-vector (base address = B) to all 0s: for j = 0 to ceil(n/p) - 1, processor i, 0 ≤ i < p, do: if jp + i < n then M[B + jp + i] := 0; endfor (the vector is covered in ceil(n/p) segments of p elements each). Adding two n-vectors and storing the results in a third (three base addresses). Convolution of two n-vectors: W_k = sum over i + j = k of U_i V_j (base addresses B_W, B_U, B_V). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 11
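As a concrete illustration of the initialization pattern above, here is a minimal sequential simulation in Python (the function name, arguments, and memory layout are illustrative assumptions, not from the slides):

# Sketch of the PRAM vector-initialization pattern (hypothetical simulation).
# In segment j, processor i clears element j*p + i of the vector.
def init_vector(M, B, n, p):
    for j in range((n + p - 1) // p):          # ceil(n/p) segments
        for i in range(p):                     # all p processors act in parallel
            if j * p + i < n:
                M[B + j * p + i] = 0

memory = [None] * 16
init_vector(memory, B=2, n=10, p=4)
print(memory[2:12])   # ten zeroed elements starting at base address 2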

12 5.2 Data Broadcasting. Making p copies of B[0] by recursive doubling: for k = 0 to ceil(log2 p) - 1, Proc j, 0 ≤ j < 2^k, do: copy B[j] into B[j + 2^k]; endfor. Can modify the algorithm so that redundant copying does not occur and the array bound is not exceeded. Fig. 5.2: Data broadcasting in EREW PRAM via recursive doubling. Fig. 5.3: EREW PRAM data broadcasting without redundant copying. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 12
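A sketch of the recursive-doubling broadcast, simulated sequentially in Python; the bound check mentioned above keeps the copies inside the array (names are hypothetical):

# Sketch of EREW broadcasting by recursive doubling (simulation, not real parallel code).
import math

def broadcast(value, p):
    B = [None] * p
    B[0] = value                               # processor 0 holds the value initially
    for k in range(math.ceil(math.log2(p))):   # about log2(p) doubling steps
        step = 2 ** k
        for j in range(step):                  # processors 0 .. step-1 copy in parallel
            if j + step < p:                   # guard against exceeding the array bound
                B[j + step] = B[j]
    return B

print(broadcast(42, 10))   # all 10 entries become 42 after 4 steps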

13 Class Participation: Broadcast-Based Sorting. Each person writes down an arbitrary nonnegative integer with a few digits on a piece of paper. Students take turns broadcasting their numbers by calling them out aloud. Each student puts an X on paper for every number called out that is smaller than his/her own number, or is equal but was called out before the student's own value. Each student counts the number of Xs on paper to determine the rank of his/her number. Students call out their numbers in order of the computed rank. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 13

14 All-to-All Broadcasting on EREW PRAM. EREW PRAM algorithm for all-to-all broadcasting: Processor j, 0 ≤ j < p, write own data value into B[j]; for k = 1 to p - 1, Processor j, 0 ≤ j < p, do: read the data value in B[(j + k) mod p]; endfor. This O(p)-step algorithm is time-optimal. Naive EREW PRAM sorting algorithm (using all-to-all broadcasting): Processor j, 0 ≤ j < p, write 0 into R[j]; for k = 1 to p - 1, Processor j, 0 ≤ j < p, do: l := (j + k) mod p; if S[l] < S[j] or (S[l] = S[j] and l < j) then R[j] := R[j] + 1; endif; endfor; Processor j, 0 ≤ j < p, write S[j] into S[R[j]]. This O(p)-step sorting algorithm is far from optimal; sorting is possible in O(log p) time. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 14

15 5.3 Semigroup or Fan-in Computation. EREW PRAM semigroup computation algorithm: Proc j, 0 ≤ j < p, copy X[j] into S[j]; s := 1; while s < p: Proc j, 0 ≤ j < p - s, do S[j + s] := S[j] ⊗ S[j + s]; s := 2s; endwhile; broadcast S[p - 1] to all processors. This algorithm is optimal for PRAM, but its speedup of O(p / log p) is not. If we use p processors on a list of size n = O(p log p), then optimal speedup can be achieved. Fig. 5.4: Semigroup computation in EREW PRAM. Fig. 5.5: Intuitive justification of why parallel slack helps improve the efficiency: lower degree of parallelism near the root, higher degree of parallelism near the leaves. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 15

16 5.4 Parallel Prefix Computation. Same as the first part of the semigroup computation (no final broadcasting): after step k, S[j] holds the combination of the (up to) 2^(k+1) inputs ending at position j, so after about log2 p steps S[j] holds the prefix result 0:j. Fig. 5.6: Parallel prefix computation in EREW PRAM via recursive doubling. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 16
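The same recursive-doubling idea, written as a small sequential Python simulation of the parallel prefix computation (an illustrative sketch; the operator and variable names are assumptions):

# Sketch of parallel prefix (inclusive scan) by recursive doubling,
# simulated sequentially; S[j] ends up holding x[0] op x[1] op ... op x[j].
import operator

def prefix_scan(x, op=operator.add):
    S = list(x)
    s = 1
    while s < len(S):
        # on an EREW PRAM, processors j >= s would do this step concurrently
        new_S = S[:]
        for j in range(s, len(S)):
            new_S[j] = op(S[j - s], S[j])
        S = new_S
        s *= 2
    return S

print(prefix_scan([1, 2, 3, 4, 5]))   # [1, 3, 6, 10, 15]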

17 A Divide-and-Conquer Parallel-Prefix Algorithm. Inputs x_0, x_1, ..., x_(n-2), x_(n-1): the inputs are combined pairwise, a parallel prefix computation of size n/2 is performed on the pair results, and one more combining step produces the remaining prefixes 0:0 through 0:n-1 (each vertical line in the figure represents a location in shared memory). T(p) = T(p/2) + 2, so T(p) ≈ 2 log2 p. In hardware, this is the basis for the Brent-Kung carry-lookahead adder. Fig. 5.7: Parallel prefix computation using a divide-and-conquer scheme. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 17

18 Another Divide-and-Conquer Algorithm. Inputs x_0, x_1, ..., x_(n-2), x_(n-1): perform a parallel prefix computation on the n/2 even-indexed inputs and, in parallel, on the n/2 odd-indexed inputs (each vertical line represents a location in shared memory); one combining step then yields all prefixes 0:0 through 0:n-1. T(p) = T(p/2) + 1 = log2 p. Strictly optimal algorithm, but requires commutativity. Fig. 5.8: Another divide-and-conquer scheme for parallel prefix computation. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 18

19 5.5 Ranking the Elements of a Linked List. Fig. 5.9: Example linked list and the ranks of its elements; each record has info and next fields, and the rank of an element is its distance from the terminal element (the distance from the head is also shown). List ranking appears to be hopelessly sequential; one cannot get to a list element except through its predecessor! Fig. 5.10: PRAM data structures representing a linked list (head pointer plus info, next, and rank arrays) and the ranking results. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 19

20 List Ranking via Recursive Doubling. Many problems that appear to be unparallelizable to the uninitiated are parallelizable; intuition can be quite misleading! Fig. 5.11: Element ranks initially and after each of the three iterations. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 20

21 PRAM List Ranking Algorithm. Question: Which PRAM submodel is implicit in this algorithm? PRAM list ranking algorithm (via pointer jumping): 1. Processor j, 0 ≤ j < p, do {initialize the partial ranks}: if next[j] = j then rank[j] := 0 else rank[j] := 1 endif; 2. while rank[next[head]] ≠ 0, Processor j, 0 ≤ j < p, do: rank[j] := rank[j] + rank[next[j]]; next[j] := next[next[j]]; endwhile. Answer: CREW. If we do not want to modify the original list, we simply make a copy of it first, in constant time. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 21
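A hedged Python sketch of the pointer-jumping idea, run as synchronous rounds on a copy of the list (the example chain and the names are hypothetical, not the list in Fig. 5.11):

# Sketch of list ranking by pointer jumping (CREW reads of next[next[j]]),
# simulated with synchronous rounds; the terminal element points to itself.
def list_rank(next_ptr):
    n = len(next_ptr)
    rank = [0 if next_ptr[j] == j else 1 for j in range(n)]
    nxt = list(next_ptr)
    for _ in range(max(1, n.bit_length())):        # about log2(n) rounds suffice
        rank, nxt = ([rank[j] + rank[nxt[j]] for j in range(n)],
                     [nxt[nxt[j]] for j in range(n)])
    return rank

# hypothetical chain A -> B -> C -> D -> E -> F (F terminal), stored in scrambled order
nxt = {'A': 'B', 'B': 'C', 'C': 'D', 'D': 'E', 'E': 'F', 'F': 'F'}
order = ['C', 'F', 'A', 'E', 'B', 'D']
idx = {name: i for i, name in enumerate(order)}
print(list_rank([idx[nxt[name]] for name in order]))   # [3, 0, 5, 1, 4, 2]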

22 5.6 Matrix Multiplication. Sequential matrix multiplication for m × m matrices: for i = 0 to m - 1 do; for j = 0 to m - 1 do; t := 0; for k = 0 to m - 1 do t := t + a_ik b_kj endfor; c_ij := t; endfor; endfor. PRAM solution with m^3 processors: each processor does one multiplication (not very efficient); c_ij := sum over k = 0 to m - 1 of a_ik b_kj. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 22

23 PRAM Matrix Multiplication with m^2 Processors. PRAM matrix multiplication using m^2 processors: Proc (i, j), 0 ≤ i, j < m, do: begin t := 0; for k = 0 to m - 1 do t := t + a_ik b_kj endfor; c_ij := t; end. Processors are numbered (i, j), instead of 0 to m^2 - 1. Θ(m) steps: time-optimal. The CREW model is implicit. Fig. 5.12: PRAM matrix multiplication; p = m^2 processors. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 23

24 PRAM Matrix Multiplication with m Processors. PRAM matrix multiplication using m processors: for j = 0 to m - 1, Proc i, 0 ≤ i < m, do: t := 0; for k = 0 to m - 1 do t := t + a_ik b_kj endfor; c_ij := t; endfor. Θ(m^2) steps: time-optimal. The CREW model is implicit, but because the order of multiplications is immaterial, accesses to B can be skewed to allow the EREW model. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 24

25 PRAM Matrix Multiplication with Fewer Processors. The algorithm is similar, except that each processor is in charge of computing m/p rows of C. Θ(m^3/p) steps: time-optimal; the EREW model can be used. A drawback of all algorithms thus far is that only two arithmetic operations (one multiplication and one addition) are performed for each memory access. This is particularly costly for NUMA shared-memory machines. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 25

26 More Efficient Matrix Multiplication (for NUMA). Partition each matrix into p square blocks of size q × q, where q = m/√p; one processor computes the elements of C that it holds in local memory. For the 2 × 2 block case: (A B; C D) × (E F; G H) = (AE+BG, AF+BH; CE+DG, CF+DH). Block matrix multiplication follows the same algorithm as simple matrix multiplication, with block-band i of the first matrix and block-band j of the second producing block (i, j) of the result. Fig. 5.13: Partitioning the matrices for block matrix multiplication. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 26

27 Details of Block Matrix Multiplication. In each step, an element of block (i, k) in matrix A is multiplied by one block-row of block (k, j) in matrix B, and the products are added into one block-row of block (i, j) in matrix C. A multiply-add computation on q × q blocks needs 2q^2 = 2m^2/p memory accesses and 2q^3 arithmetic operations, so q arithmetic operations are done per memory access. Fig. 5.14: How processor (i, j) operates on an element of A and one block-row of B to update one block-row of C. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 27
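A small NumPy sketch of block matrix multiplication under the partitioning described above (assuming m is divisible by the block size q; the loop nest stands in for the work that processor (i/q, j/q) would do on its local block of C):

# Sketch of block matrix multiplication; processor (i//q, j//q) owns C[i:i+q, j:j+q]
# and accumulates A-block times B-block products into it, one k-block at a time.
import numpy as np

def block_matmul(A, B, q):
    m = A.shape[0]
    C = np.zeros((m, m))
    for i in range(0, m, q):
        for j in range(0, m, q):
            for k in range(0, m, q):
                C[i:i+q, j:j+q] += A[i:i+q, k:k+q] @ B[k:k+q, j:j+q]
    return C

m, p = 8, 16                                    # q = m / sqrt(p) = 2
A, B = np.random.rand(m, m), np.random.rand(m, m)
assert np.allclose(block_matmul(A, B, q=2), A @ B)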

28 6A More Shared-Memory Algorithms. Develop PRAM algorithms for more complex problems: searching, selection, sorting, and other nonnumerical tasks; must present background on the problem in some cases. Topics in This Chapter: 6A.1 Parallel Searching Algorithms; 6A.2 Sequential Rank-Based Selection; 6A.3 A Parallel Selection Algorithm; 6A.4 A Selection-Based Sorting Algorithm; 6A.5 Alternative Sorting Algorithms; 6A.6 Convex Hull of a 2D Point Set. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 28

29 6A.1 Parallel Searching Algorithms. Searching an unordered list in PRAM. Sequential time: n comparisons in the worst case, about n/2 on average. Divide the list of n items into p segments of ceil(n/p) items (the last segment may have fewer); processor i will be in charge of ceil(n/p) list elements, beginning at address i × ceil(n/p). Parallel time: ceil(n/p) in the worst case, ??? on average. Perfect speedup of p with p processors? There are pre- and postprocessing overheads. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 29

30 Parallel (p + 1)-ary Search on PRAM. p probes, rather than 1, per step; number of steps: log_(p+1)(n + 1) = log2(n + 1) / log2(p + 1) = Θ(log n / log p). Speedup ≈ log p. Optimal: no comparison-based search algorithm can be faster. The figure (example from Sec. 8) steps two processors through a sorted list in three probe steps. A single search in a sorted list can't be significantly speeded up through parallel processing, but all hope is not lost: dynamic data (sorting overhead) and batch searching (multiple lookups) still benefit from parallelism. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 30

31 6A.2 Sequential Rank-Based Selection. Selection: find the (or a) kth smallest among n elements; example: finding the 5th smallest element in a given list. Naive solution through sorting: O(n log n) time. But a linear-time sequential algorithm can be developed, built around m, the median of the medians of the n/q sublists: at least n/4 of the elements are ≤ m and at least n/4 are ≥ m, so for the sublists L (elements < m), E (elements = m), and G (elements > m) we have max(|L|, |G|) ≤ 3n/4; if k ≤ |L| the answer lies in L, and if k > |L| + |E| it lies in G. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 31

32 Linear-Time Sequential Selection Algorithm. Sequential rank-based selection algorithm select(S, k): 1. if |S| < q (q a small constant) then sort S and return the kth smallest element of S; else divide S into ceil(|S|/q) subsequences of size q, sort each subsequence and find its median, and let the ceil(|S|/q) medians form the sequence T endif [O(n)]; 2. m = select(T, ceil(|T|/2)) {find the median m of the medians} [T(n/q)]; 3. create the subsequences L (elements of S that are < m), E (elements of S that are = m), and G (elements of S that are > m) [O(n)]; 4. if |L| ≥ k then return select(L, k); else if |L| + |E| ≥ k then return m; else return select(G, k - |L| - |E|) endif [T(3n/4), to be justified: since m is the median of the medians, at least n/4 elements are ≤ m and at least n/4 are ≥ m, so max(|L|, |G|) ≤ 3n/4]. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 32
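A runnable Python sketch of the select(S, k) recursion above with q = 5 (the variable names and the sample list are illustrative, not the slide's example):

# Sketch of the sequential rank-based selection algorithm (median of medians), q = 5;
# select(S, k) returns the kth smallest element of S (k = 1 .. len(S)).
def select(S, k, q=5):
    if len(S) < q:
        return sorted(S)[k - 1]
    sublists = [S[i:i+q] for i in range(0, len(S), q)]
    T = [sorted(x)[len(x) // 2] for x in sublists]      # medians of the sublists
    m = select(T, (len(T) + 1) // 2)                    # median of the medians
    L = [x for x in S if x < m]
    E = [x for x in S if x == m]
    G = [x for x in S if x > m]
    if len(L) >= k:
        return select(L, k)
    if len(L) + len(E) >= k:
        return m
    return select(G, k - len(L) - len(E))

data = [9, 1, 7, 3, 8, 2, 6, 5, 4, 0]
print(select(data, 5))   # 4, the 5th smallest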

33 Algorithm Complexity and Examples. T(n) = T(n/q) + T(3n/4) + cn. We must have q ≥ 5; for q = 5, the solution is T(n) = 20cn. Worked example (n = 25, q = 5): S is divided into n/q = 5 sublists of q = 5 elements; their medians form T; m is the median of the medians; splitting S around m gives |L| = 7, |E| = 2, |G| = 16. To find the 5th smallest element in S, select the 5th smallest element in L. The 9th smallest element of S is m itself, and the 13th smallest element of S is found by selecting the 4th smallest element in G (13 - 7 - 2 = 4). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 33

34 6A.3 A Parallel Selection Algorithm. Parallel rank-based selection algorithm PRAMselect(S, k, p), with p = O(n^(1-x)) processors: 1. if |S| < 4 then sort S and return the kth smallest element of S; else broadcast |S| to all p processors, divide S into p subsequences S(j) of size |S|/p, and have Processor j, 0 ≤ j < p, compute T_j := select(S(j), |S(j)|/2) endif [O(n^x) plus the small sequential selections]; 2. m = PRAMselect(T, |T|/2, p) {median of the medians} [T(n^x, p)]; 3. broadcast m to all processors and create the subsequences L (elements of S that are < m), E (= m), and G (> m) [O(n^x)]; 4. if |L| ≥ k then return PRAMselect(L, k, p); else if |L| + |E| ≥ k then return m; else return PRAMselect(G, k - |L| - |E|, p) endif [T(3n/4, p)]. As in the sequential version, m is the median of the medians, so at least n/4 elements are ≤ m and at least n/4 are ≥ m. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 34

35 Algorithm Complexity and Efficiency. T(n, p) = T(n^x, p) + T(3n/4, p) + cn^x. The solution is O(n^x); verify by substitution. Speedup = Θ(n)/O(n^x) = Ω(n^(1-x)) = Ω(p). Efficiency = speedup / p = Ω(1). Work(n, p) = pT(n, p) = Θ(n^(1-x)) × Θ(n^x) = Θ(n). Remember p = O(n^(1-x)). What happens if we set x to 1? (i.e., use one processor): T(n, 1) = O(n^x) = O(n). What happens if we set x to 0? (i.e., use n processors): T(n, n) = O(n^x) = O(1)? No, because in the asymptotic analysis we ignored several O(log n) terms compared with the O(n^x) terms. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 35

36 Data Movement in Step 3 of the Algorithm. Among the n^x items handled by a processor, suppose a items are < m (destined for L), b items are = m (destined for E), and c items are > m (destined for G). Consider the sublist L: processor i contributes a_i items to this sublist. Processor 0 starts storing at location 0, processor 1 at location a_0, processor 2 at location a_0 + a_1, processor 3 at location a_0 + a_1 + a_2, and so on; these write offsets are obtained with a parallel prefix computation over the counts. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 36
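The per-processor write offsets are exactly an exclusive (diminished) prefix sum of the counts a_i; a tiny Python sketch with hypothetical names:

# Sketch: exclusive prefix sum of the per-processor counts gives each processor's
# starting location in the packed sublist L.
from itertools import accumulate

def write_offsets(counts):
    # processor 0 starts at 0, processor i at a_0 + ... + a_(i-1)
    return [0] + list(accumulate(counts))[:-1]

a = [3, 1, 4, 2]                 # items contributed to L by processors 0..3
print(write_offsets(a))          # [0, 3, 4, 8]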

37 6A.4 A Selection-Based Sorting Algorithm. Parallel selection-based sort PRAMselectionsort(S, p), with p = n^(1-x) and k = 2^(1/x): 1. if |S| < k then return quicksort(S) [O(1) test]; 2. for i = 1 to k - 1 do m_i := PRAMselect(S, i|S|/k, p) {let m_0 := -infinity; m_k := +infinity} endfor [O(n^x)]; 3. for i = 0 to k - 1 do make the sublist T(i) from the elements of S in (m_i, m_(i+1)) endfor [O(n^x)]; 4. for i = 1 to k/2 do in parallel PRAMselectionsort(T(i), 2p/k) {2p/k processors used for each of the k/2 subproblems} endfor [T(n/k, 2p/k)]; 5. for i = k/2 + 1 to k do in parallel PRAMselectionsort(T(i), 2p/k) endfor [T(n/k, 2p/k)]. Fig. 6.1: Partitioning of the sorted list for selection-based sorting: k sublists of n/k elements each, delimited by the thresholds m_0 < m_1 < ... < m_k. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 37

38 Algorithm Complexity and Efficiency. T(n, p) = 2T(n/k, 2p/k) + cn^x. The solution is O(n^x log n); verify by substitution. Speedup(n, p) = Ω(n log n)/O(n^x log n) = Ω(n^(1-x)) = Ω(p). Efficiency = speedup / p = Ω(1). Work(n, p) = pT(n, p) = Θ(n^(1-x)) × Θ(n^x log n) = Θ(n log n). Remember p = O(n^(1-x)). What happens if we set x to 1? (i.e., use one processor): T(n, 1) = O(n^x log n) = O(n log n). Our asymptotic analysis is valid for x > 0 but not for x = 0; i.e., PRAMselectionsort cannot sort p keys in optimal O(log p) time. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 38

39 Example of Parallel Sorting. S is a list of n = 25 keys. Threshold values for k = 4 (i.e., x = 1/2 and p = n^(1/2) = 5 processors): m_0 = -infinity; n/k = 25/4, so m_1 = PRAMselect(S, 6, 5); 2n/k = 50/4, so m_2 = PRAMselect(S, 13, 5) = 4; 3n/k = 75/4, so m_3 = PRAMselect(S, 19, 5) = 6; m_4 = +infinity. The keys of S are then distributed into the four sublists delimited by these thresholds, and each sublist is sorted (in two parallel batches). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 39

40 6A.5 Alternative Sorting Algorithms. Sorting via random sampling (assume p << n). Given a large list S of inputs, a random sample of the elements can be used to find k comparison thresholds; it is easier if we pick k = p, so that each of the resulting subproblems is handled by a single processor. Parallel randomized sort PRAMrandomsort(S, p): 1. Processor j, 0 ≤ j < p, pick |S|/p^2 random samples of its |S|/p elements and store them in its corresponding section of a list T of length |S|/p; 2. Processor 0 sort the list T {the comparison threshold m_i is the (i|S|/p^2)th element of T}; 3. Processor j, 0 ≤ j < p, store its elements falling in (m_i, m_(i+1)) into T(i); 4. Processor j, 0 ≤ j < p, sort the sublist T(j). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 40

41 Parallel Radixsort. In the binary version of radixsort, we examine every bit of the k-bit keys in turn, starting from the LSB. In step i, bit i is examined, 0 ≤ i < k; records are stably sorted by the value of the ith key bit. The slide's table shows an input list of small integers with their binary forms, and the list after sorting by the LSB, by the middle bit, and by the MSB. Question: how are the data movements performed? Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 41

42 Data Movements in Parallel Radixsort. For the current bit, complement the bit values and compute a diminished (exclusive) prefix sum over the complemented bits to obtain the new positions of the records whose bit is 0; then compute a prefix sum over the bit values, shifted up by the count of 0s, to obtain the positions of the records whose bit is 1 (the slide's table walks through one such pass). The running time consists mainly of the time to perform 2k parallel prefix computations: O(log p) for constant k. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 42
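A sequential Python sketch of one stable radixsort pass expressed through prefix sums, followed by the full LSB-first loop (the sample key list is arbitrary, not necessarily the slide's example):

# Sketch of one pass of parallel radixsort via prefix sums: records whose current
# bit is 0 are packed first (stably), then records whose bit is 1.
from itertools import accumulate

def radix_pass(keys, bit):
    flags = [(k >> bit) & 1 for k in keys]
    zeros = [1 - f for f in flags]
    pos0 = [s - z for s, z in zip(accumulate(zeros), zeros)]   # exclusive prefix sums
    n_zeros = sum(zeros)
    pos1 = [n_zeros + s - f for s, f in zip(accumulate(flags), flags)]
    out = [None] * len(keys)
    for i, k in enumerate(keys):
        out[pos1[i] if flags[i] else pos0[i]] = k
    return out

def radix_sort(keys, bits):
    for b in range(bits):                     # LSB first, one stable pass per bit
        keys = radix_pass(keys, b)
    return keys

print(radix_sort([5, 7, 3, 1, 4, 2, 7, 2], bits=3))   # [1, 2, 2, 3, 4, 5, 7, 7]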

43 6A.6 Convex Hull of a 2D Point Set. Best sequential algorithm for p points: Ω(p log p) steps. Fig. 6.2: Defining the convex hull problem for a point set Q in the x-y plane. Fig. 6.3: Illustrating the properties of the convex hull CH(Q): the hull vertices are encountered in order of angle as the hull is traversed, and all points of Q fall on one side of each hull edge. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 43

44 PRAM Convex Hull Algorithm. Parallel convex hull algorithm PRAMconvexhull(S, p): 1. sort the point set by x coordinates; 2. divide the sorted list into √p subsets Q(i) of size √p, 0 ≤ i < √p; 3. find the convex hull of each subset Q(i) using √p processors; 4. merge the √p convex hulls CH(Q(i)) into the overall hull CH(Q). The upper hull and the lower hull are handled similarly. Fig. 6.4: Multiway divide and conquer for the convex hull problem. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 44

45 Merging of Partial Convex Hulls. Tangent lines between partial hulls are found through binary search in log time. Case (a): no point of CH(Q(i)) is on CH(Q); case (b): the points of CH(Q(i)) from A to B are on CH(Q). Analysis: T(p, p) = T(p^(1/2), p^(1/2)) + c log p ≈ 2c log p; the initial sorting also takes O(log p) time. Fig. 6.5: Finding points in a partial hull that belong to the combined hull. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 45

46 6B Implementation of Shared Memory. Main challenge: easing the memory access bottleneck, by providing low-latency, high-bandwidth paths to memory, reducing the need for access to nonlocal memory, and reducing conflicts and sensitivity to memory latency. Topics in This Chapter: 6B.1 Processor-Memory Interconnection; 6B.2 Multistage Interconnection Networks; 6B.3 Cache Coherence Protocols; 6B.4 Data Allocation for Conflict-Free Access; 6B.5 Distributed Shared Memory; 6B.6 Methods for Memory Latency Hiding. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 46

47 About the New Chapter 6B. This new chapter incorporates material from the following existing sections of the book: Some Implementation Aspects; Dimension-Order Routing; Plus-or-Minus-2^i Network; Multistage Interconnection Networks; Distributed Shared Memory; Data Access Problems and Caching; Cache Coherence Protocols; Multithreading and Latency Hiding; Coordination and Synchronization. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 47

48 Making PRAM Practical. PRAM needs a read and a write access to memory in every cycle. Even for a sequential computer, memory is tens to hundreds of times slower than arithmetic/logic operations; multiple processors accessing a shared memory only make the situation worse. Shared access to a single large physical memory isn't scalable. Strategies and focal points for making PRAM practical: 1. make memory accesses faster and more efficient (pipelining); 2. reduce the number of memory accesses (caching, reordering of accesses so as to do more computation per item fetched/stored); 3. reduce synchronization, so that slow memory accesses for one computation do not slow down others (synch, memory consistency); 4. distribute the memory and data to make most accesses local; 5. store data structures so as to reduce access conflicts (skewed storage). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 48

49 6B.1 Processor-Memory Interconnection. Fig. 4.3: A parallel processor with global (shared) memory: processors 0 to p-1 connect through a processor-to-memory network (with a separate processor-to-processor network) to memory modules 0 to m-1, plus parallel I/O. Options for the processor-to-memory network: bus(es), a bottleneck; crossbar, complex and expensive; MIN, a compromise. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 49

50 Processor-to-Memory Network. Crossbar switches offer full permutation capability (they are nonblocking), but are complex and expensive: O(p^2) cost. Even with a permutation network, full PRAM functionality is not realized: two processors cannot access different addresses in the same memory module. Practical processor-to-memory networks cannot realize all permutations (they are blocking). The figure shows an 8 × 8 crossbar switch connecting processors P0-P7 to memory modules M0-M7. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 50

51 Bus-Based Interconnections. Single-bus system: bandwidth bottleneck; bus loading limit; scalability very poor; single failure point; conceptually simple; forced serialization. Multiple-bus system: bandwidth improved; bus loading limit; scalability poor; more robust; more complex scheduling; simple serialization. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 51

52 Back-of-the-Envelope Bus Bandwidth Calculation. Single-bus system: bus frequency 0.5 GHz, data width 256 b (32 B), one memory access taking 2 bus cycles, so peak bandwidth = (0.5 G / 2) × 32 B = 8 GB/s. Bus cycle = 2 ns; memory cycle = 100 ns, so 1 memory cycle = 50 bus cycles. Multiple-bus system: peak bandwidth is multiplied by the number of buses (actual bandwidth is likely to be much less). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 52
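The same arithmetic, written out as a few lines of Python using the values as reconstructed above (treat the numbers as assumptions carried over from the slide):

# Back-of-the-envelope peak bus bandwidth: 0.5 GHz bus, 32-byte transfers,
# one memory access per 2 bus cycles.
bus_freq_hz = 0.5e9
bytes_per_transfer = 32            # 256 bits
cycles_per_access = 2
peak_bw = bus_freq_hz / cycles_per_access * bytes_per_transfer
print(peak_bw / 1e9, "GB/s")       # 8.0 GB/s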

53 Hierarchical Bus Interconnection. Low-level clusters of processors share intracluster buses, with bus switches (gateways) linking them to higher-level buses; the result can be a heterogeneous system. Fig. 4.9: Example of a hierarchical interconnection architecture. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 53

54 Removing the Processor-to-Memory Bottleneck. Challenge: cache coherence. Fig. 4.4: A parallel processor with global memory and processor caches: processors 0 to p-1, each with a private cache, connect through a processor-to-memory network to memory modules 0 to m-1, plus parallel I/O. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 54

55 Why Data Caching Works. Hit rate r (fraction of memory accesses satisfied by cache): C_eff = C_fast + (1 - r) C_slow. Cache parameters: size; block length (line width); placement policy; replacement policy; write policy. The figure (data storage and access in a two-way set-associative cache) shows the address split into tag, index, and word-offset fields; the two candidate words and their tags are read out, and the tag comparisons either select the data or signal a cache miss. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 55

56 Benefits of Caching Formulated as Amdahl's Law. Hit rate r (fraction of memory accesses satisfied by cache): C_eff = C_fast + (1 - r) C_slow, so S = C_slow / C_eff = 1 / ((1 - r) + C_fast/C_slow). (Analogy from Parhami's Computer Architecture text, 2005: register file = desktop, cache = drawer, main memory = file cabinet, with progressively longer access times.) This corresponds to the miss-rate fraction 1 - r of accesses being unaffected and the hit-rate fraction r (almost 1) being speeded up by a factor C_slow/C_fast. Generalized form of Amdahl's speedup formula: S = 1/(f_1/p_1 + f_2/p_2 + ... + f_m/p_m), with f_1 + f_2 + ... + f_m = 1. In this case, a fraction 1 - r is slowed down by a factor (C_slow + C_fast)/C_slow, and a fraction r is speeded up by a factor C_slow/C_fast. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 56
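A few lines of Python evaluating the formula above for sample hit rates (the cost ratio C_slow/C_fast = 50 is an arbitrary illustrative choice, not a value from the slide):

# Caching speedup: C_eff = C_fast + (1 - r) * C_slow, S = C_slow / C_eff.
def cache_speedup(r, c_fast, c_slow):
    c_eff = c_fast + (1 - r) * c_slow
    return c_slow / c_eff

for r in (0.90, 0.95, 0.99):
    print(r, round(cache_speedup(r, c_fast=1, c_slow=50), 1))
# as r approaches 1, the speedup approaches C_slow / C_fast = 50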

57 6B.2 Multistage Interconnection Networks. The vector-parallel Cray Y-MP computer. The figure shows key elements of the Cray Y-MP processor (address registers, address function units, instruction buffers, and control not shown): the CPU's vector registers V0-V7, T and S registers (8 64-bit each), vector integer units (add, shift, logic, weight/parity), floating-point units (add, multiply, reciprocal approximation), scalar integer units, the central memory, inter-processor communication, and input/output. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 57

58 Cray Y-MP's Interconnection Network. Fig.: The processor-to-memory interconnection network of the Cray Y-MP: processors P0-P7 connect through sections and subsections to interleaved memory banks, with banks assigned to subsections in strides of 4 (bank numbers 0, 4, 8, ... in one group, 1, 5, 9, ... in the next, and so on, up to bank 255). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 58

59 Butterfly Processor-to-Memory Network. Fig. 6.9: Example of a multistage memory access network: p processors, log2 p columns of 2-by-2 switches, p memory banks. Two ways to use the butterfly: with 1-by-2 and 2-by-1 edge switches, or with 2-by-2 switches throughout. It is not a full permutation network (e.g., a processor cannot be connected to the memory bank alongside the two connections shown). It is self-routing: the bank address determines the route, each address bit selecting the upper or lower switch output in the corresponding column (in the slide's example, the request is routed lower, upper, upper). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 59

60 Butterfly as Multistage Interconnection Network. Fig. 6.9: Example of a multistage memory access network: p processors, log2 p (or log2 p + 1) columns of 2-by-2 switches, p memory banks. Generalization of the butterfly network: a high-radix or m-ary butterfly, built of m-by-m switches, has m^q rows and q + 1 columns (q if wrapped). Fig. 15.8: Butterfly network used to connect modules that are on the same side. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 60

61 Self-Routing on a Butterfly Network. Fig. 14.6: Example dimension-order routing paths (dimensions 0, 1, 2; ascend, then descend). The number of cross links taken equals the length of the path in the hypercube. From node 3 to node 6: routing tag = 011 xor 110 = 101 = cross-straight-cross. From node 3 to node 5: routing tag = 011 xor 101 = 110 = straight-cross-cross. From node 6 to node 1: routing tag = 110 xor 001 = 111 = cross-cross-cross. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 61
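A minimal Python sketch of dimension-order self-routing consistent with the examples as reconstructed above (the bit-order convention, dimension 0 first, is an assumption):

# Dimension-order (e-cube) self-routing: the routing tag is src XOR dst;
# tag bit i = 1 means "cross" in dimension i, 0 means "straight".
def route(src, dst, dims=3):
    tag = src ^ dst
    return ["cross" if (tag >> i) & 1 else "straight" for i in range(dims)]

print(route(3, 6))   # ['cross', 'straight', 'cross']  (tag = 101)
print(route(3, 5))   # ['straight', 'cross', 'cross']  (tag = 110)
print(route(6, 1))   # ['cross', 'cross', 'cross']     (tag = 111)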

62 Butterfly Is Not a Permutation Network. Fig. 14.7: Packing is a good routing problem for dimension-order routing on the hypercube (the paths for messages A, B, C, D do not conflict). Fig. 14.8: Bit-reversal permutation is a bad routing problem for dimension-order routing on the hypercube (many paths are forced through the same links). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 62

63 Structure of Butterfly Networks. Fig. 15.5: Butterfly network with permuted dimensions. Switching two row pairs converts it to the original butterfly network; changing the order of stages in a butterfly is thus equivalent to a relabeling of the rows (in this example, row xyz becomes row xzy). The figure also shows the 16-row butterfly network. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 63

64 Beneš Network. Processors on one side, memory banks on the other, connected by 2 log2 p - 1 columns of 2-by-2 switches. Fig. 15.9: Beneš network formed from two back-to-back butterflies. A 2^q-row Beneš network can route any 2^q × 2^q permutation: it is rearrangeable. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 64

65 Routing Paths in a Beneš Network. To which memory modules can we connect processor 4 without rearranging the other paths? What about processor 6? The network has 2^q rows and 2q + 1 columns, with the inputs on one side and the outputs on the other. Fig.: Another example of a Beneš network. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 65

66 Augmented Data Manipulator Network. The data manipulator network was used in the Goodyear MPP, an early SIMD parallel machine. "Augmented" means that the switches in a column are independent, as opposed to all being set to the same state (simplified control). The network has 2^q rows and q + 1 columns. Fig.: Augmented data manipulator network. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 66

67 Fat Trees. Fig. 15.6: Two representations of a fat tree connecting processors P1-P8: front view, a binary tree whose links get fatter toward the root; side view, an inverted binary tree (an ordinary tree with uniform link widths would be a "skinny tree"). Fig. 15.7: Butterfly network redrawn as a fat tree. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 67

68 The Sea of Indirect Interconnection Networks. Numerous indirect or multistage interconnection networks (MINs) have been proposed for, or used in, parallel computers. They differ in topological, performance, robustness, and realizability attributes. We have already seen the butterfly, hierarchical bus, Beneš, and ADM networks. Fig. 4.8 (modified): The sea of indirect interconnection networks. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 68

69 Self-Routing Permutation Networks. Do there exist self-routing permutation networks? (The butterfly network is self-routing, but it is not a permutation network.) Permutation routing through a MIN is the same problem as sorting. Fig.: Example of sorting on a binary radix sort network; the records are sorted by the MSB, then by the middle bit, then by the LSB as they pass through the stages. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 69

70 Partial List of Important MINs. Augmented data manipulator (ADM): aka unfolded PM2I network. Banyan: any MIN with a unique path between any input and any output (e.g., butterfly). Baseline: butterfly network with nodes labeled differently. Beneš: back-to-back butterfly networks, sharing one column. Bidelta: a MIN that is a delta network in either direction. Butterfly: aka unfolded hypercube. Data manipulator: same as ADM, but with the switches in a column restricted to the same state. Delta: any MIN for which the outputs of each switch have distinct labels (say 0 and 1 for 2-by-2 switches) and the path label, composed by concatenating the switch output labels leading from an input to an output, depends only on the output. Flip: reverse of the omega network (inputs and outputs interchanged). Indirect cube: same as butterfly or omega. Omega: multistage shuffle-exchange network; isomorphic to the butterfly. Permutation: any MIN that can realize all permutations. Rearrangeable: same as permutation network. Reverse baseline: baseline network, with the roles of inputs and outputs interchanged. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 70

71 6B.3 Cache Coherence Protocols. Fig.: Various types of cached data blocks in a parallel processor with global memory and processor caches: a block may be multiple consistent (cached by several processors with the same value), single consistent, single inconsistent (modified in one cache), or invalid. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 71

72 Example: A Bus-Based Snoopy Protocol. Each cache block is in one of three states, Invalid, Shared (read-only), or Exclusive (read/write), and each transition is labeled with the event that triggers it, followed by the action(s) that must be taken. For example: a CPU read or write hit in Exclusive stays in Exclusive; a CPU read miss in Exclusive writes back the block and puts a read miss on the bus; a CPU write miss in Exclusive writes back the block and puts a write miss on the bus; a bus read miss for a block held Exclusive forces a write-back (to Shared), while a bus write miss forces a write-back and invalidation; a CPU write hit or miss in Shared puts a write miss on the bus and moves to Exclusive; a CPU read miss in Invalid or Shared puts a read miss on the bus. Fig.: Finite-state control mechanism for a bus-based snoopy cache coherence protocol. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 72

73 Implementing a Snoopy Protocol. A second tags/state storage unit allows snooping to be done concurrently with normal cache operation; getting all the implementation timing and details right is nontrivial. The figure (from Parhami's Computer Architecture text) shows duplicate tags and state store for the snoop side, the main tags and state store for the processor side, the cache data array, and the snoop-side and processor-side cache controllers attached to the system bus. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 73

74 Scalable (Distributed) Shared Memory. Fig. 4.5: A parallel processor with distributed memory: memories and processors 0 to p-1 paired at the nodes of an interconnection network, plus parallel I/O. Some terminology: NUMA, nonuniform memory access (distributed shared memory); UMA, uniform memory access (global shared memory); COMA, cache-only memory architecture. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 74

75 Example: A Directory-Based Protocol. A directory entry for a block is in one of three states: Uncached, Shared (read-only), or Exclusive (read/write); c denotes the cache sending the message. From Uncached: read miss, return data value, sharing set = {c} (to Shared); write miss, return data value, sharing set = {c} (to Exclusive). From Shared: read miss, return data value, sharing set = sharing set + {c}; write miss, invalidate the other copies, sharing set = {c}, return data value (to Exclusive). From Exclusive: read miss, fetch data value, return data value, sharing set = sharing set + {c} (to Shared); write miss, fetch data value, request invalidation, return data value, sharing set = {c}; data write-back, sharing set = { } (to Uncached). (The slide marks one of these transitions as a correction to the text.) Fig.: States and transitions for a directory entry in a directory-based coherence protocol. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 75

76 Implementing a Directory-Based Protocol. The sharing set can be implemented as a bit-vector (simple, but not scalable). When there are many more nodes (caches) than the typical size of a sharing set, a list of sharing units may be maintained in the directory instead. The sharing set can also be maintained as a distributed doubly linked list through the caches themselves (to be discussed in connection with the SCI standard). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 76

77 6B.4 Data Allocation for Conflict-Free Access. Try to store the data such that parallel accesses are to different banks; for many data structures, a compiler may perform the memory mapping. Fig. 6.6: Matrix storage in column-major order to allow concurrent accesses to rows: each column of the 6 × 6 matrix is stored in a different memory module (bank), so accessing a row is conflict-free, but accessing a column leads to conflicts. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 77

78 Skewed Storage Format. Fig. 6.7: Skewed matrix storage for conflict-free accesses to rows and columns: each row of the 6 × 6 matrix is rotated before storage (e.g., element (i, j) goes to module (i + j) mod 6), so the elements of any row and the elements of any column both fall in distinct modules. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 78

79 A Unified Theory of Conflict-Free Access. Vector indices: A_ij is viewed as vector element i + jm. Fig. 6.8: A 6 × 6 matrix viewed, in column-major order, as a 36-element vector. A q-D array can be viewed as a vector, with row/column/diagonal accesses associated with constant strides: column: k, k+1, k+2, k+3, k+4, k+5 (stride 1); row: k, k+m, k+2m, k+3m, k+4m, k+5m (stride m); diagonal: k, k+(m+1), k+2(m+1), k+3(m+1), k+4(m+1), k+5(m+1) (stride m+1); antidiagonal: k, k+(m-1), k+2(m-1), k+3(m-1), k+4(m-1), k+5(m-1) (stride m-1). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 79

80 Linear Skewing Schemes. A_ij is viewed as vector element i + jm. Place vector element i in memory bank a + bi mod B (the word address within the bank is irrelevant to conflict-free access; also, a can be set to 0). With a linear skewing scheme, vector elements k, k + s, k + 2s, ..., k + (B - 1)s will be assigned to different memory banks iff sb is relatively prime with respect to the number B of memory banks. A prime value for B ensures this condition, but is not very practical. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 80
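A short Python check of the linear-skewing condition stated above (b = 1 and the strides chosen are illustrative assumptions):

# Elements k, k+s, ..., k+(B-1)s land in distinct banks (bank = b*i mod B)
# iff gcd(s*b, B) = 1; a prime B makes this hold for every stride not a multiple of B.
from math import gcd

def conflict_free(stride, B, b=1):
    banks = {(b * (stride * i)) % B for i in range(B)}
    return len(banks) == B

B = 7                                   # a prime number of banks
for s in (1, 6, 7, 8):                  # e.g., strides m-1, m, m+1 around m = 7
    print(s, conflict_free(s, B), gcd(s * 1, B) == 1)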

81 6B.5 Distributed Shared Memory. Fig. 4.5 (as before): A parallel processor with distributed memory: memories and processors 0 to p-1 paired at the nodes of an interconnection network, plus parallel I/O. Some terminology: NUMA, nonuniform memory access (distributed shared memory); UMA, uniform memory access (global shared memory); COMA, cache-only memory architecture. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 81

82 Butterfly-Based Distributed Shared Memory. Randomized emulation of the p-processor PRAM on a p-node butterfly: use a hash function to map memory locations to modules, so the p accessed locations map to p modules, not necessarily distinct. Each node (one of p = 2^q (q + 1) processors) consists of a router, a processor, and a memory module holding m/p memory locations; the butterfly has 2^q rows and q + 1 columns. With high probability, at most O(log p) of the p locations will be in modules located in the same row; average slowdown = O(log p). Fig.: Butterfly distributed-memory machine emulating the PRAM. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 82

83 PRAM Emulation with Butterfly MIN. Emulation of the p-processor PRAM on a (p log p)-node butterfly, with memory modules and processors connected to the two sides; O(log p) average slowdown. Here each of the p = 2^q processors sits on one side and each memory module (holding m/p memory locations) on the other, with the butterfly (2^q rows, q + 1 columns) in between. This is less efficient than the previous scheme, which uses a smaller butterfly. By using p/(log p) physical processors to emulate the p-processor PRAM, this new emulation scheme becomes quite efficient (pipeline the memory accesses of the log p virtual processors assigned to each physical processor). Fig.: Distributed-memory machine, with a butterfly multistage interconnection network, emulating the PRAM. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 83

84 Deterministic Shared-Memory Emulation. Deterministic emulation of the p-processor PRAM on a p-node butterfly (one of p = 2^q (q + 1) processors per node): store log2 m copies of each of the m memory location contents and time-stamp each updated value. A write is complete once a majority of the copies are updated; a read is satisfied when a majority of the copies are accessed and the one with the latest time stamp is used. Why it works: a few congested links won't delay the operation, and any write set and read set of copies (each a majority) must intersect in an up-to-date copy. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 84

85 PRAM Emulation Using Information Dispersal. Instead of (log m)-fold replication of data, divide each data element into k pieces and encode the pieces with a redundancy factor of about 3, so that a fraction (about k/3) of the encoded pieces suffices for reconstructing the original data. The figure shows the original data word and its k pieces, the k pieces after encoding (approximately three times larger), a possible update set and a possible read set of pieces (chosen so that they must overlap in enough up-to-date pieces), and the reconstruction algorithm recovering the original data word from the up-to-date encoded pieces. Fig.: Illustrating the information dispersal approach to PRAM emulation with lower data redundancy. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 85

86 6B.6 Methods for Memory Latency Hiding. By assumption, PRAM accesses memory locations right when they are needed, so processing must stall until data is fetched; this is not a smart strategy when the memory access time is on the order of 100 times that of an add. Method 1: predict accesses (prefetch). Method 2: pipeline multiple accesses, so that a processor's access requests and responses overlap in the processor-to-memory network. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 86

87 6C Shared-Memory Abstractions. A precise memory view is needed for correct algorithm design; sequential consistency facilitates programming; less strict consistency models offer better performance. Topics in This Chapter: 6C.1 Atomicity in Memory Access; 6C.2 Strict and Sequential Consistency; 6C.3 Processor Consistency; 6C.4 Weak or Synchronization Consistency; 6C.5 Other Memory Consistency Models; 6C.6 Transactional Memory. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 87

88 6C.1 Atomicity in Memory Access. Performance optimization and latency hiding often imply that memory accesses are interleaved and perhaps not serviced in the order issued (the figure again shows processors' access requests and responses flowing through the processor-to-memory network to the memory modules). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 88

89 Barrier Synchronization Overhead. The figure shows several processes reaching two successive barriers at different times; the waiting intervals are synchronization overhead. Given that AND is a semigroup computation, it is only a small step to generalize it to a more flexible global combine operation. Reduction of synchronization overhead: providing hardware aid to do it faster; using less frequent synchronizations. Fig.: The performance benefit of less frequent synchronization. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 89

90 Synchronization via Message Passing. Vertex v_j represents task or computation j; task interdependence is often more complicated than the simple prerequisite structure thus far considered. For a task graph: T(1) = number of nodes, T(infinity) = depth of the graph, and T(p) = latency with p processors. In the detailed view of a dependence of task B on task A (B needs a value x produced by A), the communication latency separates A's send from B's receive, and B waits: Process A: send x; Process B: receive x. Fig.: Automatic synchronization in message-passing systems. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 90

91 Synchronization with Shared Memory. Accomplished by accessing specially designated shared control variables. The fetch-and-add instruction constitutes a useful atomic operation: if the current value of x is c, fetch-and-add(x, a) returns c to the process and overwrites x = c with the value c + a; a second process executing fetch-and-add(x, b) then gets the now current value c + a and modifies it to c + a + b. Why atomicity of fetch-and-add is important: with ordinary instructions, the steps of fetch-and-add for processes A and B may be interleaved as follows. Time step 1: A reads x (A's accumulator holds c). Time step 2: B reads x (B's accumulator holds c). Time step 3: A adds a (A's accumulator holds c + a). Time step 4: B adds b (B's accumulator holds c + b). Time step 5: A stores x (x holds c + a). Time step 6: B stores x (x holds c + b, not c + a + b). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 91
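A hedged Python sketch contrasting the interleaving problem with an atomic fetch-and-add; the lock merely stands in for the hardware atomicity of a real fetch-and-add instruction:

# Sketch: fetch-and-add as an indivisible read-modify-write.
import threading

x = 0
lock = threading.Lock()

def fetch_and_add(amount):
    global x
    with lock:                 # read, add, and store happen as one indivisible step
        old = x
        x = old + amount
    return old

threads = [threading.Thread(target=fetch_and_add, args=(1,)) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(x)   # always 100; without atomicity, interleaved updates could be lost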

92 Barrier Synchronization: Implementations. Make each processor, in a designated set, wait at a barrier until all other processors have arrived at the corresponding points in their computations. Software implementation: via fetch-and-add or a similar instruction. Hardware implementation: via an AND tree (raise flag, check AND result). A problem with the AND tree: if a processor can be randomly delayed between raising its flag and checking the tree output, some processors might cross the barrier and lower their flags before others have noticed the change in the AND-tree output. Solution: use two AND trees for alternating barrier points, with a set/reset flip-flop combining their outputs into the barrier signal. Fig.: Example of hardware aid for fast barrier synchronization [Hoar96]. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 92

93 6C.2 Strict and Sequential Consistency. Strict consistency: a read operation always returns the result of the latest write operation on that data object. Strict consistency is impossible to maintain in a distributed system, which does not have a global clock; while clocks can be synchronized, there is always some error that causes trouble in near-simultaneous operations. Example: three processes A, B, and C share several variables, each issuing its own sequence of reads (r) and writes (w) over real time; deciding which write is the "latest" for a given read is then problematic. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 93

94 Sequential Consistency. Sequential consistency (original definition): the result of any execution is the same as if the processor operations were executed in some sequential order, and the operations of a particular processor appear in that sequence in the order specified by the program it runs. The slide shows the three processes' read/write sequences from the previous example together with two possible global orderings consistent with them. Sequential consistency (new definition): write operations on the same data object are seen in exactly the same order by all system nodes. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 94

95 The Performance Penalty of Sequential Consistency. Initially X = 0 and Y = 0. Thread 1: X := 1; R1 := Y. Thread 2: Y := 1; R2 := X. Under sequential consistency, every interleaved execution performs at least one of the writes before the corresponding read, so R1 and R2 cannot both end up 0. If a compiler reorders the seemingly independent statements in Thread 1, the desired semantics (R1 and R2 not both being 0) is compromised. Relaxed consistency (memory model): ease the requirements on doing things in program order and/or write atomicity to gain performance; when maintaining order is absolutely necessary, we use synchronization primitives to enforce it. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 95
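A small Python sketch that enumerates all program-order-respecting interleavings of the two threads above and confirms that (R1, R2) = (0, 0) never occurs under sequential consistency (the encoding of the threads is an illustrative assumption):

# Enumerate sequentially consistent interleavings of:
#   Thread 1: X := 1; R1 := Y      Thread 2: Y := 1; R2 := X
from itertools import permutations

def run(schedule):
    mem = {'X': 0, 'Y': 0}
    regs = {}
    prog = {('T1', 0): ('X', 1), ('T1', 1): ('R1', 'Y'),
            ('T2', 0): ('Y', 1), ('T2', 1): ('R2', 'X')}
    for step in schedule:
        dst, src = prog[step]
        if dst.startswith('R'):
            regs[dst] = mem[src]               # read into a register
        else:
            mem[dst] = src                     # write a constant
    return regs['R1'], regs['R2']

outcomes = set()
for order in permutations([('T1', 0), ('T1', 1), ('T2', 0), ('T2', 1)]):
    # keep only schedules that preserve each thread's program order
    if order.index(('T1', 0)) < order.index(('T1', 1)) and \
       order.index(('T2', 0)) < order.index(('T2', 1)):
        outcomes.add(run(order))
print(sorted(outcomes))    # (0, 0) is absent under sequential consistency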

Part II. Extreme Models. Fall 2008 Parallel Processing, Extreme Models Slide 1

Part II. Extreme Models. Fall 2008 Parallel Processing, Extreme Models Slide 1 Part II Extreme Models Fall 8 Parallel Processing, Extreme Models Slide About This Presentation This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms

More information

CSE Introduction to Parallel Processing. Chapter 5. PRAM and Basic Algorithms

CSE Introduction to Parallel Processing. Chapter 5. PRAM and Basic Algorithms Dr Izadi CSE-40533 Introduction to Parallel Processing Chapter 5 PRAM and Basic Algorithms Define PRAM and its various submodels Show PRAM to be a natural extension of the sequential computer (RAM) Develop

More information

Part II. Extreme Models. Spring 2006 Parallel Processing, Extreme Models Slide 1

Part II. Extreme Models. Spring 2006 Parallel Processing, Extreme Models Slide 1 Part II Extreme Models Spring 6 Parallel Processing, Extreme Models Slide About This Presentation This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:

More information

CSE Introduction to Parallel Processing. Chapter 4. Models of Parallel Processing

CSE Introduction to Parallel Processing. Chapter 4. Models of Parallel Processing Dr Izadi CSE-4533 Introduction to Parallel Processing Chapter 4 Models of Parallel Processing Elaborate on the taxonomy of parallel processing from chapter Introduce abstract models of shared and distributed

More information

Parallel Algorithms for (PRAM) Computers & Some Parallel Algorithms. Reference : Horowitz, Sahni and Rajasekaran, Computer Algorithms

Parallel Algorithms for (PRAM) Computers & Some Parallel Algorithms. Reference : Horowitz, Sahni and Rajasekaran, Computer Algorithms Parallel Algorithms for (PRAM) Computers & Some Parallel Algorithms Reference : Horowitz, Sahni and Rajasekaran, Computer Algorithms Part 2 1 3 Maximum Selection Problem : Given n numbers, x 1, x 2,, x

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #4 1/24/2018 Xuehai Qian xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Announcements PA #1

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

COMP Parallel Computing. PRAM (4) PRAM models and complexity

COMP Parallel Computing. PRAM (4) PRAM models and complexity COMP 633 - Parallel Computing Lecture 5 September 4, 2018 PRAM models and complexity Reading for Thursday Memory hierarchy and cache-based systems Topics Comparison of PRAM models relative performance

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

CS256 Applied Theory of Computation

CS256 Applied Theory of Computation CS256 Applied Theory of Computation Parallel Computation IV John E Savage Overview PRAM Work-time framework for parallel algorithms Prefix computations Finding roots of trees in a forest Parallel merging

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

The PRAM model. A. V. Gerbessiotis CIS 485/Spring 1999 Handout 2 Week 2

The PRAM model. A. V. Gerbessiotis CIS 485/Spring 1999 Handout 2 Week 2 The PRAM model A. V. Gerbessiotis CIS 485/Spring 1999 Handout 2 Week 2 Introduction The Parallel Random Access Machine (PRAM) is one of the simplest ways to model a parallel computer. A PRAM consists of

More information

Interconnection networks

Interconnection networks Interconnection networks When more than one processor needs to access a memory structure, interconnection networks are needed to route data from processors to memories (concurrent access to a shared memory

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Lecture 3: Sorting 1

Lecture 3: Sorting 1 Lecture 3: Sorting 1 Sorting Arranging an unordered collection of elements into monotonically increasing (or decreasing) order. S = a sequence of n elements in arbitrary order After sorting:

More information

PIPELINE AND VECTOR PROCESSING

PIPELINE AND VECTOR PROCESSING PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates

More information

Fundamental Algorithms

Fundamental Algorithms Fundamental Algorithms Chapter 6: Parallel Algorithms The PRAM Model Jan Křetínský Winter 2017/18 Chapter 6: Parallel Algorithms The PRAM Model, Winter 2017/18 1 Example: Parallel Sorting Definition Sorting

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

F. THOMSON LEIGHTON INTRODUCTION TO PARALLEL ALGORITHMS AND ARCHITECTURES: ARRAYS TREES HYPERCUBES

F. THOMSON LEIGHTON INTRODUCTION TO PARALLEL ALGORITHMS AND ARCHITECTURES: ARRAYS TREES HYPERCUBES F. THOMSON LEIGHTON INTRODUCTION TO PARALLEL ALGORITHMS AND ARCHITECTURES: ARRAYS TREES HYPERCUBES MORGAN KAUFMANN PUBLISHERS SAN MATEO, CALIFORNIA Contents Preface Organization of the Material Teaching

More information

EE/CSCI 451 Midterm 1

EE/CSCI 451 Midterm 1 EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming

More information

SHARED MEMORY VS DISTRIBUTED MEMORY

SHARED MEMORY VS DISTRIBUTED MEMORY OVERVIEW Important Processor Organizations 3 SHARED MEMORY VS DISTRIBUTED MEMORY Classical parallel algorithms were discussed using the shared memory paradigm. In shared memory parallel platform processors

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

CS 1013 Advance Computer Architecture UNIT I

CS 1013 Advance Computer Architecture UNIT I CS 1013 Advance Computer Architecture UNIT I 1. What are embedded computers? List their characteristics. Embedded computers are computers that are lodged into other devices where the presence of the computer

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Part V. Some Broad Topic. Winter 2016 Parallel Processing, Some Broad Topics Slide 1

Part V. Some Broad Topic. Winter 2016 Parallel Processing, Some Broad Topics Slide 1 Part V Some Broad Topic Winter 06 Parallel Processing, Some Broad Topics Slide About This Presentation This presentation is intended to support the use of the textbook Introduction to Parallel Processing:

More information

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy

More information

Lecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC

Lecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC Lecture 9: Group Communication Operations Shantanu Dutt ECE Dept. UIC Acknowledgement Adapted from Chapter 4 slides of the text, by A. Grama w/ a few changes, augmentations and corrections Topic Overview

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Part VII Advanced Architectures. Feb Computer Architecture, Advanced Architectures Slide 1

Part VII Advanced Architectures. Feb Computer Architecture, Advanced Architectures Slide 1 Part VII Advanced Architectures Feb. 2011 Computer Architecture, Advanced Architectures Slide 1 About This Presentation This presentation is intended to support the use of the textbook Computer Architecture:

More information

Chapter 9 Multiprocessors

Chapter 9 Multiprocessors ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer Module 2: Divide and Conquer Divide and Conquer Control Abstraction for Divide &Conquer 1 Recurrence equation for Divide and Conquer: If the size of problem p is n and the sizes of the k sub problems are

More information

Fundamentals of. Parallel Computing. Sanjay Razdan. Alpha Science International Ltd. Oxford, U.K.

Fundamentals of. Parallel Computing. Sanjay Razdan. Alpha Science International Ltd. Oxford, U.K. Fundamentals of Parallel Computing Sanjay Razdan Alpha Science International Ltd. Oxford, U.K. CONTENTS Preface Acknowledgements vii ix 1. Introduction to Parallel Computing 1.1-1.37 1.1 Parallel Computing

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

Multiprocessor Interconnection Networks

Multiprocessor Interconnection Networks Multiprocessor Interconnection Networks Todd C. Mowry CS 740 November 19, 1998 Topics Network design space Contention Active messages Networks Design Options: Topology Routing Direct vs. Indirect Physical

More information

Main Points of the Computer Organization and System Software Module

Main Points of the Computer Organization and System Software Module Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Multiprocessors - Flynn s Taxonomy (1966)

Multiprocessors - Flynn s Taxonomy (1966) Multiprocessors - Flynn s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) Conventional uniprocessor Although ILP is exploited Single Program Counter -> Single Instruction stream The

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Parallel Random-Access Machines

Parallel Random-Access Machines Parallel Random-Access Machines Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS3101 (Moreno Maza) Parallel Random-Access Machines CS3101 1 / 69 Plan 1 The PRAM Model 2 Performance

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols Portland State University ECE 588/688 Directory-Based Cache Coherence Protocols Copyright by Alaa Alameldeen and Haitham Akkary 2018 Why Directory Protocols? Snooping-based protocols may not scale All

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Aleksandar Milenkovich 1

Aleksandar Milenkovich 1 Parallel Computers Lecture 8: Multiprocessors Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

Algorithms and Applications

Algorithms and Applications Algorithms and Applications 1 Areas done in textbook: Sorting Algorithms Numerical Algorithms Image Processing Searching and Optimization 2 Chapter 10 Sorting Algorithms - rearranging a list of numbers

More information

Memory Hierarchy Basics

Memory Hierarchy Basics Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!

More information

Advanced Parallel Architecture. Annalisa Massini /2017

Advanced Parallel Architecture. Annalisa Massini /2017 Advanced Parallel Architecture Annalisa Massini - 2016/2017 References Advanced Computer Architecture and Parallel Processing H. El-Rewini, M. Abd-El-Barr, John Wiley and Sons, 2005 Parallel computing

More information

Parallel Programming Platforms

Parallel Programming Platforms arallel rogramming latforms Ananth Grama Computing Research Institute and Department of Computer Sciences, urdue University ayg@cspurdueedu http://wwwcspurdueedu/people/ayg Reference: Introduction to arallel

More information

PARALLEL COMPUTER ARCHITECTURES

PARALLEL COMPUTER ARCHITECTURES 8 ARALLEL COMUTER ARCHITECTURES 1 CU Shared memory (a) (b) Figure 8-1. (a) A multiprocessor with 16 CUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors

More information

Hypercubes. (Chapter Nine)

Hypercubes. (Chapter Nine) Hypercubes (Chapter Nine) Mesh Shortcomings: Due to its simplicity and regular structure, the mesh is attractive, both theoretically and practically. A problem with the mesh is that movement of data is

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Sorting (Chapter 9) Alexandre David B2-206

Sorting (Chapter 9) Alexandre David B2-206 Sorting (Chapter 9) Alexandre David B2-206 1 Sorting Problem Arrange an unordered collection of elements into monotonically increasing (or decreasing) order. Let S = . Sort S into S =

More information

Parallel Architectures

Parallel Architectures Parallel Architectures Instructor: Tsung-Che Chiang tcchiang@ieee.org Department of Science and Information Engineering National Taiwan Normal University Introduction In the roughly three decades between

More information

Advanced cache optimizations. ECE 154B Dmitri Strukov

Advanced cache optimizations. ECE 154B Dmitri Strukov Advanced cache optimizations ECE 154B Dmitri Strukov Advanced Cache Optimization 1) Way prediction 2) Victim cache 3) Critical word first and early restart 4) Merging write buffer 5) Nonblocking cache

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will

More information

Parallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. Fall Parallel Sorting

Parallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. Fall Parallel Sorting Parallel Systems Course: Chapter VIII Sorting Algorithms Kumar Chapter 9 Jan Lemeire ETRO Dept. Fall 2017 Overview 1. Parallel sort distributed memory 2. Parallel sort shared memory 3. Sorting Networks

More information

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model. Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

The PRAM Model. Alexandre David

The PRAM Model. Alexandre David The PRAM Model Alexandre David 1.2.05 1 Outline Introduction to Parallel Algorithms (Sven Skyum) PRAM model Optimality Examples 11-02-2008 Alexandre David, MVP'08 2 2 Standard RAM Model Standard Random

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Design of Parallel Algorithms. The Architecture of a Parallel Computer

Design of Parallel Algorithms. The Architecture of a Parallel Computer + Design of Parallel Algorithms The Architecture of a Parallel Computer + Trends in Microprocessor Architectures n Microprocessor clock speeds are no longer increasing and have reached a limit of 3-4 Ghz

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information