Part II. Shared-Memory Parallelism. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 1
1 Part II Shared-Memory Parallelism: Shared-Memory Algorithms, Implementations, and Models. 5. PRAM and Basic Algorithms; 6A. More Shared-Memory Algorithms; 6B. Implementation of Shared Memory; 6C. Shared-Memory Abstractions.
2 About This Presentation. This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN ). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. Behrooz Parhami. Edition: First; released Spring 2005; revised Spring 2006, Fall 2008, Fall 2010, Winter 2013, Winter 2014, Winter 2016.
3 II Shared-Memory Parallelism. Shared memory is the most intuitive parallel user interface: Abstract SM (PRAM); ignores implementation issues. Implementation without worsening the memory bottleneck. Shared-memory models and their performance impact. Topics in This Part: Chapter 5 PRAM and Basic Algorithms; Chapter 6A More Shared-Memory Algorithms; Chapter 6B Implementation of Shared Memory; Chapter 6C Shared-Memory Abstractions.
4 5 PRAM and Basic Algorithms. PRAM, a natural extension of RAM (random-access machine): Present definitions of the model and its various submodels. Develop algorithms for key building-block computations. Topics in This Chapter: 5.1 PRAM Submodels and Assumptions; 5.2 Data Broadcasting; 5.3 Semigroup or Fan-in Computation; 5.4 Parallel Prefix Computation; 5.5 Ranking the Elements of a Linked List; 5.6 Matrix Multiplication.
5 Why Start with Shared Memory? Study one extreme of parallel computation models: Abstract SM (PRAM); ignores implementation issues. This abstract model is either realized or emulated; in the latter case, the benefits are similar to those of high-level languages (HLLs). Later, we will study the other extreme case of models: the concrete circuit model, which incorporates hardware details, allows explicit latency/area/energy trade-offs, and facilitates theoretical studies of speed-up limits. Everything else falls between these two extremes.
6 5.1 PRAM Submodels and Assumptions. Fig 4.6 Conceptual view of a parallel random-access machine (PRAM): p processors sharing a memory of m locations. Processor i can do the following in the three phases of one cycle: 1. Fetch a value from address s_i in shared memory; 2. Perform computations on data held in local registers; 3. Store a value into address d_i in shared memory.
7 Types of PRAM. Reads from same location: Exclusive or Concurrent; writes to same location: Exclusive or Concurrent. EREW: least powerful, most realistic. ERCW: not useful. CREW: default. CRCW: most powerful, further subdivided. Fig 5.1 Submodels of the PRAM model.
8 Types of CRCW PRAM. CRCW submodels are distinguished by the way they treat multiple writes: Undefined: the value written is undefined (CRCW-U). Detecting: a special code for detected collision is written (CRCW-D). Common: allowed only if they all store the same value (CRCW-C) [this is sometimes called the consistent-write submodel]. Random: the value is randomly chosen from those offered (CRCW-R). Priority: the processor with the lowest index succeeds (CRCW-P). Max / Min: the largest / smallest of the values is written (CRCW-M). Reduction: the arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or another combination of the values is written.
9 Power of CRCW PRAM Submodels. Model U is more powerful than model V if T_U(n) = o(T_V(n)) for some problem. EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P. Theorem 5.1: A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p). Intuitive justification for concurrent read emulation (write is similar): 1. Write the p memory addresses in a list; 2. Sort the list of addresses in ascending order; 3. Remove all duplicate addresses; 4. Access data at the desired addresses; 5. Replicate data via parallel prefix computation. Each of these steps requires constant or O(log p) time.
10 Implications of the CRCW Hierarchy of Submodels. EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P. A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p). Our most powerful CRCW submodel can thus be emulated by the least powerful submodel with logarithmic slowdown. Efficient parallel algorithms have polylogarithmic running times, and the running time is still polylogarithmic after the slowdown due to emulation. We therefore need not be too concerned with which CRCW submodel is used: simply use whichever submodel is most natural or convenient.
11 Some Elementary PRAM Computations. Initializing an n-vector (base address = B) to all 0s, viewing the vector as ceil(n/p) segments of p elements: for j = 0 to ceil(n/p) - 1, Processor i, 0 <= i < p, do: if jp + i < n then M[B + jp + i] := 0; endfor. Adding two n-vectors and storing the results in a third (base addresses B', B'', B'''). Convolution of two n-vectors: W_k = sum over i + j = k of U_i V_j (base addresses B_W, B_U, B_V).
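The vector-initialization loop above can be sketched in Python; this is a hypothetical simulation (not from the slides), where the outer loop runs over the ceil(n/p) segments sequentially and the inner loop stands in for the p processors acting in one parallel step.

```python
def pram_init_vector(n, p, value=0):
    """Simulate p PRAM processors initializing an n-vector in ceil(n/p) steps."""
    M = [None] * n
    segments = -(-n // p)            # ceil(n/p)
    for j in range(segments):        # sequential over segments (parallel steps)
        for i in range(p):           # conceptually parallel over the p processors
            if j * p + i < n:        # guard for the ragged last segment
                M[j * p + i] = value
    return M
```

Each step touches p distinct addresses, so the access pattern is exclusive-write and fits even the EREW submodel.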
12 5.2 Data Broadcasting. Making p copies of B[0] by recursive doubling: for k = 0 to ceil(log2 p) - 1, Proc j, 0 <= j < p, do: Copy B[j] into B[j + 2^k]; endfor. Can modify the algorithm so that redundant copying does not occur and the array bound is not exceeded. Fig 5.2 Data broadcasting in EREW PRAM via recursive doubling. Fig 5.3 EREW PRAM data broadcasting without redundant copying.
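A minimal Python sketch of the non-redundant variant mentioned above (hypothetical, names are mine): in step k, only the first 2^k processors copy, and only within the array bound, so every step reads and writes disjoint cells (EREW-compatible).

```python
import math

def erew_broadcast(B, p):
    """Recursive-doubling broadcast of B[0] into B[0..p-1] without redundant
    copying: step k doubles the number of copies from 2**k to 2**(k+1)."""
    for k in range(math.ceil(math.log2(p))):
        step = 2 ** k
        snapshot = list(B)           # all reads of a step precede all writes
        for j in range(step):        # conceptually parallel; distinct cells
            if j + step < p:
                B[j + step] = snapshot[j]
    return B
```

After ceil(log2 p) steps, all p cells hold the broadcast value.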
13 Class Participation: Broadcast-Based Sorting. Each person writes down an arbitrary nonnegative integer with a few digits on a piece of paper. Students take turns broadcasting their numbers by calling them out aloud. Each student puts an X on paper for every number called out that is smaller than his/her own number, or is equal but was called out before the student's own value. Each student counts the number of Xs on paper to determine the rank of his/her number. Students call out their numbers in order of the computed rank.
14 All-to-All Broadcasting on EREW PRAM. EREW PRAM algorithm for all-to-all broadcasting: Processor j, 0 <= j < p, write own data value into B[j]; for k = 1 to p - 1, Processor j, 0 <= j < p, do: Read the data value in B[(j + k) mod p]; endfor. This O(p)-step algorithm is time-optimal. Naive EREW PRAM sorting algorithm (using all-to-all broadcasting): Processor j, 0 <= j < p, write 0 into R[j]; for k = 1 to p - 1, Processor j, 0 <= j < p, do: l := (j + k) mod p; if S[l] < S[j], or S[l] = S[j] and l < j, then R[j] := R[j] + 1; endif; endfor; Processor j, 0 <= j < p, write S[j] into S[R[j]]. This O(p)-step sorting algorithm is far from optimal; sorting is possible in O(log p) time.
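The naive rank-based sort above can be sketched as follows (a hypothetical simulation, not the slides' code): each "processor" j counts, over p - 1 rounds, how many keys precede its own, then writes its key to that rank.

```python
def rank_sort(S):
    """O(p)-step naive PRAM sort: processor j ranks its own key by comparing
    it against every other key, breaking ties by processor index."""
    p = len(S)
    R = [0] * p
    for k in range(1, p):                # p - 1 parallel rounds
        for j in range(p):               # conceptually parallel processors
            l = (j + k) % p
            if S[l] < S[j] or (S[l] == S[j] and l < j):
                R[j] += 1
    out = [None] * p
    for j in range(p):                   # final parallel write S[j] -> S[R[j]]
        out[R[j]] = S[j]
    return out
```

The tie-break on index makes the computed ranks a permutation, so the final writes go to distinct cells, as EREW requires.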
15 5.3 Semigroup or Fan-in Computation. EREW PRAM semigroup computation algorithm: Proc j, 0 <= j < p, copy X[j] into S[j]; s := 1; while s < p, Proc j, 0 <= j < p - s, do: S[j + s] := S[j] ⊗ S[j + s]; s := 2s; endwhile; Broadcast S[p - 1] to all processors. This algorithm is optimal for PRAM, but its speedup of O(p / log p) is not. If we use p processors on a list of size n = O(p log p), then optimal speedup can be achieved. Fig 5.4 Semigroup computation in EREW PRAM. Fig 5.5 Intuitive justification of why parallel slack helps improve the efficiency: lower degree of parallelism near the root, higher degree of parallelism near the leaves.
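The doubling loop above can be written out as a short Python simulation (hypothetical, not from the slides); the snapshot copy models the PRAM convention that all reads of a step happen before any writes.

```python
def semigroup(X, op):
    """EREW PRAM semigroup (fan-in) computation by recursive doubling;
    the overall result accumulates in S[p-1], which would then be broadcast."""
    S = list(X)
    p = len(S)
    s = 1
    while s < p:
        old = list(S)                # simulate simultaneous reads
        for j in range(p - s):       # conceptually parallel processors
            S[j + s] = op(old[j], old[j + s])
        s *= 2
    return S[p - 1]
```

Any associative operator works, e.g. addition or max.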
16 5.4 Parallel Prefix Computation. Same as the first part of semigroup computation (no final broadcasting). Fig 5.6 Parallel prefix computation in EREW PRAM via recursive doubling.
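Since the slide notes this is the semigroup computation without the final broadcast, the same doubling loop yields all prefixes at once; here is a hypothetical Python sketch returning the full array.

```python
def prefix_scan(X, op):
    """Inclusive parallel prefix via recursive doubling:
    on return, S[j] = X[0] ⊗ X[1] ⊗ ... ⊗ X[j]."""
    S = list(X)
    p = len(S)
    s = 1
    while s < p:
        old = list(S)                # all reads of a step precede all writes
        for j in range(p - s):       # conceptually parallel processors
            S[j + s] = op(old[j], old[j + s])
        s *= 2
    return S
```

With p processors this takes ceil(log2 p) parallel steps.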
17 A Divide-and-Conquer Parallel-Prefix Algorithm. Combine adjacent input pairs x_0 ⊗ x_1, x_2 ⊗ x_3, ..., then perform a parallel prefix computation of size n/2; each vertical line in the figure represents a location in shared memory. T(p) = T(p/2) + 2, so T(p) ≈ 2 log2 p. In hardware, this is the basis for the Brent-Kung carry-lookahead adder. Fig 5.7 Parallel prefix computation using a divide-and-conquer scheme.
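A recursive Python sketch of this scheme (my own rendering, under the assumption that adjacent pairs are combined, the half-size problem is solved, and even positions are then fixed up):

```python
def prefix_dc(x, op):
    """Divide-and-conquer (Brent-Kung style) parallel prefix: combine adjacent
    pairs, recurse on the n/2 pair-results, then fill in the even positions."""
    n = len(x)
    if n == 1:
        return list(x)
    pairs = [op(x[2 * i], x[2 * i + 1]) for i in range(n // 2)]
    sub = prefix_dc(pairs, op)       # prefixes ending at odd positions
    out = [None] * n
    out[0] = x[0]
    for i in range(n // 2):
        out[2 * i + 1] = sub[i]      # odd positions come from the recursion
        if 2 * i + 2 < n:            # even positions: previous odd prefix ⊗ x
            out[2 * i + 2] = op(sub[i], x[2 * i + 2])
    return out
```

The recursion depth matches the T(p) = T(p/2) + 2 recurrence: one combining step down, one fix-up step back up.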
18 Another Divide-and-Conquer Algorithm. Parallel prefix computation on the n/2 even-indexed inputs and, in parallel, on the n/2 odd-indexed inputs; each vertical line represents a location in shared memory. T(p) = T(p/2) + 1, so T(p) = log2 p. Strictly optimal algorithm, but requires commutativity. Fig 5.8 Another divide-and-conquer scheme for parallel prefix computation.
19 5.5 Ranking the Elements of a Linked List. Fig 5.9 Example linked list (info and next fields, head pointer, terminal element) and the ranks of its elements: for the list C, F, A, E, B, D, the ranks (distances from the terminal) are 5, 4, 3, 2, 1, 0, and the distance from head is the complementary count. List ranking appears to be hopelessly sequential; one cannot get to a list element except through its predecessor! Fig 5.10 PRAM data structures representing a linked list (info and next arrays, plus head) and the ranking results.
20 List Ranking via Recursive Doubling. Many problems that appear to be unparallelizable to the uninitiated are in fact parallelizable; intuition can be quite misleading! Fig 5.11 Element ranks initially and after each of the three pointer-jumping iterations, for the example list A, B, C, D, E, F.
21 PRAM List Ranking Algorithm. Question: Which PRAM submodel is implicit in this algorithm? PRAM list ranking algorithm (via pointer jumping): Processor j, 0 <= j < p, do {initialize the partial ranks}: if next[j] = j then rank[j] := 0 else rank[j] := 1; endif; while rank[next[head]] ≠ 0, Processor j, 0 <= j < p, do: rank[j] := rank[j] + rank[next[j]]; next[j] := next[next[j]]; endwhile. Answer: CREW. If we do not want to modify the original list, we simply make a copy of it first, in constant time.
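The pointer-jumping loop above can be simulated directly in Python (a hypothetical sketch; the terminal element points to itself, and snapshots model the synchronous parallel step):

```python
def list_rank(next_, head):
    """CREW PRAM list ranking via pointer jumping; rank[j] ends up as the
    distance from element j to the terminal element (which points to itself)."""
    p = len(next_)
    nxt = list(next_)                          # work on a copy of the list
    rank = [0 if nxt[j] == j else 1 for j in range(p)]
    while rank[nxt[head]] != 0:                # until head's pointer hits the end
        old_rank, old_nxt = list(rank), list(nxt)
        for j in range(p):                     # conceptually parallel processors
            rank[j] = old_rank[j] + old_rank[old_nxt[j]]
            nxt[j] = old_nxt[old_nxt[j]]
    return rank
```

Every element's pointer doubles its reach each iteration, so only about log2 p iterations are needed.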
22 5.6 Matrix Multiplication. Sequential multiplication of m × m matrices: for i = 0 to m - 1 do; for j = 0 to m - 1 do; t := 0; for k = 0 to m - 1 do: t := t + a_ik b_kj; endfor; c_ij := t; endfor; endfor. PRAM solution with m^3 processors: each processor does one multiplication (not very efficient); c_ij := sum over k = 0 to m - 1 of a_ik b_kj.
23 PRAM Matrix Multiplication with m^2 Processors. PRAM matrix multiplication using m^2 processors: Proc (i, j), 0 <= i, j < m, do begin; t := 0; for k = 0 to m - 1 do: t := t + a_ik b_kj; endfor; c_ij := t; end. Processors are numbered (i, j) instead of 0 to m^2 - 1. Θ(m) steps: time-optimal. The CREW model is implicit. Fig 5.12 PRAM matrix multiplication; p = m^2 processors.
24 PRAM Matrix Multiplication with m Processors. PRAM matrix multiplication using m processors: for j = 0 to m - 1, Proc i, 0 <= i < m, do: t := 0; for k = 0 to m - 1 do: t := t + a_ik b_kj; endfor; c_ij := t; endfor. Θ(m^2) steps: time-optimal. The CREW model is implicit. Because the order of multiplications is immaterial, accesses to B can be skewed to allow the EREW model.
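The skewing remark can be illustrated with a hypothetical Python sketch: each "processor" i starts its inner product at column offset i, so in any given step the m processors touch m distinct rows of B (EREW-friendly); since addition is commutative, the result is unchanged.

```python
def matmul_skewed(A, B):
    """m-processor PRAM-style matrix multiply (one processor per row of C),
    with accesses to B skewed by the processor index i."""
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for j in range(m):                   # sequential over columns of C
        for i in range(m):               # conceptually parallel processors
            t = 0
            for kk in range(m):
                k = (kk + i) % m         # skewed access order; the sum is unchanged
                t += A[i][k] * B[k][j]
            C[i][j] = t
    return C
```

At sub-step kk, processor i reads B[(kk + i) mod m][j]; the row indices are all distinct, so no two processors read the same cell.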
25 PRAM Matrix Multiplication with Fewer Processors. The algorithm is similar, except that each of the p processors is in charge of computing m/p rows of C. Θ(m^3/p) steps: time-optimal; the EREW model can be used. A drawback of all algorithms thus far is that only two arithmetic operations (one multiplication and one addition) are performed for each memory access. This is particularly costly for NUMA shared-memory machines.
26 More Efficient Matrix Multiplication (for NUMA). Partition the matrices into p square blocks of size q × q, where q = m/√p; one processor computes those elements of C that it holds in local memory. Block matrix multiplication follows the same algorithm as simple matrix multiplication; for example, [A B; C D] × [E F; G H] = [AE+BG AF+BH; CE+DG CF+DH]. Block-band i of the left matrix times block-band j of the right matrix yields block (i, j) of the product. Fig 5.13 Partitioning the matrices for block matrix multiplication.
27 Details of Block Matrix Multiplication. Processor (i, j) multiplies an element of block (i, k) in matrix A by one block-row of block (k, j) in matrix B, adding the products into one block-row of block (i, j) in matrix C. A multiply-add computation on q × q blocks needs 2q^2 = 2m^2/p memory accesses and 2q^3 arithmetic operations; so q arithmetic operations are done per memory access. Fig 5.14 How Processor (i, j) operates on an element of A and one block-row of B to update one block-row of C.
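A hypothetical Python sketch of block matrix multiplication (function name and loop structure are mine): each (bi, bj) block of C plays the role of one processor's locally held block, accumulated over the √p = m/q blocks along the shared dimension.

```python
def block_matmul(A, B, q):
    """Block matrix multiplication with q x q blocks; m must be divisible by q.
    Each block-product does roughly q arithmetic ops per memory access."""
    m = len(A)
    assert m % q == 0
    nb = m // q                              # sqrt(p): blocks per dimension
    C = [[0] * m for _ in range(m)]
    for bi in range(nb):                     # block (bi, bj) ~ one processor
        for bj in range(nb):
            for bk in range(nb):             # accumulate over block column/row
                for i in range(bi * q, (bi + 1) * q):
                    for j in range(bj * q, (bj + 1) * q):
                        s = 0
                        for k in range(bk * q, (bk + 1) * q):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C
```

On a real NUMA machine the payoff comes from each q × q block being fetched once and reused q times.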
28 6A More Shared-Memory Algorithms. Develop PRAM algorithms for more complex problems: searching, selection, sorting, and other nonnumerical tasks; background on the problem must be presented in some cases. Topics in This Chapter: 6A.1 Parallel Searching Algorithms; 6A.2 Sequential Rank-Based Selection; 6A.3 A Parallel Selection Algorithm; 6A.4 A Selection-Based Sorting Algorithm; 6A.5 Alternative Sorting Algorithms; 6A.6 Convex Hull of a 2D Point Set.
29 6A.1 Parallel Searching Algorithms. Searching an unordered list in PRAM. Sequential time: n worst-case, n/2 on average. Divide the list of n items into p segments of ceil(n/p) items (the last segment may have fewer). Processor i will be in charge of ceil(n/p) list elements, beginning at address i ceil(n/p). Parallel time: ceil(n/p) worst-case, ??? on average. Perfect speed-up of p with p processors? There are pre- and postprocessing overheads.
30 Parallel (p + 1)-ary Search on PRAM. p probes, rather than 1, per step: log_(p+1)(n + 1) = log2(n + 1) / log2(p + 1) = Θ(log n / log p) steps. Speedup ≈ log2 p. Optimal: no comparison-based search algorithm can be faster. A single search in a sorted list cannot be significantly speeded up through parallel processing, but all hope is not lost: dynamic data (sorting overhead), batch searching (multiple lookups). The figure steps through an example in which processors P0, P1, P2 probe the sorted list in each step.
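A hypothetical Python sketch of (p + 1)-ary search (my own rendering): each step, p evenly spaced probes split the remaining range into p + 1 parts, and the comparison outcomes select the one sub-range that can contain the target.

```python
def pary_search(A, target, p):
    """(p+1)-ary search on a sorted list A using p probes per step;
    returns the index of target, or -1 if absent."""
    lo, hi = 0, len(A) - 1                       # inclusive search range
    while lo <= hi:
        size = hi - lo + 1
        # p conceptually-parallel probes at evenly spaced positions
        probes = sorted({min(lo + size * (i + 1) // (p + 1), hi)
                         for i in range(p)})
        for q in probes:
            if A[q] == target:
                return q
        new_lo, new_hi = lo, hi                  # narrow to the bracketed part
        for q in probes:
            if A[q] < target:
                new_lo = q + 1
            else:
                new_hi = min(new_hi, q - 1)
        lo, hi = new_lo, new_hi
    return -1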
31 6A.2 Sequential Rank-Based Selection. Selection: find the (or a) kth smallest among n elements; for example, the 5th smallest element of a given list. Naive solution through sorting: O(n log n) time. But a linear-time sequential algorithm can be developed: divide the list into n/q sublists of q elements, and let m be the median of the sublist medians. Then at least n/4 elements are guaranteed to be <= m and at least n/4 to be >= m, so with L, E, G the elements less than, equal to, and greater than m, we have max(|L|, |G|) <= 3n/4; the search continues in L (if k <= |L|) or in G (if k > |L| + |E|).
32 Linear-Time Sequential Selection Algorithm. Sequential rank-based selection algorithm select(S, k):
1. if |S| < q {q is a small constant} then sort S and return the kth smallest element of S; else divide S into |S|/q subsequences of size q, sort each subsequence and find its median, and let the |S|/q medians form the sequence T; endif — O(n)
2. m = select(T, |T|/2) {find the median m of the |S|/q medians} — T(n/q)
3. Create subsequences L: elements of S that are < m; E: elements of S that are = m; G: elements of S that are > m — O(n)
4. if |L| >= k then return select(L, k); else if |L| + |E| >= k then return m; else return select(G, k - |L| - |E|); endif — T(3n/4), to be justified: since m is the median of the medians, at least n/4 elements are <= m and at least n/4 are >= m, so max(|L|, |G|) <= 3n/4.
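The select(S, k) routine above translates almost line-for-line into Python; this is a hypothetical sketch of the classic median-of-medians selection, with k 1-indexed as in the slide.

```python
def select(S, k, q=5):
    """Linear-time rank-based selection: return the k-th smallest (1-indexed)
    element of S, using medians of groups of q (q >= 5 for linearity)."""
    if len(S) < q:                                   # small case: just sort
        return sorted(S)[k - 1]
    # medians of the ~|S|/q subsequences of size q
    T = [sorted(S[i:i + q])[len(S[i:i + q]) // 2] for i in range(0, len(S), q)]
    m = select(T, (len(T) + 1) // 2, q)              # median of the medians
    L = [x for x in S if x < m]
    E = [x for x in S if x == m]
    G = [x for x in S if x > m]
    if k <= len(L):
        return select(L, k, q)
    if k <= len(L) + len(E):
        return m
    return select(G, k - len(L) - len(E), q)
```

The guarantee max(|L|, |G|) <= 3n/4 is what bounds the last recursive call by T(3n/4).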
33 Algorithm Complexity and Examples. T(n) = T(n/q) + T(3n/4) + cn. We must have q >= 5; for q = 5, the solution is T(n) = 20cn, i.e., linear time. Worked examples (see slide figures): S is split into n/q sublists of q elements, the sublist medians form T, m is the median of T, and S is partitioned into L, E, G. In the first example, |L| = 7, so to find the 5th smallest element in S we select the 5th smallest element in L. In the second example the answer is read off directly: the 9th smallest element of S is m itself, and a larger-rank element is found by selecting the 4th smallest element in G.
34 6A.3 A Parallel Selection Algorithm. Parallel rank-based selection algorithm PRAMselect(S, k, p), with p = O(n^(1-x)):
1. if |S| < 4 then sort S and return the kth smallest element of S; else broadcast |S| to all p processors, divide S into p subsequences S(j) of size |S|/p, and have Processor j, 0 <= j < p, compute T_j := select(S(j), |S(j)|/2); endif — O(n^x)
2. m = PRAMselect(T, |T|/2, p) {median of the medians} — T(n^(1-x), p)
3. Broadcast m to all processors and create subsequences L: elements of S that are < m; E: elements of S that are = m; G: elements of S that are > m — O(n^x)
4. if |L| >= k then return PRAMselect(L, k, p); else if |L| + |E| >= k then return m; else return PRAMselect(G, k - |L| - |E|, p); endif — T(3n/4, p); as before, m is the median of the medians, so max(|L|, |G|) <= 3n/4.
35 Algorithm Complexity and Efficiency. T(n, p) = T(n^(1-x), p) + T(3n/4, p) + cn^x, with p = O(n^(1-x)). The solution is O(n^x); verify by substitution. Speedup = Θ(n)/O(n^x) = Ω(n^(1-x)) = Ω(p). Efficiency = Θ(1). Work(n, p) = pT(n, p) = Θ(n^(1-x)) Θ(n^x) = Θ(n). What happens if we set x to 1? (i.e., use one processor) T(n, 1) = O(n^x) = O(n). What happens if we set x to 0? (i.e., use n processors) T(n, n) = O(n^x) = O(1)? No, because in our asymptotic analysis we ignored several O(log n) terms compared with O(n^x) terms.
36 Data Movement in Step 3 of the Algorithm. Each of the p = n^(1-x) processors holds n^x items: processor i contributes a_i items that are < m (going to sublist L), b_i items that are = m (going to E), and c_i items that are > m (going to G). Consider the sublist L: processor 0 starts storing its a_0 items at location 0, processor 1 at location a_0, processor 2 at location a_0 + a_1, processor 3 at location a_0 + a_1 + a_2, and so on; the starting locations are obtained with a parallel prefix computation on the a_i.
37 6A.4 A Selection-Based Sorting Algorithm. Parallel selection-based sort PRAMselectionsort(S, p), with p = n^(1-x) and k = 2^(1/x):
1. if |S| < k then return quicksort(S) — O(1)
2. for i = 1 to k - 1 do: m_i := PRAMselect(S, i|S|/k, p) {let m_0 := -∞; m_k := +∞}; endfor — O(n^x)
3. for i = 0 to k - 1 do: make the sublist T(i) from the elements of S in (m_i, m_(i+1)); endfor — O(n^x)
4. for i = 0 to k/2 - 1 do in parallel: PRAMselectionsort(T(i), 2p/k) {2p/k processors used for each of the k/2 subproblems}; endfor — T(n/k, 2p/k)
5. for i = k/2 to k - 1 do in parallel: PRAMselectionsort(T(i), 2p/k); endfor — T(n/k, 2p/k)
Fig 6.1 Partitioning of the sorted list, using the thresholds m_1, ..., m_(k-1), into k sublists of about n/k elements each for selection-based sorting.
38 Algorithm Complexity and Efficiency. T(n, p) = 2T(n/k, 2p/k) + cn^x. The solution is O(n^x log n); verify by substitution. Speedup(n, p) = Θ(n log n)/O(n^x log n) = Ω(n^(1-x)) = Ω(p). Efficiency = speedup / p = Θ(1). Work(n, p) = pT(n, p) = Θ(n^(1-x)) Θ(n^x log n) = Θ(n log n). What happens if we set x to 1? (i.e., use one processor) T(n, 1) = O(n^x log n) = O(n log n). Remember p = O(n^(1-x)). Our asymptotic analysis is valid for x > 0 but not for x = 0; i.e., PRAMselectionsort cannot sort p keys in optimal O(log p) time.
39 Example of Parallel Sorting. For an input list S of n = 25 keys, the threshold values for k = 4 (i.e., x = 1/2 and p = n^(1/2) = 5 processors) are found by three selections at ranks n/k ≈ 6, 2n/k ≈ 13, and 3n/k ≈ 19: m_1 = PRAMselect(S, 6, 5), m_2 = PRAMselect(S, 13, 5), m_3 = PRAMselect(S, 19, 5), with m_0 = -∞ and m_4 = +∞. S is then partitioned into the sublists T(0) through T(3), each of which is sorted recursively.
40 6A.5 Alternative Sorting Algorithms. Sorting via random sampling (assume p << n): given a large list S of inputs, a random sample of the elements can be used to find the k comparison thresholds. It is easier if we pick k = p, so that each of the resulting subproblems is handled by a single processor. Parallel randomized sort PRAMrandomsort(S, p):
1. Processor j, 0 <= j < p, pick |S|/p^2 random samples of its |S|/p elements and store them in its corresponding section of a list T of length |S|/p
2. Processor 0 sort the list T {the comparison threshold m_i is the (i|S|/p^2)th element of T}
3. Processor j, 0 <= j < p, store its elements falling in (m_i, m_(i+1)) into T(i)
4. Processor j, 0 <= j < p, sort the sublist T(j)
41 Parallel Radixsort. In the binary version of radixsort, we examine every bit of the k-bit keys in turn, starting from the LSB. In Step i, bit i is examined, 0 <= i < k; records are stably sorted by the value of the ith key bit. The slide's example sorts a list of 3-bit keys (values such as 5 = 101, 7 = 111, 4 = 100, 2 = 010) first by LSB, then by the middle bit, then by the MSB. Question: How are the data movements performed?
42 Data Movements in Parallel Radixsort. For the current bit position, prefix sums of the complemented bit values give the destination indices of the keys whose bit is 0; diminished (exclusive) prefix sums of the bit values, shifted up by the total count of 0 keys, give the destinations of the keys whose bit is 1. The running time consists mainly of the time to perform k parallel prefix computations: O(log p) for k constant.
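The destination computation can be sketched in Python (a hypothetical sequential simulation; the two running counters stand in for the parallel prefix computations described above):

```python
def radix_sort_by_prefix(keys, bits):
    """Binary radixsort: each pass stably permutes the keys by bit i, using
    prefix counts of 0-bits and 1-bits to compute destination indices."""
    for i in range(bits):
        bit = [(x >> i) & 1 for x in keys]
        dest = [0] * len(keys)
        zeros_before = 0                     # prefix sums of complemented bits
        for j, b in enumerate(bit):
            if b == 0:
                dest[j] = zeros_before
                zeros_before += 1
        ones_before = 0                      # diminished prefix sums of the bits,
        for j, b in enumerate(bit):          # shifted by the count of 0-keys
            if b == 1:
                dest[j] = zeros_before + ones_before
                ones_before += 1
        out = [None] * len(keys)
        for j, x in enumerate(keys):         # conceptually parallel writes
            out[dest[j]] = x
        keys = out
    return keys
```

Each pass is stable because keys with equal bits keep their relative order, which is what makes LSB-first radixsort correct.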
43 6A.6 Convex Hull of a 2D Point Set. Best sequential algorithm for p points: Ω(p log p) steps. Fig 6.2 Defining the convex hull problem: a point set Q in the x-y plane and its hull CH(Q). Fig 6.3 Illustrating the properties of the convex hull: all points fall on one side of the line through each hull edge, and the hull vertices appear in sorted angular order.
44 PRAM Convex Hull Algorithm. Parallel convex hull algorithm PRAMconvexhull(S, p):
1. Sort the point set by x coordinates
2. Divide the sorted list into √p subsets Q(i) of size √p, 0 <= i < √p
3. Find the convex hull of each subset Q(i) using √p processors
4. Merge the √p convex hulls CH(Q(i)) into the overall hull CH(Q)
The hull consists of an upper hull and a lower hull. Fig 6.4 Multiway divide and conquer for the convex hull problem.
45 Merging of Partial Convex Hulls. Tangent lines between the partial hulls CH(Q(i)), CH(Q(j)), CH(Q(k)) determine which points survive: either (a) no point of CH(Q(i)) is on CH(Q), or (b) the points of CH(Q(i)) from A to B are on CH(Q). Tangent lines are found through binary search in log time. Analysis: T(p, p) = T(p^(1/2), p^(1/2)) + c log p ≈ 2c log p. The initial sorting also takes O(log p) time. Fig 6.5 Finding points in a partial hull that belong to the combined hull.
46 6B Implementation of Shared Memory. Main challenge: easing the memory access bottleneck by providing low-latency, high-bandwidth paths to memory, reducing the need for access to nonlocal memory, and reducing conflicts and sensitivity to memory latency. Topics in This Chapter: 6B.1 Processor-Memory Interconnection; 6B.2 Multistage Interconnection Networks; 6B.3 Cache Coherence Protocols; 6B.4 Data Allocation for Conflict-Free Access; 6B.5 Distributed Shared Memory; 6B.6 Methods for Memory Latency Hiding.
47 About the New Chapter 6B. This new chapter incorporates material from the following existing sections of the book: Some Implementation Aspects; Dimension-Order Routing; Plus-or-Minus-2^i Network; Multistage Interconnection Networks; Distributed Shared Memory; Data Access Problems and Caching; Cache Coherence Protocols; Multithreading and Latency Hiding; Coordination and Synchronization.
48 Making PRAM Practical. PRAM needs a read and a write access to memory in every cycle. Even for a sequential computer, memory is tens to hundreds of times slower than arithmetic/logic operations; multiple processors accessing a shared memory only make the situation worse. Shared access to a single large physical memory isn't scalable. Strategies and focal points for making PRAM practical: 1. Make memory accesses faster and more efficient (pipelining); 2. Reduce the number of memory accesses (caching, reordering of accesses so as to do more computation per item fetched/stored); 3. Reduce synchronization, so that slow memory accesses for one computation do not slow down others (synch, memory consistency); 4. Distribute the memory and data to make most accesses local; 5. Store data structures so as to reduce access conflicts (skewed storage).
49 6B.1 Processor-Memory Interconnection. Fig 4.3 A parallel processor with global (shared) memory: processors and memory modules linked by a processor-to-processor network and a processor-to-memory network, with parallel I/O. Options for the processor-to-memory network: bus(es) (bottleneck); crossbar (complex, expensive); MIN.
50 Processor-to-Memory Network. Crossbar switches offer full permutation capability (they are nonblocking), but are complex and expensive: O(p^2) cost. Even with a permutation network, full PRAM functionality is not realized: two processors cannot access different addresses in the same memory module. Practical processor-to-memory networks cannot realize all permutations (they are blocking). The figure shows an 8 × 8 crossbar switch connecting P0-P7 to M0-M7.
51 Bus-Based Interconnections. Single-bus system: bandwidth bottleneck; bus loading limit; scalability very poor; single failure point; conceptually simple; forced serialization. Multiple-bus system: bandwidth improved; bus loading limit; scalability poor; more robust; more complex scheduling; simple serialization.
52 Back-of-the-Envelope Bus Bandwidth Calculation. Single-bus system: bus frequency 0.5 GHz; data width 256 b (32 B); memory access takes 2 bus cycles; so peak bandwidth = (0.5 G / 2) × 32 B = 8 GB/s. (Bus cycle = 2 ns; memory cycle = 100 ns; so 1 memory cycle = 50 bus cycles.) Multiple-bus system: peak bandwidth is multiplied by the number of buses (actual bandwidth is likely to be much less).
53 Hierarchical Bus Interconnection. Buses at different levels are linked by bus switches (gateways); low-level clusters can be combined into a heterogeneous system. Fig 4.9 Example of a hierarchical interconnection architecture.
54 Removing the Processor-to-Memory Bottleneck. Give each processor a private cache between it and the processor-to-memory network. Challenge: cache coherence. Fig 4.4 A parallel processor with global memory and processor caches.
55 Why Data Caching Works. Hit rate r (fraction of memory accesses satisfied by cache): C_eff = C_fast + (1 - r) C_slow. Cache parameters: size; block length (line width); placement policy; replacement policy; write policy. Figure: data storage and access in a two-way set-associative cache; the address is split into tag, index, and word offset, the two candidate words (one per placement option) and their tags are read out, the tags are compared, and a match selects the data out (no match signals a cache miss).
56 Benefits of Caching Formulated as Amdahl's Law. Hit rate r (fraction of memory accesses satisfied by cache): C_eff = C_fast + (1 - r) C_slow, so S = C_slow / C_eff = 1 / [(1 - r) + C_fast/C_slow]. (Analogy from Parhami's Computer Architecture text, 2005: register file = access desktop in 2 s; cache memory = access drawer in 5 s; main memory = access cabinet in 30 s.) This corresponds to the miss-rate fraction 1 - r of accesses being unaffected and the hit-rate fraction r (almost 1) being speeded up by a factor C_slow/C_fast. Generalized form of Amdahl's speedup formula: S = 1/(f_1/p_1 + f_2/p_2 + ... + f_m/p_m), with f_1 + f_2 + ... + f_m = 1. In this case, a fraction 1 - r is slowed down by a factor (C_slow + C_fast)/C_slow, and a fraction r is speeded up by a factor C_slow/C_fast.
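The two equivalent forms of the speedup formula above can be checked numerically; this is a hypothetical helper (the function name is mine, and the numbers in the test are illustrative, not from the slides).

```python
def cache_speedup(r, c_fast, c_slow):
    """Speedup from caching: S = C_slow / C_eff,
    where C_eff = C_fast + (1 - r) * C_slow."""
    c_eff = c_fast + (1 - r) * c_slow
    return c_slow / c_eff
```

Dividing numerator and denominator by C_slow recovers the Amdahl-style form 1 / [(1 - r) + C_fast/C_slow], which the test below verifies.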
57 6B.2 Multistage Interconnection Networks. The vector-parallel Cray Y-MP computer. Fig 21.5 Key elements of the Cray Y-MP processor (address registers, address function units, instruction buffers, and control not shown): central memory; inter-processor communication; input/output; vector registers V0-V7 (64 elements of 64 bits each); T registers (8 64-bit); S registers (8 64-bit); vector integer units (add, shift, logic, weight/parity); floating-point units (add, multiply, reciprocal approximation); scalar integer units (add, shift, logic, weight/parity).
58 Cray Y-MP's Interconnection Network. Fig 21.6 The processor-to-memory interconnection network of the Cray Y-MP: the 8 processors P0-P7 connect through crossbar sections and subsections to 256 interleaved memory banks, numbered 0 through 255 and grouped by address modulo 4.
59 Butterfly Processor-to-Memory Network. Fig 6.9 Example of a multistage memory access network: p processors, log2(p) columns of 2-by-2 switches, p memory banks. Two ways to use the butterfly: with 1 × 2 and 2 × 1 edge switches, or with 2 × 2 switches throughout. It is not a full permutation network (e.g., a processor cannot be connected to the memory bank alongside the two connections shown in the figure). It is self-routing: the bank address determines the route, with each address bit selecting the upper or lower output at one switch stage (e.g., a request routed lower-upper-upper).
60 Butterfly as Multistage Interconnection Network. Fig 6.9 Example of a multistage memory access network (p processors, log2(p) columns of 2-by-2 switches, p memory banks), versus a butterfly network with log2(p) + 1 columns of 2-by-2 switches used to connect modules that are on the same side. Generalization of the butterfly network: high-radix or m-ary butterfly, built of m × m switches; it has m^q rows and q + 1 columns (q if wrapped).
61 Self-Routing on a Butterfly Network. Fig 14.6 Example dimension-order routing paths (dimensions 0, 1, 2; ascend, then descend). The number of cross links taken equals the length of the path in the hypercube. From node 3 to 6: routing tag = 011 XOR 110 = 101 = cross-straight-cross. From node 3 to 5: routing tag = 011 XOR 101 = 110 = straight-cross-cross. From node 6 to 1: routing tag = 110 XOR 001 = 111 = cross-cross-cross.
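The routing-tag computation above is just a bitwise XOR; here is a hypothetical one-liner helper (names are mine) that reproduces the three examples on the slide.

```python
def routing_tag(src, dst, dims):
    """Dimension-order routing tag: bit d of src XOR dst says whether to take
    the cross (1) or straight (0) link in dimension d (LSB = dimension 0)."""
    tag = src ^ dst
    return ['cross' if (tag >> d) & 1 else 'straight' for d in range(dims)]
```

Reading the list from dimension 0 upward gives the per-stage decisions of the self-routing path.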
62 Butterfly Is Not a Permutation Network. Fig 14.7 Packing is a good routing problem for dimension-order routing on the hypercube. Fig 14.8 Bit-reversal permutation is a bad routing problem for dimension-order routing on the hypercube.
63 Structure of Butterfly Networks. Switching two row pairs converts the dimension-permuted network back to the original butterfly network; changing the order of stages in a butterfly is thus equivalent to a relabeling of the rows (in this example, row xyz becomes row xzy). Fig 15.5 Butterfly network with permuted dimensions. The 16-row butterfly network is also shown.
64 Beneš Network: p processors, 2log₂p + 1 columns of 2-by-2 switches, p memory banks
Fig 15.9 Beneš network formed from two back-to-back butterflies
A 2^q-row Beneš network can route any 2^q × 2^q permutation: it is rearrangeable
65 Routing Paths in a Beneš Network
To which memory modules can we connect proc 4 without rearranging the other paths? What about proc 6?
2^(q+1) inputs and outputs; 2^q rows, 2q + 3 columns
Fig 15.10 Another example of a Beneš network
66 Augmented Data Manipulator Network
The data manipulator network was used in the Goodyear MPP, an early SIMD parallel machine
Augmented means that switches in a column are independent, as opposed to all being set to the same state (simplified control)
2^q rows, q + 1 columns
Fig 15.12 Augmented data manipulator network
67 Fat Trees: processors P1 through P8 at the leaves
Fig 15.6 Two representations of a fat tree Front view: binary tree Side view: inverted binary tree
Skinny tree?
Fig 15.7 Butterfly network redrawn as a fat tree
68 The Sea of Indirect Interconnection Networks
Numerous indirect or multistage interconnection networks (MINs) have been proposed for, or used in, parallel computers They differ in topological, performance, robustness, and realizability attributes We have already seen the butterfly, hierarchical bus, Beneš, and ADM networks
Fig 4.8 (modified) The sea of indirect interconnection networks
69 Self-Routing Permutation Networks
Do there exist self-routing permutation networks? (The butterfly network is self-routing, but it is not a permutation network)
Permutation routing through a MIN is the same problem as sorting: sort by MSB, then by the middle bit, then by LSB
Fig 16.4 Example of sorting on a binary radix sort network
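The sorting view of permutation routing can be illustrated in software with a stable binary radix sort of packets keyed by destination address (an analogue only: it sorts LSB first, whereas the network's columns sort MSB first; the function and variable names are mine):

```python
def radix_sort_route(packets, addr_bits):
    """Deliver (dest, payload) packets by stable binary radix sort on dest.

    Each pass splits stably on one destination bit, least-significant
    bit first; after addr_bits passes every packet sits at its
    destination port -- the software analogue of routing a permutation
    through a sorting network.
    """
    for b in range(addr_bits):
        zeros = [p for p in packets if not (p[0] >> b) & 1]
        ones = [p for p in packets if (p[0] >> b) & 1]
        packets = zeros + ones          # stable split on bit b
    return packets

# One packet per input port, destinations forming a permutation of 0..7
perm = [(6, "a"), (1, "b"), (4, "c"), (7, "d"),
        (0, "e"), (3, "f"), (5, "g"), (2, "h")]
routed = radix_sort_route(perm, 3)
print([dest for dest, _ in routed])     # destinations now in order 0..7
```

Stability of each pass is what makes the multi-pass sort correct, mirroring the requirement that each network stage preserve the partial order established by earlier stages.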
70 Partial List of Important MINs
Augmented data manipulator (ADM): aka unfolded PM2I network (Fig 15.12)
Banyan: Any MIN with a unique path between any input and any output (eg butterfly)
Baseline: Butterfly network with nodes labeled differently
Beneš: Back-to-back butterfly networks, sharing one column (Figs 15.9-15.11)
Bidelta: A MIN that is a delta network in either direction
Butterfly: aka unfolded hypercube (Figs 6.9, 15.4-15.5)
Data manipulator: Same as ADM, but with switches in a column restricted to the same state
Delta: Any MIN for which the outputs of each switch have distinct labels (say 0 and 1 for 2-by-2 switches) and the path label, composed by concatenating the switch output labels leading from an input to an output, depends only on the output
Flip: Reverse of the omega network (inputs and outputs interchanged)
Indirect cube: Same as butterfly or omega
Omega: Multistage shuffle-exchange network; isomorphic to the butterfly (Fig 15.19)
Permutation: Any MIN that can realize all permutations
Rearrangeable: Same as permutation network
Reverse baseline: Baseline network, with the roles of inputs and outputs interchanged
71 6B Cache Coherence Protocols
Processors 0 to p−1 with private caches, a processor-to-processor network, a processor-to-memory network, memory modules, and parallel I/O
Fig 8 Various types of cached data blocks in a parallel processor with global memory and processor caches: multiple consistent, single consistent, single inconsistent, invalid
72 Example: A Bus-Based Snoopy Protocol
Each transition is labeled with the event that triggers it, followed by the action(s) that must be taken
Exclusive (read/write) state:
 CPU read hit, CPU write hit: no action
 CPU read miss: Write back the block, put read miss on bus
 CPU write miss: Write back the block, put write miss on bus
 Bus read miss for this block: Write back the block
 Bus write miss for this block: Write back the block
Shared (read-only) state:
 CPU read hit: no action
 CPU read miss: Put read miss on bus
 CPU write hit/miss: Put write miss on bus
Invalid state:
 CPU read miss: Put read miss on bus
 CPU write miss: Put write miss on bus
 Bus write miss for this block: stay invalid
Fig 8 Finite-state control mechanism for a bus-based snoopy cache coherence protocol
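The transition labels above can be captured as a table keyed by (state, event). The state assignments below are my reading of the slide's scrambled edge labels, following the standard three-state (MSI-style) snoopy protocol; action names are mine:

```python
# (state, event) -> (next state, bus actions); my reconstruction of the FSM
PROTOCOL = {
    ("Exclusive", "cpu_read_hit"):   ("Exclusive", []),
    ("Exclusive", "cpu_write_hit"):  ("Exclusive", []),
    ("Exclusive", "cpu_read_miss"):  ("Shared",    ["write_back", "bus_read_miss"]),
    ("Exclusive", "cpu_write_miss"): ("Exclusive", ["write_back", "bus_write_miss"]),
    ("Exclusive", "bus_read_miss"):  ("Shared",    ["write_back"]),
    ("Exclusive", "bus_write_miss"): ("Invalid",   ["write_back"]),
    ("Shared", "cpu_read_hit"):      ("Shared",    []),
    ("Shared", "cpu_read_miss"):     ("Shared",    ["bus_read_miss"]),
    ("Shared", "cpu_write_hit"):     ("Exclusive", ["bus_write_miss"]),
    ("Shared", "cpu_write_miss"):    ("Exclusive", ["bus_write_miss"]),
    ("Shared", "bus_write_miss"):    ("Invalid",   []),
    ("Invalid", "cpu_read_miss"):    ("Shared",    ["bus_read_miss"]),
    ("Invalid", "cpu_write_miss"):   ("Exclusive", ["bus_write_miss"]),
}

def step(state, event):
    """Apply one protocol transition; returns (new state, actions)."""
    return PROTOCOL[(state, event)]

# A block starts Invalid; this CPU reads it, then writes it:
s, acts = step("Invalid", "cpu_read_miss")   # -> Shared, read miss on bus
s, acts = step(s, "cpu_write_hit")           # -> Exclusive, write miss on bus
print(s)
```

Driving every cache's copy of this table from the bus events is exactly what keeps the caches coherent: a write miss broadcast on the bus forces all other copies to Invalid.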
73 Implementing a Snoopy Protocol
A second tags/state storage unit allows snooping to be done concurrently with normal cache operation
Getting all the implementation timing and details right is nontrivial
Duplicate tags and state store for the snoop side; main tags and state store for the processor side; snoop-side and processor-side cache control; cache data array; address/command buffers; system bus
Fig 27.7 of Parhami's Computer Architecture text
74 Scalable (Distributed) Shared Memory
Processors 0 to p−1, each with a local memory, connected by an interconnection network; parallel I/O
Some terminology:
NUMA Nonuniform memory access (distributed shared memory)
UMA Uniform memory access (global shared memory)
COMA Cache-only memory architecture
Fig 4.5 A parallel processor with distributed memory
75 Example: A Directory-Based Protocol
Exclusive (read/write) state:
 Read miss: Fetch data value, return data value, sharing set = sharing set + {c} (correction to text, which has sharing set = {c})
 Write miss: Fetch data value, request invalidation, return data value, sharing set = {c}
 Data write-back: Sharing set = { }
Shared (read-only) state:
 Read miss: Return data value, sharing set = sharing set + {c}
 Write miss: Invalidate, sharing set = {c}, return data value
Uncached state:
 Read miss: Return data value, sharing set = {c}
 Write miss: Return data value, sharing set = {c}
Fig 84 States and transitions for a directory entry in a directory-based coherence protocol (c denotes the cache sending the message)
76 Implementing a Directory-Based Protocol
Sharing set implemented as a bit-vector (simple, but not scalable)
When there are many more nodes (caches) than the typical size of a sharing set, a list of sharing units may be maintained in the directory
The sharing set can be maintained as a distributed doubly linked list through the caches, with a head pointer in memory; coherent data blocks carry list links, noncoherent blocks do not (will be discussed in Section 86 in connection with the SCI standard)
77 6B4 Data Allocation for Conflict-Free Access
Try to store the data such that parallel accesses are to different banks
For many data structures, a compiler may perform the memory mapping
Column-major storage of a 6-by-6 matrix in 6 modules: module j holds matrix column j, ie elements 0,j through 5,j
Each matrix column is stored in a different memory module (bank) Accessing a column leads to conflicts
Fig 66 Matrix storage in column-major order to allow concurrent accesses to rows
78 Skewed Storage Format
Element i,j of the 6-by-6 matrix is stored in module (i + j) mod 6, so the six elements of any row and the six elements of any column fall in six different modules
Fig 67 Skewed matrix storage for conflict-free accesses to rows and columns
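The contrast between column-major and skewed storage can be checked in a few lines of Python (a sketch; the helper names are mine, and the skewing formula (i + j) mod B is my reading of the figure):

```python
B = 6  # six memory banks holding one 6-by-6 matrix

def touched(mapping, access):
    """Banks hit by an access pattern under an element->bank mapping."""
    return [mapping(i, j) for (i, j) in access]

col_major = lambda i, j: j             # matrix column j lives in bank j
skewed    = lambda i, j: (i + j) % B   # skewed storage

row0 = [(0, j) for j in range(B)]      # access one matrix row
col0 = [(i, 0) for i in range(B)]      # access one matrix column

print(touched(col_major, row0))        # six distinct banks: conflict-free
print(touched(col_major, col0))        # all bank 0: fully serialized
print(touched(skewed, row0))           # six distinct banks
print(touched(skewed, col0))           # six distinct banks
```

An access pattern proceeds conflict-free exactly when the list of touched banks has no repeats; skewing removes the column conflict without reintroducing a row conflict.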
79 A Unified Theory of Conflict-Free Access
A_ij is viewed as vector element i + jm
Fig 68 A 6 × 6 matrix viewed, in column-major order, as a 36-element vector
A q-D array can be viewed as a vector, with row / column accesses associated with constant strides
Column: k, k+1, k+2, k+3, k+4, k+5 Stride = 1
Row: k, k+m, k+2m, k+3m, k+4m, k+5m Stride = m
Diagonal: k, k+m+1, k+2(m+1), k+3(m+1), k+4(m+1), k+5(m+1) Stride = m + 1
Antidiagonal: k, k+m−1, k+2(m−1), k+3(m−1), k+4(m−1), k+5(m−1) Stride = m − 1
80 Linear Skewing Schemes
A_ij is viewed as vector element i + jm
Fig 68 A 6 × 6 matrix viewed, in column-major order, as a 36-element vector
Place vector element i in memory bank (a + bi) mod B (word address within bank is irrelevant to conflict-free access; also, a can be set to 0)
With a linear skewing scheme, vector elements k, k + s, k + 2s, ..., k + (B − 1)s will be assigned to different memory banks iff bs is relatively prime with respect to the number B of memory banks
A prime value for B ensures this condition, but is not very practical
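The coprimality condition (shown here for b = 1, a = 0) can be verified directly; `conflict_free` is a hypothetical helper name:

```python
from math import gcd

def conflict_free(s, B):
    """Stride-s access of B elements hits B distinct banks iff gcd(s, B) = 1."""
    touched = {(k * s) % B for k in range(B)}
    return len(touched) == B

B = 7  # a prime number of banks
for s in (1, B - 1, B, B + 1):   # column, antidiagonal, row, diagonal strides for m = 7
    print(s, conflict_free(s, B), gcd(s, B) == 1)
```

With prime B = 7, only the stride that is a multiple of 7 itself conflicts; with a composite bank count such as B = 8, every even stride conflicts, which is why a prime (or otherwise carefully chosen) B is attractive despite being awkward for address arithmetic.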
81 6B5 Distributed Shared Memory
Processors 0 to p−1, each with a local memory, connected by an interconnection network; parallel I/O
Some terminology: NUMA Nonuniform memory access (distributed shared memory); UMA Uniform memory access (global shared memory); COMA Cache-only memory architecture
Fig 4.5 A parallel processor with distributed memory
82 Butterfly-Based Distributed Shared Memory
Randomized emulation of the p-processor PRAM on a p-node butterfly
Use a hash function to map memory locations to modules: p locations go to p modules, not necessarily distinct
With high probability, at most O(log p) of the p locations will be in modules located in the same row
Average slowdown = O(log p)
One of p = 2^q(q + 1) processors; each node = router + processor + memory module holding m/p memory locations; 2^q rows, q + 1 columns
Fig 17.2 Butterfly distributed-memory machine emulating the PRAM
83 PRAM Emulation with Butterfly MIN
Emulation of the p-processor PRAM on a (p log p)-node butterfly, with memory modules and processors connected to the two sides; O(log p) avg slowdown
One of p = 2^q processors; memory modules holding m/p memory locations; 2^q rows, q + 1 columns
Less efficient than Fig 17.2, which uses a smaller butterfly
By using p / (log p) physical processors to emulate the p-processor PRAM, this new emulation scheme becomes quite efficient (pipeline the memory accesses of the log p virtual processors assigned to each physical processor)
Fig 17.3 Distributed-memory machine, with a butterfly multistage interconnection network, emulating the PRAM
84 Deterministic Shared-Memory Emulation
Deterministic emulation of the p-processor PRAM on a p-node butterfly
One of p = 2^q(q + 1) processors; each node = router + processor + memory module holding m/p memory locations; 2^q rows, q + 1 columns
Store log₂m copies of each of the m memory location contents Time-stamp each updated value
A write is complete once a majority of copies are updated A read is satisfied when a majority of copies are accessed and the one with the latest time stamp is used
Why it works: a few congested links won't delay the operation; the write set and read set each cover a majority of the log₂m copies, so they must intersect
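The majority (quorum) idea can be sketched abstractly, independent of the butterfly layout; the class and method names below are mine, chosen for illustration:

```python
class ReplicatedCell:
    """Sketch of majority-based emulation of one shared memory location.

    Each of n copies holds (timestamp, value).  A write updates any
    majority of the copies; a read consults any majority and returns
    the value carrying the latest timestamp.  Any two majorities
    intersect, so every read sees the most recent completed write.
    """
    def __init__(self, n):
        self.copies = [(0, None)] * n
        self.clock = 0              # stands in for the time stamps

    def write(self, value, write_set):
        self.clock += 1
        for c in write_set:         # write_set: indices of a majority of copies
            self.copies[c] = (self.clock, value)

    def read(self, read_set):
        # Latest timestamp among the copies consulted wins
        return max(self.copies[c] for c in read_set)[1]

cell = ReplicatedCell(5)
cell.write("x", {0, 1, 2})      # one majority
cell.write("y", {2, 3, 4})      # a different majority
print(cell.read({0, 1, 3}))     # overlaps the last write set, so returns "y"
```

The pigeonhole argument in the docstring is the whole correctness proof: two sets each larger than n/2 share at least one copy, and that shared copy carries the newest timestamp.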
85 PRAM Emulation Using Information Dispersal
Instead of (log m)-fold replication of data, divide each data element into k pieces and encode the pieces with a modest redundancy factor, so that a fraction of the k pieces suffices for reconstructing the original data
Original data word and its k pieces; the k pieces after encoding (approx three times larger); possible update set and possible read set, whose intersection yields enough up-to-date pieces; a reconstruction algorithm recovers the original data word from the encoded pieces
Fig 17.4 Illustrating the information dispersal approach to PRAM emulation with lower data redundancy
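One concrete way to realize information dispersal is polynomial encoding over a prime field: k data words define a degree-(k−1) polynomial, each share is its value at a distinct point, and any k shares interpolate the polynomial back. This sketch (not Rabin's exact algorithm; the names and the field size 257 are my choices, and it uses redundancy factor 2 rather than the slide's figure) shows the mechanism:

```python
P = 257  # small prime field; adequate for data words in 0..255

def interp(shares, x):
    """Lagrange interpolation over GF(P): value at x of the unique
    polynomial through the given (xi, yi) points."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # den^-1 mod P
    return total

def disperse(data, n):
    """Encode k data words as n shares; any k shares reconstruct the data."""
    base = list(enumerate(data))              # polynomial through (i, d_i)
    return [(x, interp(base, x)) for x in range(n)]

def reconstruct(shares, k):
    """Recover the k data words from any k of the shares."""
    return [interp(shares[:k], x) for x in range(k)]

data = [10, 20, 30, 40]            # k = 4 data words
shares = disperse(data, 8)         # n = 2k shares: redundancy factor 2
print(reconstruct(shares[4:], 4))  # the last 4 shares alone suffice
```

Because the first k shares equal the data itself, the code is systematic; losing or skipping any n − k shares costs nothing, which is the availability property the emulation exploits.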
86 6B6 Methods for Memory Latency Hiding
By assumption, the PRAM accesses memory locations right when they are needed, so processing must stall until data is fetched
Not a smart strategy when memory access time is 100s of times the add time
Method 1: Predict accesses (prefetch)
Method 2: Pipeline multiple accesses
Proc 0's access request and access response travel through the processor-to-memory network between the processors and memory modules 0 to m−1; parallel I/O
87 6C Shared-Memory Abstractions
A precise memory view is needed for correct algorithm design Sequential consistency facilitates programming Less strict consistency models offer better performance
Topics in This Chapter
6C1 Atomicity in Memory Access
6C2 Strict and Sequential Consistency
6C3 Processor Consistency
6C4 Weak or Synchronization Consistency
6C5 Other Memory Consistency Models
6C6 Transactional Memory
88 6C1 Atomicity in Memory Access
Performance optimization and latency hiding often imply that memory accesses are interleaved and perhaps not serviced in the order issued
Proc 0's access request and access response travel through the processor-to-memory network between the processors and memory modules 0 to m−1; parallel I/O
89 Barrier Synchronization Overhead
Given that AND is a semigroup computation, it is only a small step to generalize it to a more flexible global combine operation
Reduction of synchronization overhead:
Providing hardware aid to do it faster
Using less frequent synchronizations
Fig The performance benefit of less frequent synchronization (processes P0-P3 proceed along a time axis, with synchronization overhead between "Done" points)
90 Synchronization via Message Passing
Vertex v_j represents task or computation j
Task interdependence is often more complicated than the simple prerequisite structure thus far considered
T_1 = number of nodes of the task graph, T_∞ = depth of the graph, T_p = latency with p processors
Schematic representation of data dependence: process A produces x, process B consumes x
Details of dependence: Process A: send x; Process B: receive x (B waits); the gap is the communication latency
Fig Automatic synchronization in message-passing systems
91 Synchronization with Shared Memory
Accomplished by accessing specially designated shared control variables
The fetch-and-add instruction constitutes a useful atomic operation If the current value of x is c, fetch-and-add(x, a) returns c to the process and overwrites x = c with the value c + a A second process executing fetch-and-add(x, b) then gets the now current value c + a and modifies it to c + a + b
Why atomicity of fetch-and-add is important: with ordinary instructions, the steps of fetch-and-add for A and B may be interleaved as follows:
             Process A   Process B   Comments
Time step 1  read x                  A's accumulator holds c
Time step 2              read x      B's accumulator holds c
Time step 3  add a                   A's accumulator holds c + a
Time step 4              add b      B's accumulator holds c + b
Time step 5  store x                 x holds c + a
Time step 6              store x     x holds c + b (not c + a + b)
92 Barrier Synchronization: Implementations
Make each processor, in a designated set, wait at a barrier until all other processors have arrived at the corresponding points in their computations
Software implementation via fetch-and-add or similar instruction
Hardware implementation via an AND tree (raise flag, check AND result)
A problem with the AND tree: if a processor can be randomly delayed between raising its flag and checking the tree output, some processors might cross the barrier and lower their flags before others have noticed the change in the AND tree output
Solution: use two AND trees for alternating barrier points; flags f_o feed a "set" AND tree and flags f_e feed a "reset" AND tree, driving the S and R inputs of a flip-flop whose Q output is the barrier signal
Fig 4 Example of hardware aid for fast barrier synchronization [Hoar96]
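In software, the role of the two alternating AND trees is played by a flipping "sense" flag. A sketch of such a sense-reversing central barrier (the class name is mine, and it uses a condition variable rather than fetch-and-add):

```python
import threading

class SenseReversingBarrier:
    """Central barrier with a flipping 'sense' flag.

    The alternating sense plays the same role as the slide's two AND
    trees: a processor delayed between signaling arrival and checking
    for release cannot confuse barrier episode k with episode k + 1.
    """
    def __init__(self, n):
        self.n = n
        self.count = n
        self.sense = False
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            my_sense = not self.sense
            self.count -= 1
            if self.count == 0:        # last arrival: release everyone
                self.count = self.n
                self.sense = my_sense
                self.cond.notify_all()
            else:
                while self.sense != my_sense:
                    self.cond.wait()

log = []
barrier = SenseReversingBarrier(4)

def worker(tid):
    log.append((tid, "phase 0"))
    barrier.wait()                     # no one enters phase 1 early
    log.append((tid, "phase 1"))
    barrier.wait()                     # reusable: the sense has flipped back

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()
phases = [p for _, p in log]
print(phases)                          # all phase-0 entries precede phase-1
```

Because each episode waits on its own sense value, a thread that dozes off after arriving still wakes into the correct episode, which is exactly the failure the single AND tree could not prevent.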
93 6C2 Strict and Sequential Consistency
Strict consistency: A read operation always returns the result of the latest write operation on that data object
Strict consistency is impossible to maintain in a distributed system which does not have a global clock While clocks can be synchronized, there is always some error that causes trouble in near-simultaneous operations
Example: Three processes sharing variables 1-4 (r = read, w = write); the figure shows processes A, B, and C each issuing a sequence of reads and writes of the shared variables along a real-time axis
94 Sequential Consistency
Sequential consistency (original def): The result of any execution is the same as if processor operations were executed in some sequential order, and the operations of a particular processor appear in the sequence specified by the program it runs
The figure merges the read/write sequences of processes A, B, and C into two possible sequential orderings, each consistent with the three program orders
Sequential consistency (new def): Write operations on the same data object are seen in exactly the same order by all system nodes
95 The Performance Penalty of Sequential Consistency
Initially X = 0, Y = 0
Thread 1: X := 1; R1 := Y    Thread 2: Y := 1; R2 := X
Example sequentially consistent executions (interleavings respecting each thread's program order), eg:
Exec 1: X := 1; R1 := Y; Y := 1; R2 := X
Exec 2: Y := 1; R2 := X; X := 1; R1 := Y
Exec 3: X := 1; Y := 1; R1 := Y; R2 := X
If a compiler reorders the seemingly independent statements in Thread 2, the desired semantics (R1 and R2 not being both 0) is compromised
Relaxed consistency (memory model): Ease the requirements on doing things in program order and/or write atomicity to gain performance When maintaining order is absolutely necessary, we use synchronization primitives to enforce it
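The claim can be checked by brute force: enumerate every interleaving that respects program order and collect the possible (R1, R2) outcomes (a sketch; all names are mine):

```python
from itertools import combinations

def run(t1, t2):
    """Collect (R1, R2) over every interleaving preserving program order."""
    n1, n2 = len(t1), len(t2)
    results = set()
    for pos in combinations(range(n1 + n2), n1):  # slots taken by thread 1
        mem, regs = {"X": 0, "Y": 0}, {}
        i1, i2 = iter(t1), iter(t2)
        for k in range(n1 + n2):
            op = next(i1) if k in pos else next(i2)
            op(mem, regs)
        results.add((regs["R1"], regs["R2"]))
    return results

# Tiny operation constructors: a write to memory, a read into a register
w = lambda var, val: (lambda mem, regs: mem.__setitem__(var, val))
r = lambda reg, var: (lambda mem, regs: regs.__setitem__(reg, mem[var]))

t1 = [w("X", 1), r("R1", "Y")]
t2 = [w("Y", 1), r("R2", "X")]
print(run(t1, t2))                   # under SC, (0, 0) never occurs
print((0, 0) in run(t1, [r("R2", "X"), w("Y", 1)]))  # reordered Thread 2
```

Under sequential consistency the outcome set is {(0, 1), (1, 0), (1, 1)}; swapping Thread 2's two statements makes (0, 0) reachable, which is precisely the reordering hazard the slide describes.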
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance
More informationParallel Random-Access Machines
Parallel Random-Access Machines Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS3101 (Moreno Maza) Parallel Random-Access Machines CS3101 1 / 69 Plan 1 The PRAM Model 2 Performance
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance
More informationPortland State University ECE 588/688. Directory-Based Cache Coherence Protocols
Portland State University ECE 588/688 Directory-Based Cache Coherence Protocols Copyright by Alaa Alameldeen and Haitham Akkary 2018 Why Directory Protocols? Snooping-based protocols may not scale All
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit
More informationComputer Organization and Structure. Bing-Yu Chen National Taiwan University
Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common
More informationCSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1
CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson
More informationLECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY
LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal
More informationAleksandar Milenkovich 1
Parallel Computers Lecture 8: Multiprocessors Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection
More information4. Networks. in parallel computers. Advances in Computer Architecture
4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors
More informationCOSC 6385 Computer Architecture. - Memory Hierarchies (II)
COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available
More informationAlgorithms and Applications
Algorithms and Applications 1 Areas done in textbook: Sorting Algorithms Numerical Algorithms Image Processing Searching and Optimization 2 Chapter 10 Sorting Algorithms - rearranging a list of numbers
More informationMemory Hierarchy Basics
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases
More informationCaching Basics. Memory Hierarchies
Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby
More informationChapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.
Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!
More informationAdvanced Parallel Architecture. Annalisa Massini /2017
Advanced Parallel Architecture Annalisa Massini - 2016/2017 References Advanced Computer Architecture and Parallel Processing H. El-Rewini, M. Abd-El-Barr, John Wiley and Sons, 2005 Parallel computing
More informationParallel Programming Platforms
arallel rogramming latforms Ananth Grama Computing Research Institute and Department of Computer Sciences, urdue University ayg@cspurdueedu http://wwwcspurdueedu/people/ayg Reference: Introduction to arallel
More informationPARALLEL COMPUTER ARCHITECTURES
8 ARALLEL COMUTER ARCHITECTURES 1 CU Shared memory (a) (b) Figure 8-1. (a) A multiprocessor with 16 CUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different
More informationChapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST
Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial
More informationCS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14
CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors
More informationHypercubes. (Chapter Nine)
Hypercubes (Chapter Nine) Mesh Shortcomings: Due to its simplicity and regular structure, the mesh is attractive, both theoretically and practically. A problem with the mesh is that movement of data is
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationLRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.
LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E
More informationCache Performance (H&P 5.3; 5.5; 5.6)
Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationCS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck
Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationSorting (Chapter 9) Alexandre David B2-206
Sorting (Chapter 9) Alexandre David B2-206 1 Sorting Problem Arrange an unordered collection of elements into monotonically increasing (or decreasing) order. Let S = . Sort S into S =
More informationParallel Architectures
Parallel Architectures Instructor: Tsung-Che Chiang tcchiang@ieee.org Department of Science and Information Engineering National Taiwan Normal University Introduction In the roughly three decades between
More informationAdvanced cache optimizations. ECE 154B Dmitri Strukov
Advanced cache optimizations ECE 154B Dmitri Strukov Advanced Cache Optimization 1) Way prediction 2) Victim cache 3) Critical word first and early restart 4) Merging write buffer 5) Nonblocking cache
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationECE/CS 757: Homework 1
ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)
More informationChapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST
Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will
More informationParallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. Fall Parallel Sorting
Parallel Systems Course: Chapter VIII Sorting Algorithms Kumar Chapter 9 Jan Lemeire ETRO Dept. Fall 2017 Overview 1. Parallel sort distributed memory 2. Parallel sort shared memory 3. Sorting Networks
More informationPerformance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.
Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationPage 1. Multilevel Memories (Improving performance using a little cash )
Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency
More informationThe PRAM Model. Alexandre David
The PRAM Model Alexandre David 1.2.05 1 Outline Introduction to Parallel Algorithms (Sven Skyum) PRAM model Optimality Examples 11-02-2008 Alexandre David, MVP'08 2 2 Standard RAM Model Standard Random
More informationChap. 4 Multiprocessors and Thread-Level Parallelism
Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
More informationDesign of Parallel Algorithms. The Architecture of a Parallel Computer
+ Design of Parallel Algorithms The Architecture of a Parallel Computer + Trends in Microprocessor Architectures n Microprocessor clock speeds are no longer increasing and have reached a limit of 3-4 Ghz
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationThe levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms
The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More information