Part II. Shared-Memory Parallelism. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 1


1 Part II: Shared-Memory Parallelism. Shared-memory algorithms, implementations, and models: Chapter 5, PRAM and Basic Algorithms; Chapter 6A, More Shared-Memory Algorithms; Chapter 6B, Implementation of Shared Memory; Chapter 6C, Shared-Memory Abstractions. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 1

2 About This Presentation. This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. Behrooz Parhami. First edition released Spring 2005; revised Spring 2006, Fall 2008, and several times since, most recently Winter 2016. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 2

3 II Shared-Memory Parallelism. Shared memory is the most intuitive parallel user interface: abstract SM (PRAM) ignores implementation issues; implementation without worsening the memory bottleneck; shared-memory models and their performance impact. Topics in This Part: Chapter 5, PRAM and Basic Algorithms; Chapter 6A, More Shared-Memory Algorithms; Chapter 6B, Implementation of Shared Memory; Chapter 6C, Shared-Memory Abstractions. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 3

4 5 PRAM and Basic Algorithms. PRAM, a natural extension of RAM (random-access machine): present definitions of the model and its various submodels; develop algorithms for key building-block computations. Topics in This Chapter: 5.1 PRAM Submodels and Assumptions; 5.2 Data Broadcasting; 5.3 Semigroup or Fan-in Computation; 5.4 Parallel Prefix Computation; 5.5 Ranking the Elements of a Linked List; 5.6 Matrix Multiplication. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 4

5 Why Start with Shared Memory? Study one extreme of parallel computation models: abstract SM (PRAM), which ignores implementation issues. This abstract model is either realized or emulated; in the latter case, the benefits are similar to those of HLLs. At the other extreme of the spectrum of models lies the concrete circuit model, which incorporates hardware details, allows explicit latency/area/energy trade-offs, and facilitates theoretical studies of speed-up limits. Everything else falls between these two extremes. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 5

6 5.1 PRAM Submodels and Assumptions. Fig. 4.6: Conceptual view of a parallel random-access machine (PRAM): processors 0 to p-1 connected to a shared memory of m locations. Processor i can do the following in the three phases of one cycle: fetch a value from address s_i in shared memory; perform computations on data held in local registers; store a value into address d_i in shared memory. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 6

7 Types of PRAM. Classified by reads from the same location (Exclusive or Concurrent) and writes to the same location (Exclusive or Concurrent): EREW, least powerful, most realistic; ERCW, not useful; CREW, default; CRCW, most powerful, further subdivided. Fig. 5.1: Submodels of the PRAM model. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 7

8 Types of CRCW PRAM. CRCW submodels are distinguished by the way they treat multiple writes. Undefined: the value written is undefined (CRCW-U). Detecting: a special code for detected collision is written (CRCW-D). Common: allowed only if they all store the same value (CRCW-C) [this is sometimes called the consistent-write submodel]. Random: the value is randomly chosen from those offered (CRCW-R). Priority: the processor with the lowest index succeeds (CRCW-P). Max/Min: the largest/smallest of the values is written (CRCW-M). Reduction: the arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or another combination of values is written. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 8

9 Power of CRCW PRAM Submodels. Model U is more powerful than model V if T_U(n) = o(T_V(n)) for some problem. EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P. Theorem 5.1: A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p). Intuitive justification for concurrent read emulation (write is similar): write the p memory addresses in a list; sort the list of addresses in ascending order; remove all duplicate addresses; access data at the desired addresses; replicate data via parallel prefix computation. Each of these steps requires constant or O(log p) time. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 9

10 Implications of the CRCW Hierarchy of Submodels. EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P. A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p). Thus our most powerful CRCW PRAM submodel can be emulated by the least powerful submodel with logarithmic slowdown. Efficient parallel algorithms have polylogarithmic running times, and the running time is still polylogarithmic after the slowdown due to emulation, so we need not be too concerned with the CRCW submodel used: simply use whichever submodel is most natural or convenient. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 10

11 Some Elementary PRAM Computations. Initializing an n-vector (base address = B) to all 0s: for j = 0 to ceil(n/p) - 1, processor i, 0 ≤ i < p, do: if jp + i < n then M[B + jp + i] := 0; endfor (the vector is covered in ceil(n/p) segments of p elements each). Adding two n-vectors and storing the results in a third (three base addresses). Convolution of two n-vectors: W_k = sum over i + j = k of U_i V_j (base addresses B_W, B_U, B_V). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 11
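As a concrete illustration of the initialization pattern above, here is a minimal sequential simulation in Python (the function name, arguments, and memory layout are illustrative assumptions, not from the slides):

# Sketch of the PRAM vector-initialization pattern (hypothetical simulation).
# In segment j, processor i clears element j*p + i of the vector.
def init_vector(M, B, n, p):
    for j in range((n + p - 1) // p):          # ceil(n/p) segments
        for i in range(p):                     # all p processors act in parallel
            if j * p + i < n:
                M[B + j * p + i] = 0

memory = [None] * 16
init_vector(memory, B=2, n=10, p=4)
print(memory[2:12])   # ten zeroed elements starting at base address 2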

12 5.2 Data Broadcasting. Making p copies of B[0] by recursive doubling: for k = 0 to ceil(log2 p) - 1, Proc j, 0 ≤ j < 2^k, do: copy B[j] into B[j + 2^k]; endfor. Can modify the algorithm so that redundant copying does not occur and the array bound is not exceeded. Fig. 5.2: Data broadcasting in EREW PRAM via recursive doubling. Fig. 5.3: EREW PRAM data broadcasting without redundant copying. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 12
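A sketch of the recursive-doubling broadcast, simulated sequentially in Python; the bound check mentioned above keeps the copies inside the array (names are hypothetical):

# Sketch of EREW broadcasting by recursive doubling (simulation, not real parallel code).
import math

def broadcast(value, p):
    B = [None] * p
    B[0] = value                               # processor 0 holds the value initially
    for k in range(math.ceil(math.log2(p))):   # about log2(p) doubling steps
        step = 2 ** k
        for j in range(step):                  # processors 0 .. step-1 copy in parallel
            if j + step < p:                   # guard against exceeding the array bound
                B[j + step] = B[j]
    return B

print(broadcast(42, 10))   # all 10 entries become 42 after 4 steps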

13 Class Participation: Broadcast-Based Sorting. Each person writes down an arbitrary nonnegative integer with a few digits on a piece of paper. Students take turns broadcasting their numbers by calling them out aloud. Each student puts an X on paper for every number called out that is smaller than his/her own number, or is equal but was called out before the student's own value. Each student counts the number of Xs on paper to determine the rank of his/her number. Students call out their numbers in order of the computed rank. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 13

14 All-to-All Broadcasting on EREW PRAM. EREW PRAM algorithm for all-to-all broadcasting: Processor j, 0 ≤ j < p, write own data value into B[j]; for k = 1 to p - 1, Processor j, 0 ≤ j < p, do: read the data value in B[(j + k) mod p]; endfor. This O(p)-step algorithm is time-optimal. Naive EREW PRAM sorting algorithm (using all-to-all broadcasting): Processor j, 0 ≤ j < p, write 0 into R[j]; for k = 1 to p - 1, Processor j, 0 ≤ j < p, do: l := (j + k) mod p; if S[l] < S[j] or (S[l] = S[j] and l < j) then R[j] := R[j] + 1; endif; endfor; Processor j, 0 ≤ j < p, write S[j] into S[R[j]]. This O(p)-step sorting algorithm is far from optimal; sorting is possible in O(log p) time. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 14

15 5.3 Semigroup or Fan-in Computation. EREW PRAM semigroup computation algorithm: Proc j, 0 ≤ j < p, copy X[j] into S[j]; s := 1; while s < p: Proc j, 0 ≤ j < p - s, do S[j + s] := S[j] ⊗ S[j + s]; s := 2s; endwhile; broadcast S[p - 1] to all processors. This algorithm is optimal for PRAM, but its speedup of O(p / log p) is not. If we use p processors on a list of size n = O(p log p), then optimal speedup can be achieved. Fig. 5.4: Semigroup computation in EREW PRAM. Fig. 5.5: Intuitive justification of why parallel slack helps improve the efficiency: lower degree of parallelism near the root, higher degree of parallelism near the leaves. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 15

16 5.4 Parallel Prefix Computation. Same as the first part of the semigroup computation (no final broadcasting): after step k, S[j] holds the combination of the (up to) 2^(k+1) inputs ending at position j, so after about log2 p steps S[j] holds the prefix result 0:j. Fig. 5.6: Parallel prefix computation in EREW PRAM via recursive doubling. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 16
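The same recursive-doubling idea, written as a small sequential Python simulation of the parallel prefix computation (an illustrative sketch; the operator and variable names are assumptions):

# Sketch of parallel prefix (inclusive scan) by recursive doubling,
# simulated sequentially; S[j] ends up holding x[0] op x[1] op ... op x[j].
import operator

def prefix_scan(x, op=operator.add):
    S = list(x)
    s = 1
    while s < len(S):
        # on an EREW PRAM, processors j >= s would do this step concurrently
        new_S = S[:]
        for j in range(s, len(S)):
            new_S[j] = op(S[j - s], S[j])
        S = new_S
        s *= 2
    return S

print(prefix_scan([1, 2, 3, 4, 5]))   # [1, 3, 6, 10, 15]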

17 A Divide-and-Conquer Parallel-Prefix Algorithm. Inputs x_0, x_1, ..., x_(n-2), x_(n-1): the inputs are combined pairwise, a parallel prefix computation of size n/2 is performed on the pair results, and one more combining step produces the remaining prefixes 0:0 through 0:n-1 (each vertical line in the figure represents a location in shared memory). T(p) = T(p/2) + 2, so T(p) ≈ 2 log2 p. In hardware, this is the basis for the Brent-Kung carry-lookahead adder. Fig. 5.7: Parallel prefix computation using a divide-and-conquer scheme. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 17

18 Another Divide-and-Conquer Algorithm. Inputs x_0, x_1, ..., x_(n-2), x_(n-1): perform a parallel prefix computation on the n/2 even-indexed inputs and, in parallel, on the n/2 odd-indexed inputs (each vertical line represents a location in shared memory); one combining step then yields all prefixes 0:0 through 0:n-1. T(p) = T(p/2) + 1 = log2 p. Strictly optimal algorithm, but requires commutativity. Fig. 5.8: Another divide-and-conquer scheme for parallel prefix computation. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 18

19 5.5 Ranking the Elements of a Linked List. Fig. 5.9: Example linked list and the ranks of its elements; each record has info and next fields, and the rank of an element is its distance from the terminal element (the distance from the head is also shown). List ranking appears to be hopelessly sequential; one cannot get to a list element except through its predecessor! Fig. 5.10: PRAM data structures representing a linked list (head pointer plus info, next, and rank arrays) and the ranking results. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 19

20 List Ranking via Recursive Doubling. Many problems that appear to be unparallelizable to the uninitiated are parallelizable; intuition can be quite misleading! Fig. 5.11: Element ranks initially and after each of the three iterations. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 20

21 PRAM List Ranking Algorithm. Question: Which PRAM submodel is implicit in this algorithm? PRAM list ranking algorithm (via pointer jumping): 1. Processor j, 0 ≤ j < p, do {initialize the partial ranks}: if next[j] = j then rank[j] := 0 else rank[j] := 1 endif; 2. while rank[next[head]] ≠ 0, Processor j, 0 ≤ j < p, do: rank[j] := rank[j] + rank[next[j]]; next[j] := next[next[j]]; endwhile. Answer: CREW. If we do not want to modify the original list, we simply make a copy of it first, in constant time. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 21
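A hedged Python sketch of the pointer-jumping idea, run as synchronous rounds on a copy of the list (the example chain and the names are hypothetical, not the list in Fig. 5.11):

# Sketch of list ranking by pointer jumping (CREW reads of next[next[j]]),
# simulated with synchronous rounds; the terminal element points to itself.
def list_rank(next_ptr):
    n = len(next_ptr)
    rank = [0 if next_ptr[j] == j else 1 for j in range(n)]
    nxt = list(next_ptr)
    for _ in range(max(1, n.bit_length())):        # about log2(n) rounds suffice
        rank, nxt = ([rank[j] + rank[nxt[j]] for j in range(n)],
                     [nxt[nxt[j]] for j in range(n)])
    return rank

# hypothetical chain A -> B -> C -> D -> E -> F (F terminal), stored in scrambled order
nxt = {'A': 'B', 'B': 'C', 'C': 'D', 'D': 'E', 'E': 'F', 'F': 'F'}
order = ['C', 'F', 'A', 'E', 'B', 'D']
idx = {name: i for i, name in enumerate(order)}
print(list_rank([idx[nxt[name]] for name in order]))   # [3, 0, 5, 1, 4, 2]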

22 5.6 Matrix Multiplication. Sequential matrix multiplication for m × m matrices: for i = 0 to m - 1 do; for j = 0 to m - 1 do; t := 0; for k = 0 to m - 1 do t := t + a_ik b_kj endfor; c_ij := t; endfor; endfor. PRAM solution with m^3 processors: each processor does one multiplication (not very efficient); c_ij := sum over k = 0 to m - 1 of a_ik b_kj. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 22

23 PRAM Matrix Multiplication with m^2 Processors. PRAM matrix multiplication using m^2 processors: Proc (i, j), 0 ≤ i, j < m, do: begin t := 0; for k = 0 to m - 1 do t := t + a_ik b_kj endfor; c_ij := t; end. Processors are numbered (i, j), instead of 0 to m^2 - 1. Θ(m) steps: time-optimal. The CREW model is implicit. Fig. 5.12: PRAM matrix multiplication; p = m^2 processors. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 23

24 PRAM Matrix Multiplication with m Processors. PRAM matrix multiplication using m processors: for j = 0 to m - 1, Proc i, 0 ≤ i < m, do: t := 0; for k = 0 to m - 1 do t := t + a_ik b_kj endfor; c_ij := t; endfor. Θ(m^2) steps: time-optimal. The CREW model is implicit, but because the order of multiplications is immaterial, accesses to B can be skewed to allow the EREW model. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 24

25 PRAM Matrix Multiplication with Fewer Processors. The algorithm is similar, except that each processor is in charge of computing m/p rows of C. Θ(m^3/p) steps: time-optimal; the EREW model can be used. A drawback of all algorithms thus far is that only two arithmetic operations (one multiplication and one addition) are performed for each memory access. This is particularly costly for NUMA shared-memory machines. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 25

26 More Efficient Matrix Multiplication (for NUMA). Partition each matrix into p square blocks of size q × q, where q = m/√p; one processor computes the elements of C that it holds in local memory. For the 2 × 2 block case: (A B; C D) × (E F; G H) = (AE+BG, AF+BH; CE+DG, CF+DH). Block matrix multiplication follows the same algorithm as simple matrix multiplication, with block-band i of the first matrix and block-band j of the second producing block (i, j) of the result. Fig. 5.13: Partitioning the matrices for block matrix multiplication. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 26

27 Details of Block Matrix Multiplication. In each step, an element of block (i, k) in matrix A is multiplied by one block-row of block (k, j) in matrix B, and the products are added into one block-row of block (i, j) in matrix C. A multiply-add computation on q × q blocks needs 2q^2 = 2m^2/p memory accesses and 2q^3 arithmetic operations, so q arithmetic operations are done per memory access. Fig. 5.14: How processor (i, j) operates on an element of A and one block-row of B to update one block-row of C. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 27
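A small NumPy sketch of block matrix multiplication under the partitioning described above (assuming m is divisible by the block size q; the loop nest stands in for the work that processor (i/q, j/q) would do on its local block of C):

# Sketch of block matrix multiplication; processor (i//q, j//q) owns C[i:i+q, j:j+q]
# and accumulates A-block times B-block products into it, one k-block at a time.
import numpy as np

def block_matmul(A, B, q):
    m = A.shape[0]
    C = np.zeros((m, m))
    for i in range(0, m, q):
        for j in range(0, m, q):
            for k in range(0, m, q):
                C[i:i+q, j:j+q] += A[i:i+q, k:k+q] @ B[k:k+q, j:j+q]
    return C

m, p = 8, 16                                    # q = m / sqrt(p) = 2
A, B = np.random.rand(m, m), np.random.rand(m, m)
assert np.allclose(block_matmul(A, B, q=2), A @ B)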

28 6A More Shared-Memory Algorithms. Develop PRAM algorithms for more complex problems: searching, selection, sorting, and other nonnumerical tasks; must present background on the problem in some cases. Topics in This Chapter: 6A.1 Parallel Searching Algorithms; 6A.2 Sequential Rank-Based Selection; 6A.3 A Parallel Selection Algorithm; 6A.4 A Selection-Based Sorting Algorithm; 6A.5 Alternative Sorting Algorithms; 6A.6 Convex Hull of a 2D Point Set. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 28

29 6A.1 Parallel Searching Algorithms. Searching an unordered list in PRAM. Sequential time: n comparisons in the worst case, about n/2 on average. Divide the list of n items into p segments of ceil(n/p) items (the last segment may have fewer); processor i will be in charge of ceil(n/p) list elements, beginning at address i × ceil(n/p). Parallel time: ceil(n/p) in the worst case, ??? on average. Perfect speedup of p with p processors? There are pre- and postprocessing overheads. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 29

30 Parallel (p + 1)-ary Search on PRAM. p probes, rather than 1, per step; number of steps: log_(p+1)(n + 1) = log2(n + 1) / log2(p + 1) = Θ(log n / log p). Speedup ≈ log p. Optimal: no comparison-based search algorithm can be faster. The figure (example from Sec. 8) steps two processors through a sorted list in three probe steps. A single search in a sorted list can't be significantly speeded up through parallel processing, but all hope is not lost: dynamic data (sorting overhead) and batch searching (multiple lookups) still benefit from parallelism. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 30

31 6A.2 Sequential Rank-Based Selection. Selection: find the (or a) kth smallest among n elements; example: finding the 5th smallest element in a given list. Naive solution through sorting: O(n log n) time. But a linear-time sequential algorithm can be developed, built around m, the median of the medians of the n/q sublists: at least n/4 of the elements are ≤ m and at least n/4 are ≥ m, so for the sublists L (elements < m), E (elements = m), and G (elements > m) we have max(|L|, |G|) ≤ 3n/4; if k ≤ |L| the answer lies in L, and if k > |L| + |E| it lies in G. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 31

32 Linear-Time Sequential Selection Algorithm. Sequential rank-based selection algorithm select(S, k): 1. if |S| < q (q a small constant) then sort S and return the kth smallest element of S; else divide S into ceil(|S|/q) subsequences of size q, sort each subsequence and find its median, and let the ceil(|S|/q) medians form the sequence T endif [O(n)]; 2. m = select(T, ceil(|T|/2)) {find the median m of the medians} [T(n/q)]; 3. create the subsequences L (elements of S that are < m), E (elements of S that are = m), and G (elements of S that are > m) [O(n)]; 4. if |L| ≥ k then return select(L, k); else if |L| + |E| ≥ k then return m; else return select(G, k - |L| - |E|) endif [T(3n/4), to be justified: since m is the median of the medians, at least n/4 elements are ≤ m and at least n/4 are ≥ m, so max(|L|, |G|) ≤ 3n/4]. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 32
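A runnable Python sketch of the select(S, k) recursion above with q = 5 (the variable names and the sample list are illustrative, not the slide's example):

# Sketch of the sequential rank-based selection algorithm (median of medians), q = 5;
# select(S, k) returns the kth smallest element of S (k = 1 .. len(S)).
def select(S, k, q=5):
    if len(S) < q:
        return sorted(S)[k - 1]
    sublists = [S[i:i+q] for i in range(0, len(S), q)]
    T = [sorted(x)[len(x) // 2] for x in sublists]      # medians of the sublists
    m = select(T, (len(T) + 1) // 2)                    # median of the medians
    L = [x for x in S if x < m]
    E = [x for x in S if x == m]
    G = [x for x in S if x > m]
    if len(L) >= k:
        return select(L, k)
    if len(L) + len(E) >= k:
        return m
    return select(G, k - len(L) - len(E))

data = [9, 1, 7, 3, 8, 2, 6, 5, 4, 0]
print(select(data, 5))   # 4, the 5th smallest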

33 Algorithm Complexity and Examples. T(n) = T(n/q) + T(3n/4) + cn. We must have q ≥ 5; for q = 5, the solution is T(n) = 20cn. Worked example (n = 25, q = 5): S is divided into n/q = 5 sublists of q = 5 elements; their medians form T; m is the median of the medians; splitting S around m gives |L| = 7, |E| = 2, |G| = 16. To find the 5th smallest element in S, select the 5th smallest element in L. The 9th smallest element of S is m itself, and the 13th smallest element of S is found by selecting the 4th smallest element in G (13 - 7 - 2 = 4). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 33

34 6A.3 A Parallel Selection Algorithm. Parallel rank-based selection algorithm PRAMselect(S, k, p), with p = O(n^(1-x)) processors: 1. if |S| < 4 then sort S and return the kth smallest element of S; else broadcast |S| to all p processors, divide S into p subsequences S(j) of size |S|/p, and have Processor j, 0 ≤ j < p, compute T_j := select(S(j), |S(j)|/2) endif [O(n^x) plus the small sequential selections]; 2. m = PRAMselect(T, |T|/2, p) {median of the medians} [T(n^x, p)]; 3. broadcast m to all processors and create the subsequences L (elements of S that are < m), E (= m), and G (> m) [O(n^x)]; 4. if |L| ≥ k then return PRAMselect(L, k, p); else if |L| + |E| ≥ k then return m; else return PRAMselect(G, k - |L| - |E|, p) endif [T(3n/4, p)]. As in the sequential version, m is the median of the medians, so at least n/4 elements are ≤ m and at least n/4 are ≥ m. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 34

35 Algorithm Complexity and Efficiency. T(n, p) = T(n^x, p) + T(3n/4, p) + cn^x. The solution is O(n^x); verify by substitution. Speedup = Θ(n)/O(n^x) = Ω(n^(1-x)) = Ω(p). Efficiency = speedup / p = Ω(1). Work(n, p) = pT(n, p) = Θ(n^(1-x)) × Θ(n^x) = Θ(n). Remember p = O(n^(1-x)). What happens if we set x to 1? (i.e., use one processor): T(n, 1) = O(n^x) = O(n). What happens if we set x to 0? (i.e., use n processors): T(n, n) = O(n^x) = O(1)? No, because in the asymptotic analysis we ignored several O(log n) terms compared with the O(n^x) terms. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 35

36 Data Movement in Step 3 of the Algorithm. Among the n^x items handled by a processor, suppose a items are < m (destined for L), b items are = m (destined for E), and c items are > m (destined for G). Consider the sublist L: processor i contributes a_i items to this sublist. Processor 0 starts storing at location 0, processor 1 at location a_0, processor 2 at location a_0 + a_1, processor 3 at location a_0 + a_1 + a_2, and so on; these write offsets are obtained with a parallel prefix computation over the counts. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 36
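The per-processor write offsets are exactly an exclusive (diminished) prefix sum of the counts a_i; a tiny Python sketch with hypothetical names:

# Sketch: exclusive prefix sum of the per-processor counts gives each processor's
# starting location in the packed sublist L.
from itertools import accumulate

def write_offsets(counts):
    # processor 0 starts at 0, processor i at a_0 + ... + a_(i-1)
    return [0] + list(accumulate(counts))[:-1]

a = [3, 1, 4, 2]                 # items contributed to L by processors 0..3
print(write_offsets(a))          # [0, 3, 4, 8]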

37 6A.4 A Selection-Based Sorting Algorithm. Parallel selection-based sort PRAMselectionsort(S, p), with p = n^(1-x) and k = 2^(1/x): 1. if |S| < k then return quicksort(S) [O(1) test]; 2. for i = 1 to k - 1 do m_i := PRAMselect(S, i|S|/k, p) {let m_0 := -infinity; m_k := +infinity} endfor [O(n^x)]; 3. for i = 0 to k - 1 do make the sublist T(i) from the elements of S in (m_i, m_(i+1)) endfor [O(n^x)]; 4. for i = 1 to k/2 do in parallel PRAMselectionsort(T(i), 2p/k) {2p/k processors used for each of the k/2 subproblems} endfor [T(n/k, 2p/k)]; 5. for i = k/2 + 1 to k do in parallel PRAMselectionsort(T(i), 2p/k) endfor [T(n/k, 2p/k)]. Fig. 6.1: Partitioning of the sorted list for selection-based sorting: k sublists of n/k elements each, delimited by the thresholds m_0 < m_1 < ... < m_k. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 37

38 Algorithm Complexity and Efficiency. T(n, p) = 2T(n/k, 2p/k) + cn^x. The solution is O(n^x log n); verify by substitution. Speedup(n, p) = Ω(n log n)/O(n^x log n) = Ω(n^(1-x)) = Ω(p). Efficiency = speedup / p = Ω(1). Work(n, p) = pT(n, p) = Θ(n^(1-x)) × Θ(n^x log n) = Θ(n log n). Remember p = O(n^(1-x)). What happens if we set x to 1? (i.e., use one processor): T(n, 1) = O(n^x log n) = O(n log n). Our asymptotic analysis is valid for x > 0 but not for x = 0; i.e., PRAMselectionsort cannot sort p keys in optimal O(log p) time. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 38

39 Example of Parallel Sorting. S is a list of n = 25 keys. Threshold values for k = 4 (i.e., x = 1/2 and p = n^(1/2) = 5 processors): m_0 = -infinity; n/k = 25/4, so m_1 = PRAMselect(S, 6, 5); 2n/k = 50/4, so m_2 = PRAMselect(S, 13, 5) = 4; 3n/k = 75/4, so m_3 = PRAMselect(S, 19, 5) = 6; m_4 = +infinity. The keys of S are then distributed into the four sublists delimited by these thresholds, and each sublist is sorted (in two parallel batches). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 39

40 6A.5 Alternative Sorting Algorithms. Sorting via random sampling (assume p << n). Given a large list S of inputs, a random sample of the elements can be used to find k comparison thresholds; it is easier if we pick k = p, so that each of the resulting subproblems is handled by a single processor. Parallel randomized sort PRAMrandomsort(S, p): 1. Processor j, 0 ≤ j < p, pick |S|/p^2 random samples of its |S|/p elements and store them in its corresponding section of a list T of length |S|/p; 2. Processor 0 sort the list T {the comparison threshold m_i is the (i|S|/p^2)th element of T}; 3. Processor j, 0 ≤ j < p, store its elements falling in (m_i, m_(i+1)) into T(i); 4. Processor j, 0 ≤ j < p, sort the sublist T(j). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 40

41 Parallel Radixsort. In the binary version of radixsort, we examine every bit of the k-bit keys in turn, starting from the LSB. In step i, bit i is examined, 0 ≤ i < k; records are stably sorted by the value of the ith key bit. The slide's table shows an input list of small integers with their binary forms, and the list after sorting by the LSB, by the middle bit, and by the MSB. Question: how are the data movements performed? Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 41

42 Data Movements in Parallel Radixsort. For the current bit, complement the bit values and compute a diminished (exclusive) prefix sum over the complemented bits to obtain the new positions of the records whose bit is 0; then compute a prefix sum over the bit values, shifted up by the count of 0s, to obtain the positions of the records whose bit is 1 (the slide's table walks through one such pass). The running time consists mainly of the time to perform 2k parallel prefix computations: O(log p) for constant k. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 42
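A sequential Python sketch of one stable radixsort pass expressed through prefix sums, followed by the full LSB-first loop (the sample key list is arbitrary, not necessarily the slide's example):

# Sketch of one pass of parallel radixsort via prefix sums: records whose current
# bit is 0 are packed first (stably), then records whose bit is 1.
from itertools import accumulate

def radix_pass(keys, bit):
    flags = [(k >> bit) & 1 for k in keys]
    zeros = [1 - f for f in flags]
    pos0 = [s - z for s, z in zip(accumulate(zeros), zeros)]   # exclusive prefix sums
    n_zeros = sum(zeros)
    pos1 = [n_zeros + s - f for s, f in zip(accumulate(flags), flags)]
    out = [None] * len(keys)
    for i, k in enumerate(keys):
        out[pos1[i] if flags[i] else pos0[i]] = k
    return out

def radix_sort(keys, bits):
    for b in range(bits):                     # LSB first, one stable pass per bit
        keys = radix_pass(keys, b)
    return keys

print(radix_sort([5, 7, 3, 1, 4, 2, 7, 2], bits=3))   # [1, 2, 2, 3, 4, 5, 7, 7]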

43 6A.6 Convex Hull of a 2D Point Set. Best sequential algorithm for p points: Ω(p log p) steps. Fig. 6.2: Defining the convex hull problem for a point set Q in the x-y plane. Fig. 6.3: Illustrating the properties of the convex hull CH(Q): the hull vertices are encountered in order of angle as the hull is traversed, and all points of Q fall on one side of each hull edge. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 43

44 PRAM Convex Hull Algorithm. Parallel convex hull algorithm PRAMconvexhull(S, p): 1. sort the point set by x coordinates; 2. divide the sorted list into √p subsets Q(i) of size √p, 0 ≤ i < √p; 3. find the convex hull of each subset Q(i) using √p processors; 4. merge the √p convex hulls CH(Q(i)) into the overall hull CH(Q). The upper hull and the lower hull are handled similarly. Fig. 6.4: Multiway divide and conquer for the convex hull problem. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 44

45 Merging of Partial Convex Hulls. Tangent lines between partial hulls are found through binary search in log time. Case (a): no point of CH(Q(i)) is on CH(Q); case (b): the points of CH(Q(i)) from A to B are on CH(Q). Analysis: T(p, p) = T(p^(1/2), p^(1/2)) + c log p ≈ 2c log p; the initial sorting also takes O(log p) time. Fig. 6.5: Finding points in a partial hull that belong to the combined hull. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 45

46 6B Implementation of Shared Memory. Main challenge: easing the memory access bottleneck, by providing low-latency, high-bandwidth paths to memory, reducing the need for access to nonlocal memory, and reducing conflicts and sensitivity to memory latency. Topics in This Chapter: 6B.1 Processor-Memory Interconnection; 6B.2 Multistage Interconnection Networks; 6B.3 Cache Coherence Protocols; 6B.4 Data Allocation for Conflict-Free Access; 6B.5 Distributed Shared Memory; 6B.6 Methods for Memory Latency Hiding. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 46

47 About the New Chapter 6B. This new chapter incorporates material from the following existing sections of the book: Some Implementation Aspects; Dimension-Order Routing; Plus-or-Minus-2^i Network; Multistage Interconnection Networks; Distributed Shared Memory; Data Access Problems and Caching; Cache Coherence Protocols; Multithreading and Latency Hiding; Coordination and Synchronization. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 47

48 Making PRAM Practical. PRAM needs a read and a write access to memory in every cycle. Even for a sequential computer, memory is tens to hundreds of times slower than arithmetic/logic operations; multiple processors accessing a shared memory only make the situation worse. Shared access to a single large physical memory isn't scalable. Strategies and focal points for making PRAM practical: 1. make memory accesses faster and more efficient (pipelining); 2. reduce the number of memory accesses (caching, reordering of accesses so as to do more computation per item fetched/stored); 3. reduce synchronization, so that slow memory accesses for one computation do not slow down others (synch, memory consistency); 4. distribute the memory and data to make most accesses local; 5. store data structures so as to reduce access conflicts (skewed storage). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 48

49 6B.1 Processor-Memory Interconnection. Fig. 4.3: A parallel processor with global (shared) memory: processors 0 to p-1 connect through a processor-to-memory network (with a separate processor-to-processor network) to memory modules 0 to m-1, plus parallel I/O. Options for the processor-to-memory network: bus(es), a bottleneck; crossbar, complex and expensive; MIN, a compromise. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 49

50 Processor-to-Memory Network. Crossbar switches offer full permutation capability (they are nonblocking), but are complex and expensive: O(p^2) cost. Even with a permutation network, full PRAM functionality is not realized: two processors cannot access different addresses in the same memory module. Practical processor-to-memory networks cannot realize all permutations (they are blocking). The figure shows an 8 × 8 crossbar switch connecting processors P0-P7 to memory modules M0-M7. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 50

51 Bus-Based Interconnections. Single-bus system: bandwidth bottleneck; bus loading limit; scalability very poor; single failure point; conceptually simple; forced serialization. Multiple-bus system: bandwidth improved; bus loading limit; scalability poor; more robust; more complex scheduling; simple serialization. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 51

52 Back-of-the-Envelope Bus Bandwidth Calculation. Single-bus system: bus frequency 0.5 GHz, data width 256 b (32 B), one memory access taking 2 bus cycles, so peak bandwidth = (0.5 G / 2) × 32 B = 8 GB/s. Bus cycle = 2 ns; memory cycle = 100 ns, so 1 memory cycle = 50 bus cycles. Multiple-bus system: peak bandwidth is multiplied by the number of buses (actual bandwidth is likely to be much less). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 52
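The same arithmetic, written out as a few lines of Python using the values as reconstructed above (treat the numbers as assumptions carried over from the slide):

# Back-of-the-envelope peak bus bandwidth: 0.5 GHz bus, 32-byte transfers,
# one memory access per 2 bus cycles.
bus_freq_hz = 0.5e9
bytes_per_transfer = 32            # 256 bits
cycles_per_access = 2
peak_bw = bus_freq_hz / cycles_per_access * bytes_per_transfer
print(peak_bw / 1e9, "GB/s")       # 8.0 GB/s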

53 Hierarchical Bus Interconnection. Low-level clusters of processors share intracluster buses, with bus switches (gateways) linking them to higher-level buses; the result can be a heterogeneous system. Fig. 4.9: Example of a hierarchical interconnection architecture. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 53

54 Removing the Processor-to-Memory Bottleneck. Challenge: cache coherence. Fig. 4.4: A parallel processor with global memory and processor caches: processors 0 to p-1, each with a private cache, connect through a processor-to-memory network to memory modules 0 to m-1, plus parallel I/O. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 54

55 Why Data Caching Works. Hit rate r (fraction of memory accesses satisfied by cache): C_eff = C_fast + (1 - r) C_slow. Cache parameters: size; block length (line width); placement policy; replacement policy; write policy. The figure (data storage and access in a two-way set-associative cache) shows the address split into tag, index, and word-offset fields; the two candidate words and their tags are read out, and the tag comparisons either select the data or signal a cache miss. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 55

56 Benefits of Caching Formulated as Amdahl's Law. Hit rate r (fraction of memory accesses satisfied by cache): C_eff = C_fast + (1 - r) C_slow, so S = C_slow / C_eff = 1 / ((1 - r) + C_fast/C_slow). (Analogy from Parhami's Computer Architecture text, 2005: register file = desktop, cache = drawer, main memory = file cabinet, with progressively longer access times.) This corresponds to the miss-rate fraction 1 - r of accesses being unaffected and the hit-rate fraction r (almost 1) being speeded up by a factor C_slow/C_fast. Generalized form of Amdahl's speedup formula: S = 1/(f_1/p_1 + f_2/p_2 + ... + f_m/p_m), with f_1 + f_2 + ... + f_m = 1. In this case, a fraction 1 - r is slowed down by a factor (C_slow + C_fast)/C_slow, and a fraction r is speeded up by a factor C_slow/C_fast. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 56
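A few lines of Python evaluating the formula above for sample hit rates (the cost ratio C_slow/C_fast = 50 is an arbitrary illustrative choice, not a value from the slide):

# Caching speedup: C_eff = C_fast + (1 - r) * C_slow, S = C_slow / C_eff.
def cache_speedup(r, c_fast, c_slow):
    c_eff = c_fast + (1 - r) * c_slow
    return c_slow / c_eff

for r in (0.90, 0.95, 0.99):
    print(r, round(cache_speedup(r, c_fast=1, c_slow=50), 1))
# as r approaches 1, the speedup approaches C_slow / C_fast = 50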

57 6B.2 Multistage Interconnection Networks. The vector-parallel Cray Y-MP computer. The figure shows key elements of the Cray Y-MP processor (address registers, address function units, instruction buffers, and control not shown): the CPU's vector registers V0-V7, T and S registers (8 64-bit each), vector integer units (add, shift, logic, weight/parity), floating-point units (add, multiply, reciprocal approximation), scalar integer units, the central memory, inter-processor communication, and input/output. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 57

58 Cray Y-MP's Interconnection Network. Fig.: The processor-to-memory interconnection network of the Cray Y-MP: processors P0-P7 connect through sections and subsections to interleaved memory banks, with banks assigned to subsections in strides of 4 (bank numbers 0, 4, 8, ... in one group, 1, 5, 9, ... in the next, and so on, up to bank 255). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 58

59 Butterfly Processor-to-Memory Network. Fig. 6.9: Example of a multistage memory access network: p processors, log2 p columns of 2-by-2 switches, p memory banks. Two ways to use the butterfly: with 1-by-2 and 2-by-1 edge switches, or with 2-by-2 switches throughout. It is not a full permutation network (e.g., a processor cannot be connected to the memory bank alongside the two connections shown). It is self-routing: the bank address determines the route, each address bit selecting the upper or lower switch output in the corresponding column (in the slide's example, the request is routed lower, upper, upper). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 59

60 Butterfly as Multistage Interconnection Network. Fig. 6.9: Example of a multistage memory access network: p processors, log2 p (or log2 p + 1) columns of 2-by-2 switches, p memory banks. Generalization of the butterfly network: a high-radix or m-ary butterfly, built of m-by-m switches, has m^q rows and q + 1 columns (q if wrapped). Fig. 15.8: Butterfly network used to connect modules that are on the same side. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 60

61 Self-Routing on a Butterfly Network. Fig. 14.6: Example dimension-order routing paths (dimensions 0, 1, 2; ascend, then descend). The number of cross links taken equals the length of the path in the hypercube. From node 3 to node 6: routing tag = 011 xor 110 = 101 = cross-straight-cross. From node 3 to node 5: routing tag = 011 xor 101 = 110 = straight-cross-cross. From node 6 to node 1: routing tag = 110 xor 001 = 111 = cross-cross-cross. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 61
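A minimal Python sketch of dimension-order self-routing consistent with the examples as reconstructed above (the bit-order convention, dimension 0 first, is an assumption):

# Dimension-order (e-cube) self-routing: the routing tag is src XOR dst;
# tag bit i = 1 means "cross" in dimension i, 0 means "straight".
def route(src, dst, dims=3):
    tag = src ^ dst
    return ["cross" if (tag >> i) & 1 else "straight" for i in range(dims)]

print(route(3, 6))   # ['cross', 'straight', 'cross']  (tag = 101)
print(route(3, 5))   # ['straight', 'cross', 'cross']  (tag = 110)
print(route(6, 1))   # ['cross', 'cross', 'cross']     (tag = 111)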

62 Butterfly Is Not a Permutation Network. Fig. 14.7: Packing is a good routing problem for dimension-order routing on the hypercube (the paths for messages A, B, C, D do not conflict). Fig. 14.8: Bit-reversal permutation is a bad routing problem for dimension-order routing on the hypercube (many paths are forced through the same links). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 62

63 Structure of Butterfly Networks. Fig. 15.5: Butterfly network with permuted dimensions. Switching two row pairs converts it to the original butterfly network; changing the order of stages in a butterfly is thus equivalent to a relabeling of the rows (in this example, row xyz becomes row xzy). The figure also shows the 16-row butterfly network. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 63

64 Beneš Network. Processors on one side, memory banks on the other, connected by 2 log2 p - 1 columns of 2-by-2 switches. Fig. 15.9: Beneš network formed from two back-to-back butterflies. A 2^q-row Beneš network can route any 2^q × 2^q permutation: it is rearrangeable. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 64

65 Routing Paths in a Beneš Network. To which memory modules can we connect processor 4 without rearranging the other paths? What about processor 6? The network has 2^q rows and 2q + 1 columns, with the inputs on one side and the outputs on the other. Fig.: Another example of a Beneš network. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 65

66 Augmented Data Manipulator Network. The data manipulator network was used in the Goodyear MPP, an early SIMD parallel machine. "Augmented" means that the switches in a column are independent, as opposed to all being set to the same state (simplified control). The network has 2^q rows and q + 1 columns. Fig.: Augmented data manipulator network. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 66

67 Fat Trees. Fig. 15.6: Two representations of a fat tree connecting processors P1-P8: front view, a binary tree whose links get fatter toward the root; side view, an inverted binary tree (an ordinary tree with uniform link widths would be a "skinny tree"). Fig. 15.7: Butterfly network redrawn as a fat tree. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 67

68 The Sea of Indirect Interconnection Networks. Numerous indirect or multistage interconnection networks (MINs) have been proposed for, or used in, parallel computers. They differ in topological, performance, robustness, and realizability attributes. We have already seen the butterfly, hierarchical bus, Beneš, and ADM networks. Fig. 4.8 (modified): The sea of indirect interconnection networks. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 68

69 Self-Routing Permutation Networks. Do there exist self-routing permutation networks? (The butterfly network is self-routing, but it is not a permutation network.) Permutation routing through a MIN is the same problem as sorting. Fig.: Example of sorting on a binary radix sort network; the records are sorted by the MSB, then by the middle bit, then by the LSB as they pass through the stages. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 69

70 Partial List of Important MINs. Augmented data manipulator (ADM): aka unfolded PM2I network. Banyan: any MIN with a unique path between any input and any output (e.g., butterfly). Baseline: butterfly network with nodes labeled differently. Beneš: back-to-back butterfly networks, sharing one column. Bidelta: a MIN that is a delta network in either direction. Butterfly: aka unfolded hypercube. Data manipulator: same as ADM, but with the switches in a column restricted to the same state. Delta: any MIN for which the outputs of each switch have distinct labels (say 0 and 1 for 2-by-2 switches) and the path label, composed by concatenating the switch output labels leading from an input to an output, depends only on the output. Flip: reverse of the omega network (inputs and outputs interchanged). Indirect cube: same as butterfly or omega. Omega: multistage shuffle-exchange network; isomorphic to the butterfly. Permutation: any MIN that can realize all permutations. Rearrangeable: same as permutation network. Reverse baseline: baseline network, with the roles of inputs and outputs interchanged. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 70

71 6B.3 Cache Coherence Protocols. Fig.: Various types of cached data blocks in a parallel processor with global memory and processor caches: a block may be multiple consistent (cached by several processors with the same value), single consistent, single inconsistent (modified in one cache), or invalid. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 71

72 Example: A Bus-Based Snoopy Protocol. Each cache block is in one of three states, Invalid, Shared (read-only), or Exclusive (read/write), and each transition is labeled with the event that triggers it, followed by the action(s) that must be taken. For example: a CPU read or write hit in Exclusive stays in Exclusive; a CPU read miss in Exclusive writes back the block and puts a read miss on the bus; a CPU write miss in Exclusive writes back the block and puts a write miss on the bus; a bus read miss for a block held Exclusive forces a write-back (to Shared), while a bus write miss forces a write-back and invalidation; a CPU write hit or miss in Shared puts a write miss on the bus and moves to Exclusive; a CPU read miss in Invalid or Shared puts a read miss on the bus. Fig.: Finite-state control mechanism for a bus-based snoopy cache coherence protocol. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 72

73 Implementing a Snoopy Protocol. A second tags/state storage unit allows snooping to be done concurrently with normal cache operation; getting all the implementation timing and details right is nontrivial. The figure (from Parhami's Computer Architecture text) shows duplicate tags and state store for the snoop side, the main tags and state store for the processor side, the cache data array, and the snoop-side and processor-side cache controllers attached to the system bus. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 73

74 Scalable (Distributed) Shared Memory. Fig. 4.5: A parallel processor with distributed memory: memories and processors 0 to p-1 paired at the nodes of an interconnection network, plus parallel I/O. Some terminology: NUMA, nonuniform memory access (distributed shared memory); UMA, uniform memory access (global shared memory); COMA, cache-only memory architecture. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 74

75 Example: A Directory-Based Protocol. A directory entry for a block is in one of three states: Uncached, Shared (read-only), or Exclusive (read/write); c denotes the cache sending the message. From Uncached: read miss, return data value, sharing set = {c} (to Shared); write miss, return data value, sharing set = {c} (to Exclusive). From Shared: read miss, return data value, sharing set = sharing set + {c}; write miss, invalidate the other copies, sharing set = {c}, return data value (to Exclusive). From Exclusive: read miss, fetch data value, return data value, sharing set = sharing set + {c} (to Shared); write miss, fetch data value, request invalidation, return data value, sharing set = {c}; data write-back, sharing set = { } (to Uncached). (The slide marks one of these transitions as a correction to the text.) Fig.: States and transitions for a directory entry in a directory-based coherence protocol. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 75

76 Implementing a Directory-Based Protocol. The sharing set can be implemented as a bit-vector (simple, but not scalable). When there are many more nodes (caches) than the typical size of a sharing set, a list of sharing units may be maintained in the directory instead. The sharing set can also be maintained as a distributed doubly linked list through the caches themselves (to be discussed in connection with the SCI standard). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 76

77 6B.4 Data Allocation for Conflict-Free Access. Try to store the data such that parallel accesses are to different banks; for many data structures, a compiler may perform the memory mapping. Fig. 6.6: Matrix storage in column-major order to allow concurrent accesses to rows: each column of the 6 × 6 matrix is stored in a different memory module (bank), so accessing a row is conflict-free, but accessing a column leads to conflicts. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 77

78 Skewed Storage Format. Fig. 6.7: Skewed matrix storage for conflict-free accesses to rows and columns: each row of the 6 × 6 matrix is rotated before storage (e.g., element (i, j) goes to module (i + j) mod 6), so the elements of any row and the elements of any column both fall in distinct modules. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 78

79 A Unified Theory of Conflict-Free Access. Vector indices: A_ij is viewed as vector element i + jm. Fig. 6.8: A 6 × 6 matrix viewed, in column-major order, as a 36-element vector. A q-D array can be viewed as a vector, with row/column/diagonal accesses associated with constant strides: column: k, k+1, k+2, k+3, k+4, k+5 (stride 1); row: k, k+m, k+2m, k+3m, k+4m, k+5m (stride m); diagonal: k, k+(m+1), k+2(m+1), k+3(m+1), k+4(m+1), k+5(m+1) (stride m+1); antidiagonal: k, k+(m-1), k+2(m-1), k+3(m-1), k+4(m-1), k+5(m-1) (stride m-1). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 79

80 Linear Skewing Schemes. A_ij is viewed as vector element i + jm. Place vector element i in memory bank a + bi mod B (the word address within the bank is irrelevant to conflict-free access; also, a can be set to 0). With a linear skewing scheme, vector elements k, k + s, k + 2s, ..., k + (B - 1)s will be assigned to different memory banks iff sb is relatively prime with respect to the number B of memory banks. A prime value for B ensures this condition, but is not very practical. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 80
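A short Python check of the linear-skewing condition stated above (b = 1 and the strides chosen are illustrative assumptions):

# Elements k, k+s, ..., k+(B-1)s land in distinct banks (bank = b*i mod B)
# iff gcd(s*b, B) = 1; a prime B makes this hold for every stride not a multiple of B.
from math import gcd

def conflict_free(stride, B, b=1):
    banks = {(b * (stride * i)) % B for i in range(B)}
    return len(banks) == B

B = 7                                   # a prime number of banks
for s in (1, 6, 7, 8):                  # e.g., strides m-1, m, m+1 around m = 7
    print(s, conflict_free(s, B), gcd(s * 1, B) == 1)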

81 6B.5 Distributed Shared Memory. Fig. 4.5 (as before): A parallel processor with distributed memory: memories and processors 0 to p-1 paired at the nodes of an interconnection network, plus parallel I/O. Some terminology: NUMA, nonuniform memory access (distributed shared memory); UMA, uniform memory access (global shared memory); COMA, cache-only memory architecture. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 81

82 Butterfly-Based Distributed Shared Memory. Randomized emulation of the p-processor PRAM on a p-node butterfly: use a hash function to map memory locations to modules, so the p accessed locations map to p modules, not necessarily distinct. Each node (one of p = 2^q (q + 1) processors) consists of a router, a processor, and a memory module holding m/p memory locations; the butterfly has 2^q rows and q + 1 columns. With high probability, at most O(log p) of the p locations will be in modules located in the same row; average slowdown = O(log p). Fig.: Butterfly distributed-memory machine emulating the PRAM. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 82

83 PRAM Emulation with Butterfly MIN. Emulation of the p-processor PRAM on a (p log p)-node butterfly, with memory modules and processors connected to the two sides; O(log p) average slowdown. Here each of the p = 2^q processors sits on one side and each memory module (holding m/p memory locations) on the other, with the butterfly (2^q rows, q + 1 columns) in between. This is less efficient than the previous scheme, which uses a smaller butterfly. By using p/(log p) physical processors to emulate the p-processor PRAM, this new emulation scheme becomes quite efficient (pipeline the memory accesses of the log p virtual processors assigned to each physical processor). Fig.: Distributed-memory machine, with a butterfly multistage interconnection network, emulating the PRAM. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 83

84 Deterministic Shared-Memory Emulation. Deterministic emulation of the p-processor PRAM on a p-node butterfly (one of p = 2^q (q + 1) processors per node): store log2 m copies of each of the m memory location contents and time-stamp each updated value. A write is complete once a majority of the copies are updated; a read is satisfied when a majority of the copies are accessed and the one with the latest time stamp is used. Why it works: a few congested links won't delay the operation, and any write set and read set of copies (each a majority) must intersect in an up-to-date copy. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 84

85 PRAM Emulation Using Information Dispersal. Instead of (log m)-fold replication of data, divide each data element into k pieces and encode the pieces with a redundancy factor of about 3, so that a fraction (about k/3) of the encoded pieces suffices for reconstructing the original data. The figure shows the original data word and its k pieces, the k pieces after encoding (approximately three times larger), a possible update set and a possible read set of pieces (chosen so that they must overlap in enough up-to-date pieces), and the reconstruction algorithm recovering the original data word from the up-to-date encoded pieces. Fig.: Illustrating the information dispersal approach to PRAM emulation with lower data redundancy. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 85

86 6B.6 Methods for Memory Latency Hiding. By assumption, PRAM accesses memory locations right when they are needed, so processing must stall until data is fetched; this is not a smart strategy when the memory access time is on the order of 100 times that of an add. Method 1: predict accesses (prefetch). Method 2: pipeline multiple accesses, so that a processor's access requests and responses overlap in the processor-to-memory network. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 86

87 6C Shared-Memory Abstractions. A precise memory view is needed for correct algorithm design; sequential consistency facilitates programming; less strict consistency models offer better performance. Topics in This Chapter: 6C.1 Atomicity in Memory Access; 6C.2 Strict and Sequential Consistency; 6C.3 Processor Consistency; 6C.4 Weak or Synchronization Consistency; 6C.5 Other Memory Consistency Models; 6C.6 Transactional Memory. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 87

88 6C.1 Atomicity in Memory Access. Performance optimization and latency hiding often imply that memory accesses are interleaved and perhaps not serviced in the order issued (the figure again shows processors' access requests and responses flowing through the processor-to-memory network to the memory modules). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 88

89 Barrier Synchronization Overhead. The figure shows several processes reaching two successive barriers at different times; the waiting intervals are synchronization overhead. Given that AND is a semigroup computation, it is only a small step to generalize it to a more flexible global combine operation. Reduction of synchronization overhead: providing hardware aid to do it faster; using less frequent synchronizations. Fig.: The performance benefit of less frequent synchronization. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 89

90 Synchronization via Message Passing. Vertex v_j represents task or computation j; task interdependence is often more complicated than the simple prerequisite structure thus far considered. For a task graph: T(1) = number of nodes, T(infinity) = depth of the graph, and T(p) = latency with p processors. In the detailed view of a dependence of task B on task A (B needs a value x produced by A), the communication latency separates A's send from B's receive, and B waits: Process A: send x; Process B: receive x. Fig.: Automatic synchronization in message-passing systems. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 90

91 Synchronization with Shared Memory. Accomplished by accessing specially designated shared control variables. The fetch-and-add instruction constitutes a useful atomic operation: if the current value of x is c, fetch-and-add(x, a) returns c to the process and overwrites x = c with the value c + a; a second process executing fetch-and-add(x, b) then gets the now current value c + a and modifies it to c + a + b. Why atomicity of fetch-and-add is important: with ordinary instructions, the steps of fetch-and-add for processes A and B may be interleaved as follows. Time step 1: A reads x (A's accumulator holds c). Time step 2: B reads x (B's accumulator holds c). Time step 3: A adds a (A's accumulator holds c + a). Time step 4: B adds b (B's accumulator holds c + b). Time step 5: A stores x (x holds c + a). Time step 6: B stores x (x holds c + b, not c + a + b). Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 91
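A hedged Python sketch contrasting the interleaving problem with an atomic fetch-and-add; the lock merely stands in for the hardware atomicity of a real fetch-and-add instruction:

# Sketch: fetch-and-add as an indivisible read-modify-write.
import threading

x = 0
lock = threading.Lock()

def fetch_and_add(amount):
    global x
    with lock:                 # read, add, and store happen as one indivisible step
        old = x
        x = old + amount
    return old

threads = [threading.Thread(target=fetch_and_add, args=(1,)) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(x)   # always 100; without atomicity, interleaved updates could be lost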

92 Barrier Synchronization: Implementations. Make each processor, in a designated set, wait at a barrier until all other processors have arrived at the corresponding points in their computations. Software implementation: via fetch-and-add or a similar instruction. Hardware implementation: via an AND tree (raise flag, check AND result). A problem with the AND tree: if a processor can be randomly delayed between raising its flag and checking the tree output, some processors might cross the barrier and lower their flags before others have noticed the change in the AND-tree output. Solution: use two AND trees for alternating barrier points, with a set/reset flip-flop combining their outputs into the barrier signal. Fig.: Example of hardware aid for fast barrier synchronization [Hoar96]. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 92

93 6C.2 Strict and Sequential Consistency. Strict consistency: a read operation always returns the result of the latest write operation on that data object. Strict consistency is impossible to maintain in a distributed system, which does not have a global clock; while clocks can be synchronized, there is always some error that causes trouble in near-simultaneous operations. Example: three processes A, B, and C share several variables, each issuing its own sequence of reads (r) and writes (w) over real time; deciding which write is the "latest" for a given read is then problematic. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 93

94 Sequential Consistency. Sequential consistency (original definition): the result of any execution is the same as if the processor operations were executed in some sequential order, and the operations of a particular processor appear in that sequence in the order specified by the program it runs. The slide shows the three processes' read/write sequences from the previous example together with two possible global orderings consistent with them. Sequential consistency (new definition): write operations on the same data object are seen in exactly the same order by all system nodes. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 94

95 The Performance Penalty of Sequential Consistency. Initially X = 0 and Y = 0. Thread 1: X := 1; R1 := Y. Thread 2: Y := 1; R2 := X. Under sequential consistency, every interleaved execution performs at least one of the writes before the corresponding read, so R1 and R2 cannot both end up 0. If a compiler reorders the seemingly independent statements in Thread 1, the desired semantics (R1 and R2 not both being 0) is compromised. Relaxed consistency (memory model): ease the requirements on doing things in program order and/or write atomicity to gain performance; when maintaining order is absolutely necessary, we use synchronization primitives to enforce it. Winter 2016 Parallel Processing, Shared-Memory Parallelism Slide 95
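A small Python sketch that enumerates all program-order-respecting interleavings of the two threads above and confirms that (R1, R2) = (0, 0) never occurs under sequential consistency (the encoding of the threads is an illustrative assumption):

# Enumerate sequentially consistent interleavings of:
#   Thread 1: X := 1; R1 := Y      Thread 2: Y := 1; R2 := X
from itertools import permutations

def run(schedule):
    mem = {'X': 0, 'Y': 0}
    regs = {}
    prog = {('T1', 0): ('X', 1), ('T1', 1): ('R1', 'Y'),
            ('T2', 0): ('Y', 1), ('T2', 1): ('R2', 'X')}
    for step in schedule:
        dst, src = prog[step]
        if dst.startswith('R'):
            regs[dst] = mem[src]               # read into a register
        else:
            mem[dst] = src                     # write a constant
    return regs['R1'], regs['R2']

outcomes = set()
for order in permutations([('T1', 0), ('T1', 1), ('T2', 0), ('T2', 1)]):
    # keep only schedules that preserve each thread's program order
    if order.index(('T1', 0)) < order.index(('T1', 1)) and \
       order.index(('T2', 0)) < order.index(('T2', 1)):
        outcomes.add(run(order))
print(sorted(outcomes))    # (0, 0) is absent under sequential consistency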

Part II. Extreme Models. Fall 2008 Parallel Processing, Extreme Models Slide 1

Part II. Extreme Models. Fall 2008 Parallel Processing, Extreme Models Slide 1 Part II Extreme Models Fall 8 Parallel Processing, Extreme Models Slide About This Presentation This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms

More information

CSE Introduction to Parallel Processing. Chapter 5. PRAM and Basic Algorithms

CSE Introduction to Parallel Processing. Chapter 5. PRAM and Basic Algorithms Dr Izadi CSE-40533 Introduction to Parallel Processing Chapter 5 PRAM and Basic Algorithms Define PRAM and its various submodels Show PRAM to be a natural extension of the sequential computer (RAM) Develop

More information

Part II. Extreme Models. Spring 2006 Parallel Processing, Extreme Models Slide 1

Part II. Extreme Models. Spring 2006 Parallel Processing, Extreme Models Slide 1 Part II Extreme Models Spring 6 Parallel Processing, Extreme Models Slide About This Presentation This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:

More information

CSE Introduction to Parallel Processing. Chapter 4. Models of Parallel Processing

CSE Introduction to Parallel Processing. Chapter 4. Models of Parallel Processing Dr Izadi CSE-4533 Introduction to Parallel Processing Chapter 4 Models of Parallel Processing Elaborate on the taxonomy of parallel processing from chapter Introduce abstract models of shared and distributed

More information

Parallel Algorithms for (PRAM) Computers & Some Parallel Algorithms. Reference : Horowitz, Sahni and Rajasekaran, Computer Algorithms

Parallel Algorithms for (PRAM) Computers & Some Parallel Algorithms. Reference : Horowitz, Sahni and Rajasekaran, Computer Algorithms Parallel Algorithms for (PRAM) Computers & Some Parallel Algorithms Reference : Horowitz, Sahni and Rajasekaran, Computer Algorithms Part 2 1 3 Maximum Selection Problem : Given n numbers, x 1, x 2,, x

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #4 1/24/2018 Xuehai Qian xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Announcements PA #1

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

COMP Parallel Computing. PRAM (4) PRAM models and complexity

COMP Parallel Computing. PRAM (4) PRAM models and complexity COMP 633 - Parallel Computing Lecture 5 September 4, 2018 PRAM models and complexity Reading for Thursday Memory hierarchy and cache-based systems Topics Comparison of PRAM models relative performance

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

CS256 Applied Theory of Computation

CS256 Applied Theory of Computation CS256 Applied Theory of Computation Parallel Computation IV John E Savage Overview PRAM Work-time framework for parallel algorithms Prefix computations Finding roots of trees in a forest Parallel merging

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

The PRAM model. A. V. Gerbessiotis CIS 485/Spring 1999 Handout 2 Week 2

The PRAM model. A. V. Gerbessiotis CIS 485/Spring 1999 Handout 2 Week 2 The PRAM model A. V. Gerbessiotis CIS 485/Spring 1999 Handout 2 Week 2 Introduction The Parallel Random Access Machine (PRAM) is one of the simplest ways to model a parallel computer. A PRAM consists of

More information

Interconnection networks

Interconnection networks Interconnection networks When more than one processor needs to access a memory structure, interconnection networks are needed to route data from processors to memories (concurrent access to a shared memory

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Lecture 3: Sorting 1

Lecture 3: Sorting 1 Lecture 3: Sorting 1 Sorting Arranging an unordered collection of elements into monotonically increasing (or decreasing) order. S = a sequence of n elements in arbitrary order After sorting:

More information

PIPELINE AND VECTOR PROCESSING

PIPELINE AND VECTOR PROCESSING PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates

More information

Fundamental Algorithms

Fundamental Algorithms Fundamental Algorithms Chapter 6: Parallel Algorithms The PRAM Model Jan Křetínský Winter 2017/18 Chapter 6: Parallel Algorithms The PRAM Model, Winter 2017/18 1 Example: Parallel Sorting Definition Sorting

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

F. THOMSON LEIGHTON INTRODUCTION TO PARALLEL ALGORITHMS AND ARCHITECTURES: ARRAYS TREES HYPERCUBES

F. THOMSON LEIGHTON INTRODUCTION TO PARALLEL ALGORITHMS AND ARCHITECTURES: ARRAYS TREES HYPERCUBES F. THOMSON LEIGHTON INTRODUCTION TO PARALLEL ALGORITHMS AND ARCHITECTURES: ARRAYS TREES HYPERCUBES MORGAN KAUFMANN PUBLISHERS SAN MATEO, CALIFORNIA Contents Preface Organization of the Material Teaching

More information

EE/CSCI 451 Midterm 1

EE/CSCI 451 Midterm 1 EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming

More information

SHARED MEMORY VS DISTRIBUTED MEMORY

SHARED MEMORY VS DISTRIBUTED MEMORY OVERVIEW Important Processor Organizations 3 SHARED MEMORY VS DISTRIBUTED MEMORY Classical parallel algorithms were discussed using the shared memory paradigm. In shared memory parallel platform processors

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

CS 1013 Advance Computer Architecture UNIT I

CS 1013 Advance Computer Architecture UNIT I CS 1013 Advance Computer Architecture UNIT I 1. What are embedded computers? List their characteristics. Embedded computers are computers that are lodged into other devices where the presence of the computer

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Part V. Some Broad Topic. Winter 2016 Parallel Processing, Some Broad Topics Slide 1

Part V. Some Broad Topic. Winter 2016 Parallel Processing, Some Broad Topics Slide 1 Part V Some Broad Topic Winter 06 Parallel Processing, Some Broad Topics Slide About This Presentation This presentation is intended to support the use of the textbook Introduction to Parallel Processing:

More information

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy

More information

Lecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC

Lecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC Lecture 9: Group Communication Operations Shantanu Dutt ECE Dept. UIC Acknowledgement Adapted from Chapter 4 slides of the text, by A. Grama w/ a few changes, augmentations and corrections Topic Overview

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Part VII Advanced Architectures. Feb Computer Architecture, Advanced Architectures Slide 1

Part VII Advanced Architectures. Feb Computer Architecture, Advanced Architectures Slide 1 Part VII Advanced Architectures Feb. 2011 Computer Architecture, Advanced Architectures Slide 1 About This Presentation This presentation is intended to support the use of the textbook Computer Architecture:

More information

Chapter 9 Multiprocessors

Chapter 9 Multiprocessors ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer Module 2: Divide and Conquer Divide and Conquer Control Abstraction for Divide &Conquer 1 Recurrence equation for Divide and Conquer: If the size of problem p is n and the sizes of the k sub problems are

More information

Fundamentals of. Parallel Computing. Sanjay Razdan. Alpha Science International Ltd. Oxford, U.K.

Fundamentals of. Parallel Computing. Sanjay Razdan. Alpha Science International Ltd. Oxford, U.K. Fundamentals of Parallel Computing Sanjay Razdan Alpha Science International Ltd. Oxford, U.K. CONTENTS Preface Acknowledgements vii ix 1. Introduction to Parallel Computing 1.1-1.37 1.1 Parallel Computing

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

Multiprocessor Interconnection Networks

Multiprocessor Interconnection Networks Multiprocessor Interconnection Networks Todd C. Mowry CS 740 November 19, 1998 Topics Network design space Contention Active messages Networks Design Options: Topology Routing Direct vs. Indirect Physical

More information

Main Points of the Computer Organization and System Software Module

Main Points of the Computer Organization and System Software Module Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Multiprocessors - Flynn s Taxonomy (1966)

Multiprocessors - Flynn s Taxonomy (1966) Multiprocessors - Flynn s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) Conventional uniprocessor Although ILP is exploited Single Program Counter -> Single Instruction stream The

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Parallel Random-Access Machines

Parallel Random-Access Machines Parallel Random-Access Machines Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS3101 (Moreno Maza) Parallel Random-Access Machines CS3101 1 / 69 Plan 1 The PRAM Model 2 Performance

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols Portland State University ECE 588/688 Directory-Based Cache Coherence Protocols Copyright by Alaa Alameldeen and Haitham Akkary 2018 Why Directory Protocols? Snooping-based protocols may not scale All

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Aleksandar Milenkovich 1

Aleksandar Milenkovich 1 Parallel Computers Lecture 8: Multiprocessors Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

Algorithms and Applications

Algorithms and Applications Algorithms and Applications 1 Areas done in textbook: Sorting Algorithms Numerical Algorithms Image Processing Searching and Optimization 2 Chapter 10 Sorting Algorithms - rearranging a list of numbers

More information

Memory Hierarchy Basics

Memory Hierarchy Basics Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!

More information

Advanced Parallel Architecture. Annalisa Massini /2017

Advanced Parallel Architecture. Annalisa Massini /2017 Advanced Parallel Architecture Annalisa Massini - 2016/2017 References Advanced Computer Architecture and Parallel Processing H. El-Rewini, M. Abd-El-Barr, John Wiley and Sons, 2005 Parallel computing

More information

Parallel Programming Platforms

Parallel Programming Platforms arallel rogramming latforms Ananth Grama Computing Research Institute and Department of Computer Sciences, urdue University ayg@cspurdueedu http://wwwcspurdueedu/people/ayg Reference: Introduction to arallel

More information

PARALLEL COMPUTER ARCHITECTURES

PARALLEL COMPUTER ARCHITECTURES 8 ARALLEL COMUTER ARCHITECTURES 1 CU Shared memory (a) (b) Figure 8-1. (a) A multiprocessor with 16 CUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors

More information

Hypercubes. (Chapter Nine)

Hypercubes. (Chapter Nine) Hypercubes (Chapter Nine) Mesh Shortcomings: Due to its simplicity and regular structure, the mesh is attractive, both theoretically and practically. A problem with the mesh is that movement of data is

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Sorting (Chapter 9) Alexandre David B2-206

Sorting (Chapter 9) Alexandre David B2-206 Sorting (Chapter 9) Alexandre David B2-206 1 Sorting Problem Arrange an unordered collection of elements into monotonically increasing (or decreasing) order. Let S = . Sort S into S =

More information

Parallel Architectures

Parallel Architectures Parallel Architectures Instructor: Tsung-Che Chiang tcchiang@ieee.org Department of Science and Information Engineering National Taiwan Normal University Introduction In the roughly three decades between

More information

Advanced cache optimizations. ECE 154B Dmitri Strukov

Advanced cache optimizations. ECE 154B Dmitri Strukov Advanced cache optimizations ECE 154B Dmitri Strukov Advanced Cache Optimization 1) Way prediction 2) Victim cache 3) Critical word first and early restart 4) Merging write buffer 5) Nonblocking cache

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will

More information

Parallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. Fall Parallel Sorting

Parallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. Fall Parallel Sorting Parallel Systems Course: Chapter VIII Sorting Algorithms Kumar Chapter 9 Jan Lemeire ETRO Dept. Fall 2017 Overview 1. Parallel sort distributed memory 2. Parallel sort shared memory 3. Sorting Networks

More information

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model. Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

The PRAM Model. Alexandre David

The PRAM Model. Alexandre David The PRAM Model Alexandre David 1.2.05 1 Outline Introduction to Parallel Algorithms (Sven Skyum) PRAM model Optimality Examples 11-02-2008 Alexandre David, MVP'08 2 2 Standard RAM Model Standard Random

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Design of Parallel Algorithms. The Architecture of a Parallel Computer

Design of Parallel Algorithms. The Architecture of a Parallel Computer + Design of Parallel Algorithms The Architecture of a Parallel Computer + Trends in Microprocessor Architectures n Microprocessor clock speeds are no longer increasing and have reached a limit of 3-4 Ghz

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information