Parallel Architectures
Parallel Architectures
Instructor: Tsung-Che Chiang, Department of Computer Science and Information Engineering, National Taiwan Normal University

Introduction
In the roughly three decades between the mid-1960s and the mid-1990s, scientists and engineers explored a wide variety of parallel computer architectures. Experts passionately debated whether the dominant parallel computer systems would contain at most a few dozen high-performance processors or thousands of less-powerful processors. Today, most contemporary parallel computers are constructed out of commodity CPUs.
Outline
Interconnection Networks; Processor Arrays; Multiprocessors; Multicomputers; Flynn's Taxonomy; Summary

Interconnection Networks: Shared Medium
A shared medium allows only one message at a time. Each processor listens to every message and receives the ones for which it is the destination. Ethernet is a well-known example. Message collisions can significantly degrade the performance of a heavily utilized shared medium.
Interconnection Networks: Switched Medium
A switched medium supports point-to-point messages among pairs of processors; each processor has its own communication path to the switch. It has two advantages over a shared medium: support of concurrent transmission and support of network scaling.

Switch Network Topologies
A switch network can be represented by a graph. Nodes are processors and switches: each processor is connected to one switch, and switches connect processors and/or other switches. Edges are communication paths. A topology is direct when the ratio of switch nodes to processor nodes is 1:1, and indirect when that ratio is greater than 1:1.
Switch Network Topologies: Evaluation Criteria
- Diameter: the largest distance between two switch nodes.
- Bisection width: the minimum number of edges between switch nodes that must be removed to divide the network into two halves.
- Edges per switch node: it is best if this value is a constant independent of the network size (better scalability).
- Constant edge length: it is best if the nodes and edges of the network can be laid out in 3-D space so that the maximum edge length is a constant independent of the network size.

In the following, we discuss six switch network topologies: 2-D mesh, binary tree, hypertree, butterfly, hypercube, and shuffle-exchange.
2-D Mesh Network
Properties (direct topology). Assuming n switch nodes and no wraparound connections:
- minimum diameter: 2(n^(1/2) - 1)
- maximum bisection width: n^(1/2)
- edges/node: 4
- constant edge length

Binary Tree Network
Properties (indirect topology). Assuming n = 2^d processors (with 2n - 1 switches):
- diameter: 2 log n
- bisection width: 1
- edges/node: 3
- non-constant edge length
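The formulas above can be checked with a small sketch. This is Python of my own devising (the slides give only the formulas; the function names and dictionary layout are assumptions), computing the properties for a mesh with no wraparound and for a binary tree of n = 2^d processors:

```python
import math

def mesh_properties(n):
    """2-D mesh with n switch nodes (n a perfect square), no wraparound."""
    side = int(math.isqrt(n))
    assert side * side == n
    return {"diameter": 2 * (side - 1),   # corner to opposite corner
            "bisection_width": side,      # cut one row/column of links
            "edges_per_node": 4}          # constant, independent of n

def binary_tree_properties(n):
    """Binary tree connecting n = 2^d processors through 2n - 1 switches."""
    d = n.bit_length() - 1
    assert 1 << d == n
    return {"switches": 2 * n - 1,
            "diameter": 2 * d,            # up to the root and back down
            "bisection_width": 1}         # removing a root edge splits it

print(mesh_properties(16))        # diameter 6, bisection width 4
print(binary_tree_properties(8))  # 15 switches, diameter 6
```

The contrast the slides draw is visible in the numbers: the mesh's diameter grows as the square root of n while the tree's grows logarithmically, but the tree's bisection width stays stuck at 1.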
Hypertree Network (1/3)
A hypertree shares the low diameter of a binary tree but has an improved bisection width. For a hypertree of degree k and depth d: from the front, it looks like a complete k-ary tree; from the side, it looks like an upside-down binary tree. (Figure: front view and side view of a hypertree with k = 4, d = 2.)

Hypertree Network (2/3)
(Figure: the complete hypertree obtained by combining the two views.)
Hypertree Network (3/3)
Properties (indirect topology). Assuming k = 4, n = 4^d processors, and 2^d(2^(d+1) - 1) switches:
- diameter: 2d (i.e., log n)
- bisection width: 2^(d+1)
- edges/node: no more than 6
- non-constant edge length

Butterfly Network (1/6)
(Figure: a butterfly network for n = 2^d processors, shown with d = 3; switch nodes are labeled (rank, column) from rank 0 through rank d.)
Butterfly Network (2/6)
(Figure: switch node (i, j) on rank i is connected to node (i - 1, j) on the rank above it.)

Butterfly Network (3/6)
(Figure: node (i, j) is also connected to node (i - 1, m), where m is obtained by inverting the i-th most significant bit in the binary representation of j.)
Butterfly Network (4/6)
Where the "butterfly" is: as the rank number decreases, the widths of the wings of the butterflies increase exponentially (hence the non-constant edge length).

Butterfly Network (5/6)
Message routing: each switch node picks off the lead bit of the destination address from the message header; 0 means route left, 1 means route right.
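The routing rule above can be sketched in Python. This is a minimal sketch under the (rank, column) labeling of the earlier figures; the function name and the path representation are my own. At rank i the switch consumes the i-th most significant bit of the destination column: it goes straight when the current column already agrees with the destination in that bit, and takes the cross edge (which inverts that bit) otherwise.

```python
def butterfly_route(src_col, dst_col, d):
    """Path from rank 0 to rank d through a butterfly with 2^d columns."""
    path = [(0, src_col)]
    col = src_col
    for i in range(d):                 # descend from rank 0 to rank d
        bit = d - 1 - i                # i-th most significant bit position
        if ((col >> bit) & 1) != ((dst_col >> bit) & 1):
            col ^= 1 << bit            # cross edge: invert that bit
        path.append((i + 1, col))
    return path

# 8-column butterfly (d = 3): route from column 2 to column 5
print(butterfly_route(2, 5, 3))   # [(0, 2), (1, 6), (2, 4), (3, 5)]
```

After exactly d = log n hops the column equals the destination, which is the diameter claim made for the butterfly on the next slide.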
Butterfly Network (6/6)
Properties (indirect topology). Assuming n = 2^d processors and n(log n + 1) switches, with the switch nodes on ranks 0 and log n treated as the same:
- diameter: log n
- bisection width: n/2
- edges/node: 4
- non-constant edge length

Hypercube Network (1/4)
A hypercube network, also called a binary n-cube, is a butterfly in which each column of switch nodes is collapsed into a single node. (Figure: a 3-D hypercube obtained by collapsing the columns of a butterfly.)
Hypercube Network (2/4)
The processors and their associated switches are labeled 0, 1, ..., 2^d - 1; two switches are adjacent if their binary labels differ in exactly one bit position. (Figure: hypercubes for small values of d.)

Hypercube Network (3/4)
Properties (direct topology). Assuming n = 2^d processors:
- diameter: log n
- bisection width: n/2
- edges/node: log n
- non-constant edge length
Hypercube Network (4/4)
Message routing: note that edges always connect switches whose addresses differ in exactly one bit position. Example: send a message from 0101 to 0011 (the addresses differ in two bit positions, so there are two shortest paths). Path 1: 0101 -> 0111 -> 0011. Path 2: 0101 -> 0001 -> 0011.

Shuffle-Exchange Network (1/5)
Perfect shuffle: like taking a sorted deck of cards, dividing it exactly in half, and shuffling the two halves together perfectly.
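The hypercube routing example above follows directly from the adjacency rule: XOR the source and destination and flip the differing bits one at a time, in any order. A minimal Python sketch (the function name and the lowest-bit-first ordering are my own choices; any flip order gives a shortest path, which is why 0101 -> 0011 has more than one minimal route):

```python
def hypercube_route(src, dst, d):
    """One shortest path in a d-cube: flip the differing bits, lowest first."""
    path = [src]
    node = src
    diff = src ^ dst              # 1-bits mark the positions to flip
    for bit in range(d):
        if (diff >> bit) & 1:
            node ^= 1 << bit      # cross the edge that fixes this bit
            path.append(node)
    return path

path = hypercube_route(0b0101, 0b0011, 4)
print([format(x, "04b") for x in path])   # ['0101', '0111', '0011']
```

The path length equals the number of differing bits, which is at most d = log n, matching the diameter from the previous slide.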
Shuffle-Exchange Network (2/5)
Perfect shuffle: the new position of an element can be calculated by performing a left cyclic rotation of the binary representation of its index.

Shuffle-Exchange Network (3/5)
Connections:
- exchange: links switches whose numbers differ in their least significant bit
- shuffle: links switches i and j, where j is the result of cycling the bits of i left one position
Shuffle-Exchange Network (4/5)
Properties (direct topology). Assuming n = 2^d processors:
- diameter: 2 log n - 1
- bisection width: n/log n
- edges/node: 2
- non-constant edge length

Shuffle-Exchange Network (5/5)
Message routing: the worst-case scenario is routing a message from switch 0 to switch n - 1 (or vice versa). From 0000 to 1111: E S E S E S E. From 0011 to 0101: E S E S S.
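The two link types can be written down in a few lines of Python (a sketch with names of my own; the bit width d is passed explicitly because the rotation depends on it), and the worst-case route quoted above can be replayed step by step:

```python
def shuffle(i, d):
    """Perfect shuffle: left cyclic rotation of i's d-bit representation."""
    mask = (1 << d) - 1
    return ((i << 1) | (i >> (d - 1))) & mask

def exchange(i):
    """Exchange link: toggle the least significant bit."""
    return i ^ 1

# Worst case on n = 16 switches (d = 4): 0000 -> 1111 takes the
# alternating sequence E S E S E S E, i.e. 2 log n - 1 = 7 hops.
node = 0b0000
for op in "ESESESE":
    node = exchange(node) if op == "E" else shuffle(node, 4)
print(format(node, "04b"))   # 1111
```

Each exchange sets the low bit to the next destination bit and each shuffle rotates it out of the way, so d exchanges and d - 1 shuffles suffice, which is where the 2 log n - 1 diameter comes from.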
Interconnection Networks
No network can be optimal in every regard.

Topology          #Processors  #Switches      Diameter       Bisection width  Edges/node  Constant edge len.
2-D mesh          n = d^2      n              2(n^(1/2)-1)   n^(1/2)          4           Yes
Binary tree       n = 2^d      2n - 1         2 log n        1                3           No
4-ary hypertree   n = 4^d      2n - n^(1/2)   log n          2 n^(1/2)        6           No
Butterfly         n = 2^d      n(log n + 1)   log n          n/2              4           No
Hypercube         n = 2^d      n              log n          n/2              log n       No
Shuffle-exchange  n = 2^d      n              2 log n - 1    n/log n          2           No

Processor Arrays (1/11)
Vector computer: a computer whose instruction set includes operations on vectors as well as scalars. There are two general ways of implementation:
- Pipelined vector processor: it streams vectors from memory to the CPU, where pipelined arithmetic units manipulate them. Early supercomputers (such as the Cray-1) are well-known examples.
- Processor array: a set of identical, synchronized processing elements capable of simultaneously performing the same operation on different data. Motivation: the high price of a control unit, and data parallelism.
Processor Arrays (2/11)
Architecture. (Figure: a front-end computer, with its memory and I/O processors, connects over a scalar memory bus, an instruction broadcast bus, and a global result bus to the processor array; each processing element has its own memory, and the elements communicate through an interconnection network attached to parallel I/O devices.)

Processor Arrays (3/11)
Performance: the amount of work accomplished per time unit; it depends on the utilization of the processors.
Example 2.1: a processor array with 1024 processors, each adding two integers in 1 microsecond. When adding two integer vectors of length 1024, assuming each vector is allocated to the processors in a balanced fashion:
Performance = 1024 operations / 1 microsecond = 1.024 x 10^9 operations/second
Processor Arrays (4/11)
Example 2.2: a processor array with 512 processors, each adding two integers in 1 microsecond. When adding two integer vectors of length 600, assuming each vector is allocated to the processors in a balanced fashion:
Performance = 600 operations / 2 microseconds = 3 x 10^8 operations/second
88 processors add 2 pairs of integers; the others add only one pair and sit idle while those 88 processors add their second integer pair.

Processor Arrays (5/11)
Interconnection network: it is used to bring together operands stored in the memories of different processors. The most popular interconnection network for processor arrays is the 2-D mesh. It has the advantage of a relatively straightforward implementation in VLSI, where a single chip may contain a large number of processors. (Figure: 4x4 and 8x12 mesh-connected processor arrays.)
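Examples 2.1 and 2.2 follow one formula: the whole array advances in lock step, so the elapsed time is the number of passes, ceil(length / processors), times the per-addition time. A small Python sketch (function name mine):

```python
import math

def array_performance(vector_len, num_procs, add_time_us):
    """Operations per second for a lock-step processor array."""
    passes = math.ceil(vector_len / num_procs)   # every pass costs full time
    return vector_len / (passes * add_time_us * 1e-6)

# Example 2.1: 1024 processors, vectors of length 1024, 1 us per add
print(array_performance(1024, 1024, 1.0))   # about 1.024e9 ops/sec

# Example 2.2: 512 processors, vectors of length 600 -> two passes,
# with only 600 - 512 = 88 processors busy in the second pass
print(array_performance(600, 512, 1.0))     # about 3.0e8 ops/sec
```

Note the penalty in Example 2.2: halving the processor count did not halve performance, because the second, mostly idle pass costs as much as the first.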
Processor Arrays (6/11)
Enabling and disabling processors: it is possible for only a subset of the processors to perform an instruction, by masking. This is useful when the number of data items is not an exact multiple of the size of the processor array, and for supporting conditionally executed parallel operations.

Processor Arrays (7/11)
Enabling and disabling processors, example (Fig. 2.12):
if (a[i] != 0) a[i] = 1; else a[i] = -1;
(Figure: shading indicates the processors that are masked out, i.e., inactive, during each branch.)
Processor Arrays (8/11)
Enabling and disabling processors: the efficiency of the processor array can drop rapidly when programs enter conditionally executed code. There is the additional overhead of performing the tests to set the mask bits, and there is the inefficiency caused by having to work through the different branches of control structures sequentially. In the previous example, the performance is less than 50% of the performance obtained when operating across the entire processor array, once the additional overhead is considered.

Processor Arrays (9/11)
Additional architecture features. (Figure: the processor-array architecture again, highlighting the scalar memory bus and the global result bus between the front-end computer and the array.)
Processor Arrays (10/11)
Memory bus: it allows particular elements of parallel variables to be used or defined in sequential code. In this way, the processor array can be viewed as an extension of the memory space of the front-end.
Global result bus: it enables values from the processor array to be combined and returned to the front end. The ability to compute a global reduction (such as a global and) is valuable.

Processor Arrays (11/11)
Shortcomings:
1. Not all problems map well into a strict data-parallel solution.
2. The efficiency drops when entering conditionally executed parallel code.
3. They do not easily accommodate multiple users.
4. They do not scale down well, due to the cost of high-bandwidth communication networks.
5. They are built using custom VLSI, thus losing the cost-effectiveness of commodity CPUs.
6. The original motivation, the relatively high cost of control units, is no longer valid for today's CPUs.
Processor arrays are no longer considered a viable option for general-purpose parallel computers.
Multiprocessors
A multiprocessor is a multiple-CPU computer with shared memory: the same address on two different CPUs refers to the same memory location. Compared with processor arrays, multiprocessors can be built out of commodity CPUs, they naturally support multiple users, and they do not lose efficiency when encountering conditionally executed parallel code.

We discuss two fundamental types of multiprocessors:
- centralized multiprocessors, in which all the primary memory is in one place
- distributed multiprocessors, in which the primary memory is distributed among the processors
Centralized Multiprocessors (1/5)
A centralized multiprocessor is a straightforward extension of the uniprocessor. It is also called a uniform memory access (UMA) multiprocessor or a symmetric multiprocessor (SMP). The presence of large and efficient caches is what makes such multiprocessors practical. Still, memory bus bandwidth typically limits the number of processors that can be employed to a few dozen. (Figure: several CPUs, each with its own cache, share a bus to the primary memory and I/O devices.)

Centralized Multiprocessors (2/5)
Data may be private (used only by a single processor) or shared (used by multiple processors). Designers of centralized multiprocessors must address two problems associated with shared data: the cache coherence problem and processor synchronization.
Centralized Multiprocessors (3/5)
The cache coherence problem. (Figure: CPUs A and B both cache the value X = 7; after one CPU updates X to 2, the other CPU's cached copy is stale.)

Centralized Multiprocessors (4/5)
Cache coherence problem: snooping protocols are typically used to maintain cache coherence on centralized multiprocessors. Each CPU's cache controller monitors the bus to identify which cache blocks are being requested by other CPUs. Before a write occurs, all copies of the data item cached by other processors are invalidated. If two processors simultaneously try to write to the same memory location, only one of them wins the race.
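The write-invalidate behavior described above can be mimicked with a toy simulation. This is a deliberately simplified sketch of my own (dictionaries as caches, write-through to memory, no block granularity or real bus arbitration), just to show the invalidation step:

```python
class Bus:
    """Toy write-invalidate protocol: every cache 'snoops' bus writes."""
    def __init__(self, memory, n_cpus):
        self.memory = memory
        self.caches = [dict() for _ in range(n_cpus)]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                 # miss: fetch from memory
            cache[addr] = self.memory[addr]
        return cache[addr]

    def write(self, cpu, addr, value):
        for i, cache in enumerate(self.caches):
            if i != cpu:
                cache.pop(addr, None)         # snoop: invalidate other copies
        self.caches[cpu][addr] = value
        self.memory[addr] = value             # write-through, for simplicity

bus = Bus({"X": 7}, n_cpus=2)
print(bus.read(0, "X"), bus.read(1, "X"))     # both CPUs cache X = 7
bus.write(1, "X", 2)                          # CPU 1 writes; CPU 0's copy goes
print("X" in bus.caches[0], bus.read(0, "X"))
```

This reproduces the X = 7 / X = 2 scenario from the figure: after CPU 1's write, CPU 0's stale copy is gone and its next read fetches the new value.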
Centralized Multiprocessors (5/5)
Processor synchronization:
- mutual exclusion: a situation in which at most one process can be engaged in a specified activity at any time
- barrier synchronization: guarantees that no process will proceed beyond a designated point in the program until every process has reached the barrier

Distributed Multiprocessors (1/10)
Architecture. (Figure: in contrast to the centralized design, each CPU-and-cache pair now has its own local memory and I/O devices, and the nodes are joined by an interconnection network.)
Distributed Multiprocessors (2/10)
Rationale and advantages: because of spatial and temporal locality, most memory references are between a processor and its local memory. This gives higher aggregate memory bandwidth and lower memory access time, which in turn allows a higher processor count. Distributing I/O, too, can improve scalability.

Distributed Multiprocessors (3/10)
The same address on different processors refers to the same memory location, but memory access time varies considerably, depending upon whether the address being referenced is in that processor's local memory. Thus, such a machine is also called a nonuniform memory access (NUMA) multiprocessor.
Distributed Multiprocessors (4/10)
Cache coherence.
Alternative 1: store only instructions and private data in a processor's cache. This gives poor performance, due to the huge time difference between a local cache access and a nonlocal memory access.
Alternative 2: snooping. Snooping methods do not scale well as the number of processors grows, because a cache controller cannot simply snoop on a single shared memory bus, and a more complicated protocol is needed.

Distributed Multiprocessors (5/10)
Cache coherence, alternative 3: a directory-based protocol, in which a single directory contains sharing information about every memory block that may be cached. The status of a memory block is one of:
- uncached: not currently in any processor's cache
- shared: cached by one or more processors, and the copy in memory is correct
- exclusive: cached by exactly one processor that has written the block, so that the copy in memory is obsolete
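The three block states can be captured in a minimal state machine. This Python sketch is my own simplification (one directory entry, no actual data movement or write-backs modeled, only the state and sharer bookkeeping the slides describe):

```python
class DirectoryEntry:
    """One memory block: state 'U' (uncached), 'S' (shared), 'E' (exclusive),
    plus the set of CPUs holding a copy, so writes can invalidate them."""
    def __init__(self):
        self.state, self.sharers = "U", set()

    def read(self, cpu):
        if self.state == "E":
            self.state = "S"      # owner's dirty copy is written back first
        elif self.state == "U":
            self.state = "S"
        self.sharers.add(cpu)

    def write(self, cpu):
        stale = self.sharers - {cpu}          # copies that must be invalidated
        self.state, self.sharers = "E", {cpu}
        return stale

entry = DirectoryEntry()
entry.read(0); entry.read(2)
print(entry.state, sorted(entry.sharers))   # shared by CPUs 0 and 2
print(sorted(entry.write(2)))               # CPU 0's copy gets invalidated
print(entry.state, entry.sharers)           # exclusive to CPU 2
```

Distributing such entries among the local memories, with each block's entry living in exactly one place, is what the next slide argues keeps the directory from becoming a bottleneck.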
Distributed Multiprocessors (6/10)
In addition to the block status, we also need to keep track of which processors have copies of each cache block, so that these copies can be invalidated when one processor writes a value to that block. To prevent accesses to the cache directory from becoming a performance bottleneck, the directory itself should be distributed among the computer's local memories; the information about a particular memory block is in exactly one location.

Distributed Multiprocessors (7/10)
(Figure: a walkthrough of the directory protocol for block X. Reads move the entry from uncached, U000, to shared, S100 and then S101, as CPUs fetch copies of X = 7; a write then invalidates the other cached copies and moves the entry to exclusive, E100, leaving the memory copy of X out of date.)
Distributed Multiprocessors (8/10)
(Figure, continued: a read of an exclusive block forces the dirty value back to memory and moves the entry from E100 to shared, S110; a subsequent write by another CPU transfers exclusive ownership, e.g. from E100 to E001.)

Distributed Multiprocessors (9/10)
(Figure, continued: when the owning CPU evicts the block, the dirty value is flushed back to memory and the entry returns from E100 to uncached, U000.)
Multicomputers
A multicomputer is another example of a distributed-memory, multiple-CPU computer. Unlike a NUMA multiprocessor, a multicomputer has disjoint local address spaces: the same address on different processors refers to different physical memory locations, and each processor has direct access only to its local memory. Processors interact with each other by passing messages, so there is no cache coherence problem.

Commercial multicomputers vs. commodity multicomputers: custom vs. mass-produced computers; low-latency vs. high-latency interconnects; expensive vs. cheap.
Multicomputer Designs
(Figure: in an asymmetrical multicomputer, users reach a front-end computer over the Internet, and the front-end drives the back-end nodes through an interconnection network; in a symmetrical multicomputer, every node sits directly on the interconnection network alongside a file server.)

Asymmetrical Multicomputers: Advantages
Back-end processors are used exclusively for executing parallel programs, and they may run a primitive OS; it is easier for the manufacturer to develop such a primitive OS. Without other processes occupying cycles or sending messages, it is also easier to understand, model, and tune the performance of a parallel application.
Asymmetrical Multicomputers: Disadvantages
Users log into the front-end computer, which executes a full, multiprogrammed OS and provides all functions needed for program development. This makes the front-end a single point of failure and limits scalability. Multiple front-end computers raise their own questions: How do users know which front-end computer to log into? How will the workload be balanced? How are back-end nodes assigned to front-end processors? Should the front end itself become a centralized multicomputer? Underutilization might be frustrating.

Asymmetrical Multicomputers: Disadvantages (continued)
Program debugging: because the back-end nodes do not support I/O operations, they must send a message to the front-end computer to print output for users. There is also the requirement of developing two distinct programs: a front-end program for interacting with users and the file system, transmitting data to the back-end processors, and forwarding results to the outside world; and a back-end program for the computationally intensive portion.
Symmetrical Multicomputers
The difficulty of debugging parallel programs is a strong incentive to provide full-featured I/O facilities on the back-end nodes. A straightforward way is to run a multiprogrammed OS on the back-end processors, too. In a symmetrical multicomputer, every computer executes the same OS and has identical functionality. Users may log into any computer to edit and compile their programs, and any or all of the computers may be involved in the execution of a particular parallel program.

Symmetrical Multicomputers: Advantages over Asymmetrical Multicomputers
They alleviate the performance bottleneck caused by the front-end computer. Support for debugging is better, since every computer runs a full-fledged OS. They also eliminate the front-end/back-end programming problem: every processor executes the same program, and an if statement can be used to select the subset of processors that performs each part.
Symmetrical Multicomputers: Disadvantages
It is more difficult to maintain the illusion of a single parallel computer. There is no simple way to balance the program-development workload among all the processors. It is also more difficult to achieve high performance from parallel programs when their processes must compete with other processes for cycles, cache space, and memory bandwidth.

A Mixed Model
The ParPar cluster at the Hebrew University of Jerusalem. (Figure: users reach a front-end computer over the Internet; the front-end, a file server, and the multicomputer nodes share switched Ethernet, while the nodes are also connected by a Myrinet switch for parallel computation.)
Commodity Clusters vs. Networks of Workstations
A network of workstations is a dispersed collection of computers, typically located on users' desks, each serving the needs of the person using it. Individual workstations may have different OSes and executable programs. A commodity cluster, by contrast, is a co-located collection of mass-produced computers. The computers are usually accessible only via the network, and some of them may not allow users to log in. The networking medium should have high speed:

Network            Latency    Bandwidth       Cost/node
Fast Ethernet      100 usec   100 Mbit/sec    < $100
Gigabit Ethernet   100 usec   1,000 Mbit/sec  < $1,000
Myrinet            7 usec     1,920 Mbit/sec  < $2,000

Flynn's Taxonomy
Flynn classifies architectures by the number of instruction streams and data streams:
- SISD (single instruction stream, single data stream): uniprocessors
- SIMD (single instruction stream, multiple data streams): processor arrays, pipelined vector processors
- MISD (multiple instruction streams, single data stream): systolic arrays
- MIMD (multiple instruction streams, multiple data streams): multiprocessors, multicomputers
Flynn's Taxonomy
A systolic array is an example of an MISD computer. (Figure: a primitive sorting element with inputs a, b, and c emits min(a, b, c) in the first phase and med(a, b, c) and max(a, b, c) in the second phase.)

Systolic Array
(Figure: the host inserts the values 7 and 4 into the array; each cell keeps the smaller value and passes the larger one deeper into the array.)
Systolic Array
(Figure: extracting the minimum. The host removes the smallest value from the first cell, and the remaining values, here 4 and 8, shift one cell toward the host.)

Systolic Array
(Figure: a second extract-minimum step, with 5 and 7 shifting toward the host.)
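The net effect of the insert and extract-minimum figures can be sketched in Python. This is my own collapsed model: the hardware ripples values between neighboring cells over several clock phases, but the end result is that each cell keeps the smaller of its value and the incoming one and passes the larger along, so the chain stays sorted with the minimum nearest the host.

```python
def insert(cells, v):
    """Host pushes v in; each cell keeps the smaller value, forwards the larger."""
    for i in range(len(cells)):
        if v < cells[i]:
            cells[i], v = v, cells[i]   # v ripples past this cell
    cells.append(v)                     # the largest ends up deepest

def extract_min(cells):
    """Host reads the first cell; remaining values shift toward the host."""
    return cells.pop(0)

cells = []
for v in [7, 4, 8, 5]:                  # the insertions from the figures
    insert(cells, v)
print([extract_min(cells) for _ in range(4)])   # [4, 5, 7, 8]
```

The point of the MISD classification is that every value streams through the same chain of cells while each cell applies its own fixed comparison, so sorting n values costs O(n) insert and extract steps on the host side.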
Summary: Processor Arrays
1. Not all problems map well into a strict data-parallel solution.
2. The efficiency drops when entering conditionally executed parallel code.
3. They do not easily accommodate multiple users.
4. They do not scale down well, due to the cost of high-bandwidth communication networks.
5. They are built using custom VLSI, thus losing the cost-effectiveness of commodity CPUs.
6. The original motivation, the relatively high cost of control units, is no longer valid for today's CPUs.
Processor arrays are no longer considered a viable option for general-purpose parallel computers.

Summary: Centralized Multiprocessors
Cache coherence problem: handled by a snooping, write-invalidation protocol. Synchronization: mutual exclusion and barriers, relying upon hardware instructions that have the net effect of atomically reading and updating a memory location. Only a small number of CPUs can be supported, limited by the shared memory bus.
Summary: Distributed Multiprocessors
A single global address space; cache coherence is more difficult, requiring a directory-based scheme.

Summary: Multicomputers
Multiple disjoint address spaces, and hence no cache coherence problem: whether a copy of a data item is up to date depends entirely upon the programmer. Multicomputers may be symmetrical (every node on the interconnection network, alongside a file server) or asymmetrical (a front-end computer mediates between users and the back-end nodes).
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationChapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST
Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationCMPE 511 TERM PAPER. Distributed Shared Memory Architecture. Seda Demirağ
CMPE 511 TERM PAPER Distributed Shared Memory Architecture by Seda Demirağ 2005701688 1. INTRODUCTION: Despite the advances in processor design, users still demand more and more performance. Eventually,
More informationAdvanced Parallel Architecture. Annalisa Massini /2017
Advanced Parallel Architecture Annalisa Massini - 2016/2017 References Advanced Computer Architecture and Parallel Processing H. El-Rewini, M. Abd-El-Barr, John Wiley and Sons, 2005 Parallel computing
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationWhy Multiprocessors?
Why Multiprocessors? Motivation: Go beyond the performance offered by a single processor Without requiring specialized processors Without the complexity of too much multiple issue Opportunity: Software
More informationComputer Organization. Chapter 16
William Stallings Computer Organization and Architecture t Chapter 16 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 11
More informationLecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Flynn Categories SISD (Single Instruction Single
More informationOrganisasi Sistem Komputer
LOGO Organisasi Sistem Komputer OSK 14 Parallel Processing Pendidikan Teknik Elektronika FT UNY Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple
More informationParallel Systems Prof. James L. Frankel Harvard University. Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved.
Parallel Systems Prof. James L. Frankel Harvard University Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved. Architectures SISD (Single Instruction, Single Data)
More informationNormal computer 1 CPU & 1 memory The problem of Von Neumann Bottleneck: Slow processing because the CPU faster than memory
Parallel Machine 1 CPU Usage Normal computer 1 CPU & 1 memory The problem of Von Neumann Bottleneck: Slow processing because the CPU faster than memory Solution Use multiple CPUs or multiple ALUs For simultaneous
More informationComputer Architecture: Parallel Processing Basics. Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Parallel Processing Basics Prof. Onur Mutlu Carnegie Mellon University Readings Required Hill, Jouppi, Sohi, Multiprocessors and Multicomputers, pp. 551-560 in Readings in Computer
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationDr. Joe Zhang PDC-3: Parallel Platforms
CSC630/CSC730: arallel & Distributed Computing arallel Computing latforms Chapter 2 (2.3) 1 Content Communication models of Logical organization (a programmer s view) Control structure Communication model
More informationCS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2
CS 770G - arallel Algorithms in Scientific Computing arallel Architectures May 7, 2001 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan Kaufmann
More informationParallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization
Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor
More informationModule 5 Introduction to Parallel Processing Systems
Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this
More informationCS Parallel Algorithms in Scientific Computing
CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan
More information06-Dec-17. Credits:4. Notes by Pritee Parwekar,ANITS 06-Dec-17 1
Credits:4 1 Understand the Distributed Systems and the challenges involved in Design of the Distributed Systems. Understand how communication is created and synchronized in Distributed systems Design and
More informationLecture 9: MIMD Architecture
Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is
More informationChapter 18. Parallel Processing. Yonsei University
Chapter 18 Parallel Processing Contents Multiple Processor Organizations Symmetric Multiprocessors Cache Coherence and the MESI Protocol Clusters Nonuniform Memory Access Vector Computation 18-2 Types
More informationComp. Org II, Spring
Lecture 11 Parallel Processor Architectures Flynn s taxonomy from 1972 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing (Sta09 Fig 17.1) 2 Parallel
More information4. Networks. in parallel computers. Advances in Computer Architecture
4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors
More informationParallel Processing & Multicore computers
Lecture 11 Parallel Processing & Multicore computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1)
More informationBlueGene/L. Computer Science, University of Warwick. Source: IBM
BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours
More informationComp. Org II, Spring
Lecture 11 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1) Computer
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationFlynn s Classification
Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:
More informationCSCI 4717 Computer Architecture
CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:
More informationMultiprocessors 1. Outline
Multiprocessors 1 Outline Multiprocessing Coherence Write Consistency Snooping Building Blocks Snooping protocols and examples Coherence traffic and performance on MP Directory-based protocols and examples
More informationIntroduction to parallel computing
Introduction to parallel computing 2. Parallel Hardware Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Motherboard Processor https://sites.google.com/
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationCS/COE1541: Intro. to Computer Architecture
CS/COE1541: Intro. to Computer Architecture Multiprocessors Sangyeun Cho Computer Science Department Tilera TILE64 IBM BlueGene/L nvidia GPGPU Intel Core 2 Duo 2 Why multiprocessors? For improved latency
More informationChapter 17 - Parallel Processing
Chapter 17 - Parallel Processing Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 17 - Parallel Processing 1 / 71 Table of Contents I 1 Motivation 2 Parallel Processing Categories
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy
More informationARCHITECTURAL CLASSIFICATION. Mariam A. Salih
ARCHITECTURAL CLASSIFICATION Mariam A. Salih Basic types of architectural classification FLYNN S TAXONOMY OF COMPUTER ARCHITECTURE FENG S CLASSIFICATION Handler Classification Other types of architectural
More informationTop500 Supercomputer list
Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More information18-447: Computer Architecture Lecture 30B: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013
18-447: Computer Architecture Lecture 30B: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013 Readings: Multiprocessing Required Amdahl, Validity of the single processor
More informationCommunication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.
Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationThree basic multiprocessing issues
Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated
More informationParallel Architecture. Hwansoo Han
Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationMULTIPROCESSORS AND THREAD LEVEL PARALLELISM
UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared
More informationLecture 17: Multiprocessors. Topics: multiprocessor intro and taxonomy, symmetric shared-memory multiprocessors (Sections )
Lecture 17: Multiprocessors Topics: multiprocessor intro and taxonomy, symmetric shared-memory multiprocessors (Sections 4.1-4.2) 1 Taxonomy SISD: single instruction and single data stream: uniprocessor
More informationCOSC 6385 Computer Architecture - Thread Level Parallelism (I)
COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationAleksandar Milenkovich 1
Parallel Computers Lecture 8: Multiprocessors Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection
More informationOutline. Distributed Shared Memory. Shared Memory. ECE574 Cluster Computing. Dichotomy of Parallel Computing Platforms (Continued)
Cluster Computing Dichotomy of Parallel Computing Platforms (Continued) Lecturer: Dr Yifeng Zhu Class Review Interconnections Crossbar» Example: myrinet Multistage» Example: Omega network Outline Flynn
More informationParallel Computers. c R. Leduc
Parallel Computers Material based on B. Wilkinson et al., PARALLEL PROGRAMMING. Techniques and Applications Using Networked Workstations and Parallel Computers c 2002-2004 R. Leduc Why Parallel Computing?
More informationCS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks March 9 th, 2011
CS252 Graduate Computer Architecture Lecture 14 Multiprocessor Networks March 9 th, 2011 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252
More informationIssues in Multiprocessors
Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel
More informationComputer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015
18-447 Computer Architecture Lecture 27: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015 Assignments Lab 7 out Due April 17 HW 6 Due Friday (April 10) Midterm II April
More informationA Multiprocessor system generally means that more than one instruction stream is being executed in parallel.
Multiprocessor Systems A Multiprocessor system generally means that more than one instruction stream is being executed in parallel. However, Flynn s SIMD machine classification, also called an array processor,
More informationIntroduction to Parallel Computing
Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen
More information