Chapter 18. Parallel Processing. Yonsei University

1 Chapter 18 Parallel Processing

2 Contents Multiple Processor Organizations Symmetric Multiprocessors Cache Coherence and the MESI Protocol Clusters Nonuniform Memory Access Vector Computation 18-2

3 Types of Parallel Processor Systems Multiple Processor Organizations Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD Multiple instruction, multiple data stream - MIMD 18-3

4 Taxonomy of Parallel Processors Multiple Processor Organizations 18-4

5 SISD Single processor Single instruction stream Data stored in single memory Uni-processor Multiple Processor Organizations 18-5

6 SIMD Multiple Processor Organizations Single machine instruction Controls simultaneous execution Number of processing elements Lockstep basis Each processing element has associated data memory Each instruction executed on different set of data by different processors Vector and array processors 18-6

7 SIMD With distributed memory Multiple Processor Organizations 18-7

8 MISD Multiple Processor Organizations Sequence of data Transmitted to set of processors Each processor executes different instruction sequence Never been implemented 18-8

9 MIMD Multiple Processor Organizations Set of processors Simultaneously execute different instruction sequences Different sets of data SMPs, clusters and NUMA systems 18-9

10 MIMD With shared memory Multiple Processor Organizations 18-10

11 MIMD With distributed memory Multiple Processor Organizations 18-11

12 MIMD - Overview Multiple Processor Organizations General purpose processors Each can process all instructions necessary Further classified by method of processor communication 18-12

13 Tightly Coupled - SMP Multiple Processor Organizations Processors share memory Communicate via that shared memory Symmetric Multiprocessor (SMP) Share single memory or pool Shared bus to access memory Memory access time to given area of memory is approximately the same for each processor 18-13

14 Tightly Coupled - NUMA Multiple Processor Organizations Nonuniform memory access Access times to different regions of memory may differ 18-14

15 Loosely Coupled - Clusters Multiple Processor Organizations Collection of independent uniprocessors or SMPs Interconnected to form a cluster Communication via fixed path or network connections 18-15

16 Design Issues Multiple Processor Organizations Physical Organization Interconnection Structures Interprocessor Communication Operating System Design Application Software Techniques 18-16

17 SMP Characteristics Symmetric Multiprocessors Two or more similar processors Share the same main memory and I/O facilities All processors share access to I/O devices All processors can perform the same functions The system is controlled by an integrated OS Providing interaction between processors Interaction at job, task, file and data element levels 18-17

18 SMP Advantages Symmetric Multiprocessors Performance If some work can be done in parallel Availability Since all processors can perform the same functions, failure of a single processor does not halt the system Incremental growth User can enhance performance by adding additional processors Scaling Vendors can offer range of products based on number of processors 18-18

19 Multiprogramming & Multiprocessing Interleaving (multiprogramming) Symmetric Multiprocessors 18-19

20 Multiprogramming & Multiprocessing Interleaving and overlapping (multiprocessing) Symmetric Multiprocessors 18-20

21 Tightly Coupled Symmetric Multiprocessors Time-shared or common bus Multiport memory Central control unit 18-21

22 Time-Shared Bus Symmetric Multiprocessors To facilitate DMA transfers from I/O processors, the following features are provided Addressing Distinguish modules on the bus to determine the source and destination of data Arbitration Any module, including an I/O module, can temporarily function as bus master Time sharing When one module is controlling the bus, other modules are locked out and must, if necessary, suspend operation until bus access is achieved 18-22
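
A minimal sketch of the time-sharing rule in Python (illustrative only: the Lock stands in for the bus arbiter, and the module names are made up). While one module holds the bus, the others block until it is released.

import threading, time

# Time-shared bus modeled as a mutex: only one module (processor or I/O)
# can be bus master at a time; the others are locked out until release.
bus = threading.Lock()

def module(name, transfers):
    for i in range(transfers):
        with bus:                      # arbitration: block until the bus is granted
            print(f"{name} is bus master (transfer {i})")
            time.sleep(0.01)           # hold the bus for one data transfer

threads = [threading.Thread(target=module, args=(n, 2))
           for n in ("CPU-0", "CPU-1", "IO-0")]
for t in threads: t.start()
for t in threads: t.join()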

23 Time Shared Bus Symmetric Multiprocessors 18-23

24 Time Shared Bus : Advantages Symmetric Multiprocessors Simplicity The simplest approach to multiprocessor organization Flexibility Easy to expand the system by attaching more processors to the bus Reliability The bus is essentially a passive medium 18-24

25 Time Shared Bus : Drawbacks Symmetric Multiprocessors Performance The speed of the system is limited by the bus cycle time To improve performance, each processor should be equipped with a local cache Reduces the number of bus accesses But leads to problems with cache coherence Solved in hardware - see later 18-25

26 Multiport Memory Symmetric Multiprocessors The multiport memory allows the direct, independent access of main memory modules by each processor and I/O module Logic is required to resolve conflicts Little or no modification to processors or modules is required 18-26

27 Multiport Memory Symmetric Multiprocessors Advantages Little or no modification is needed for either processor or I/O modules to accommodate multiport memory Better performance than bus approach Security To configure portions of memory as private to one or more processors and/or I/O module Disadvantage Logic associated with memory is required for resolving conflicts More complex than the bus approach 18-27

28 Central Control Unit Symmetric Multiprocessors To separate data streams back and forth between independent modules: processor, memory, I/O module The controller can buffer requests and perform arbitration and timing functions To pass status and control messages between processors To perform cache update alerting Advantages flexibility and simplicity of interfacing of the bus approach Disadvantages the control unit is quite complex A potential performance bottleneck 18-28

29 Multiprocessor OS Design Symmetric Multiprocessors An SMP OS manages processor and other computer resources so that the user perceives a single OS controlling system resources The key design issues Simultaneous concurrent processes Scheduling Synchronization Memory management Reliability and fault tolerance 18-29

30 S/390 SMP Symmetric Multiprocessors 18-30

31 S/390 SMP Components Symmetric Multiprocessors PU CISC microprocessor with a 64-KB L1 cache L2 cache 384 KB, arranged in clusters of two Each cluster supports three PUs Bus-switching network adapter (BSN) to interconnect the L2 caches and the main memory Includes a level 3 (L3) cache (2 MB) Memory card 8 GB per card 18-31

32 Switched Interconnection Symmetric Multiprocessors The single bus becomes a bottleneck affecting the scalability The S/390 copes with this problem in two ways Main memory is split into four separate cards, each with its own storage controller that can handle memory accesses at high speeds The connection from processors to a single memory card is not in the form of a shared bus but rather point to point links, where each link connects a group of three processors via an L2 cache to a BSN 18-32

33 Shared L2 Caches Symmetric Multiprocessors Needs for shared L2 caches In moving from G3 (generation 3) to G4, IBM doubled the speed of the microprocessors If the G3 organization were retained, a significant increase in bus traffic would occur At the same time, it was desired to reuse as many G3 components as possible Without a significant bus upgrade, the BSNs would become a bottleneck Analysis of typical S/390 workloads revealed a large degree of sharing of instructions and data among processors The solution: one or more L2 caches, each shared by multiple processors 18-33

34 L3 Cache Symmetric Multiprocessors Each L3 cache provides a buffer between the L2 caches and one memory card Provides the data much more quickly than a main memory access if the requested cache line is already shared by other processors but was not recently used by the requesting processor
Memory subsystem | Access penalty (PU cycles) | Cache size | Hit rate (%)
L1 cache | 1 | 32 KB | 89
L2 cache | 5 | 384 KB | 5
L3 cache | 14 | 2 MB | 3
Memory | 32 | 8 GB | 3
18-34
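
As a worked check on the table, the average access penalty is the hit-rate-weighted sum of the per-level penalties; a few lines of Python, with the values copied from the table above:

# Effective access penalty for the S/390 memory hierarchy:
# sum over levels of (hit rate x access penalty).
levels = [
    ("L1 cache",  1, 0.89),
    ("L2 cache",  5, 0.05),
    ("L3 cache", 14, 0.03),
    ("Memory",   32, 0.03),
]
average = sum(penalty * rate for _, penalty, rate in levels)
print(f"Average access penalty: {average:.2f} PU cycles")   # about 2.52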

35 Cache Coherence Problem Cache Coherence and the MESI Protocol Problem - multiple copies of same data in different caches Can result in an inconsistent view of memory Write back Write operations are usually made only to the cache Main memory is only updated when the corresponding cache line is flushed from the cache Write through All write operations are made to main memory as well as to the cache, ensuring that main memory is always valid 18-35
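
A minimal sketch of the problem, assuming a toy write-back Cache class (all names here are illustrative, not a protocol implementation): two processors cache the same line, one writes, and the other keeps seeing the stale value until the dirty line is flushed back.

# Two private write-back caches over one shared memory.
memory = {0x100: 7}

class Cache:
    def __init__(self):
        self.lines = {}                      # addr -> (value, dirty)
    def read(self, addr):
        if addr not in self.lines:           # miss: fill from main memory
            self.lines[addr] = (memory[addr], False)
        return self.lines[addr][0]
    def write(self, addr, value):
        self.lines[addr] = (value, True)     # write-back: memory not updated
    def flush(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:
            memory[addr] = value             # memory updated only on flush

c0, c1 = Cache(), Cache()
print(c0.read(0x100), c1.read(0x100))   # 7 7   both caches hold the line
c0.write(0x100, 42)                     # processor 0 updates its private copy
print(c1.read(0x100))                   # 7     stale: inconsistent view of memory
c0.flush(0x100)
print(memory[0x100])                    # 42    memory valid only after the flush

With write-through, main memory stays valid, but the other cache's copy is still stale, which is why a coherence protocol is needed either way.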

36 Software Solutions Cache Coherence and the MESI Protocol Compiler-based coherence mechanisms: compiler and operating system deal with the problem Overhead transferred to compile time Design complexity transferred from hardware to software Simplest approach Prevent any shared data variables from being cached More efficient approach Analyze the code to determine safe periods for shared variables 18-36

37 Hardware Solutions Cache Coherence and the MESI Protocol Cache coherence protocols Dynamic recognition at run time of potential inconsistency conditions More efficient use of cache Hardware schemes can be divided into two categories Directory protocols Snoopy protocols 18-37

38 Directory Protocols Cache Coherence and the MESI Protocol Collect and maintain information about copies of data in cache Directory stored in main memory Requests are checked against directory Appropriate transfers are performed Creates central bottleneck Effective in large scale systems with complex interconnection schemes 18-38

39 Snoopy Protocols Cache Coherence and the MESI Protocol Distribute cache coherence responsibility among cache controllers Cache recognizes that a line is shared Updates announced to other caches Suited to bus based multiprocessor Increases bus traffic 18-39

40 Snoopy Protocols Cache Coherence and the MESI Protocol Two approaches to the snoopy protocol Write-invalidate Write-update (write-broadcast) Performance depends on the number of local caches and the pattern of memory reads and writes Some systems use adaptive protocols that employ both approaches 18-40

41 Write Invalidate Cache Coherence and the MESI Protocol Multiple readers, one writer When a write is required, all other caches of the line are invalidated Writing processor then has exclusive (cheap) access until line required by another processor Used in Pentium II and PowerPC systems State of every line is marked as modified, exclusive, shared or invalid MESI 18-41

42 Write Update Cache Coherence and the MESI Protocol Multiple readers and writers When a processor wishes to update a shared line, the word to be updated is distributed to all others and caches containing that line can update it 18-42

43 MESI Protocol Cache Coherence and the MESI Protocol Modified : The line in the cache has been modified and is available only in this cache Exclusive : The line in the cache is the same as that in main memory and is not present in any other cache Shared : The line in the cache is the same as that in main memory and may be present in another cache Invalid : The line in the cache does not contain valid data 18-43

44 MESI State Transition Diagram Cache Coherence and the MESI Protocol 18-44

45 MESI State Transition Diagram (a) Line in cache at initiating processor Cache Coherence and the MESI Protocol 18-45

46 MESI State Transition Diagram (b) Line in snooping cache Cache Coherence and the MESI Protocol 18-46

47 Read Miss Cache Coherence and the MESI Protocol When a read miss occurs in the local cache, the processor initiates a memory read to read the line of main memory containing the missing address The processor inserts a signal on the bus that alerts all other processor/cache units to snoop the transaction 18-47

48 Read Hit Cache Coherence and the MESI Protocol When a read hit occurs on a line currently in the local cache, the processor simply reads the required item There is no state change The state remains modified, shared, or exclusive 18-48

49 Write Miss Cache Coherence and the MESI Protocol When a write miss occurs in the local cache, the processor initiates a memory read to read the line of main memory containing the missing address RWITM(read-with-intent-to-modify) When the line is loaded, it is immediately marked modified 18-49

50 Write Hit Cache Coherence and the MESI Protocol When a write hit occurs on a line currently in the local cache, the effect depends on the current state of that line in the local cache: Shared Exclusive Modified 18-50
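
The four cases above can be condensed into a transition table. Below is a simplified, illustrative Python model of MESI for a single cache line (it ignores bus timing and write-back details; the event names follow the slides, including RWITM):

# Local events are reads/writes by this processor; bus events are snooped
# transactions from other processors. States: M, E, S, I.
LOCAL = {
    ("S", "read"):  "S",   # read hit: no state change
    ("E", "read"):  "E",
    ("M", "read"):  "M",
    ("S", "write"): "M",   # write hit on shared line: invalidate other copies
    ("E", "write"): "M",   # write hit on exclusive line: silent upgrade
    ("M", "write"): "M",
    ("I", "write"): "M",   # write miss: RWITM, line loaded and marked modified
}
SNOOP = {
    ("M", "bus_read"):  "S",   # supply the dirty line, drop to shared
    ("E", "bus_read"):  "S",
    ("S", "bus_read"):  "S",
    ("M", "bus_rwitm"): "I",   # another processor writes: invalidate
    ("E", "bus_rwitm"): "I",
    ("S", "bus_rwitm"): "I",
}

def local(state, event, others_have_copy=False):
    if (state, event) == ("I", "read"):           # read miss
        return "S" if others_have_copy else "E"   # shared if another cache has it
    return LOCAL[(state, event)]

s = local("I", "read")        # read miss, no other cache holds the line
print(s)                      # E
s = local(s, "write")         # write hit on an exclusive line
print(s)                      # M
s = SNOOP[(s, "bus_read")]    # another processor reads the line
print(s)                      # S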

51 L1-L2 Cache Consistency Cache Coherence and the MESI Protocol Need to maintain data integrity across both levels of cache and across all caches in the SMP configuration Extend the MESI protocol to the L1 caches Each line in the L1 cache includes bits to indicate the state Adopt the write-through policy in the L1 cache (a write-back L1 would be more complex) The write-through is to the L2 cache, not to main memory 18-51

52 Clustering Clusters Clustering is an alternative to symmetric multiprocessing (SMP) Cluster A group of interconnected, whole computers working together as a unified computing resource that can create the illusion of being one machine Whole computer A system that can run on its own, apart from the cluster Benefits Absolute scalability Incremental scalability High availability Superior price/performance 18-52

53 Cluster Configurations (a) Standby Server with No Shared Disk Clusters 18-53

54 Cluster Configurations (b) Shared Disk Clusters 18-54

55 Clustering Methods Clusters
Clustering Method | Description | Benefits | Limitations
Passive standby | A secondary server takes over in case of primary server failure | Easy to implement | High cost because the secondary server is unavailable for other processing tasks
Active secondary | The secondary server is also used for processing tasks | Reduced cost because secondary servers can be used for processing | Increased complexity
Separate servers | Separate servers have their own disks; data is continuously copied from primary to secondary server | High availability | High network and server overhead due to copying operations
Servers connected to disks | Servers are cabled to the same disks, but each server owns its disks; if one server fails, its disks are taken over by the other server | Reduced network and server overhead due to elimination of copying operations | Usually requires disk mirroring or RAID technology to compensate for risk of disk failure
Servers share disks | Multiple servers simultaneously share access to disks | Low network and server overhead; reduced risk of downtime caused by disk failure | Requires lock manager software; usually used with disk mirroring or RAID technology
18-55

56 Operating System Design Issues Clusters Failure Management Failover Failback Load Balancing Clusters versus SMP 18-56

57 Failure Management Clusters A highly available cluster A high probability that all resources will be in service If a failure does occur, the queries in progress are lost Any lost query, if retried, will be serviced by a different computer in the cluster A fault-tolerant cluster All resources are always available Achieved by the use of redundant shared disks and mechanisms for backing out uncommitted transactions Failover: switching resources over from a failed to an alternative system Failback: restoring resources to the original system 18-57
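
A failover sketch under simple assumptions (heartbeat timeout only; no fencing or quorum, and all names are hypothetical):

import time

class Node:
    def __init__(self, name):
        self.name = name
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()   # called periodically while alive

def active_node(primary, standby, timeout=1.0):
    # Failover: if the primary's heartbeat is stale, the standby serves.
    # Failback is the reverse decision once the primary resumes heartbeats.
    if time.monotonic() - primary.last_heartbeat > timeout:
        return standby
    return primary

primary, standby = Node("server-A"), Node("server-B")
print(active_node(primary, standby).name)   # server-A
time.sleep(1.1)                             # primary crashes: heartbeats stop
print(active_node(primary, standby).name)   # server-B (failover)
primary.heartbeat()                         # primary recovers
print(active_node(primary, standby).name)   # server-A (failback)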

58 Load Balancing Clusters Incremental scalability Automatically include new computers in scheduling Middleware needs to recognise that processes may switch between machines 18-58
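
A least-loaded dispatch sketch (illustrative node names): new computers join scheduling simply by being pushed into the pool, which is the incremental-scalability point above.

import heapq

nodes = [(0, "node-1"), (0, "node-2")]   # (current load, name)
heapq.heapify(nodes)

def dispatch(job_cost=1):
    load, name = heapq.heappop(nodes)              # pick the least-loaded node
    heapq.heappush(nodes, (load + job_cost, name))
    return name

print([dispatch() for _ in range(4)])    # jobs alternate between the two nodes
heapq.heappush(nodes, (0, "node-3"))     # add a computer to the cluster
print(dispatch())                        # node-3 is scheduled immediately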

59 Parallelizing Clusters Single application executing in parallel on a number of machines in the cluster Compiler Determines at compile time which parts can be executed in parallel Split off for different computers Application Application written from scratch to be parallel Message passing to move data between nodes Hard to program Best end result Parametric computing If a problem is repeated execution of an algorithm on different sets of data, e.g. simulation using different scenarios Needs effective tools to organize and run 18-59
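
A parametric-computing sketch: the same (toy) simulation run over different parameter sets, one scenario per worker process. The model and parameters are made up for illustration.

from multiprocessing import Pool

def simulate(params):
    rate, steps = params
    value = 1.0
    for _ in range(steps):        # stand-in for one simulation scenario
        value *= 1.0 + rate
    return params, value

if __name__ == "__main__":
    scenarios = [(0.01, 100), (0.02, 100), (0.03, 100)]   # different data sets
    with Pool() as pool:
        for params, result in pool.map(simulate, scenarios):
            print(params, "->", round(result, 3))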

60 Cluster Computer Architecture Clusters 18-60

61 Cluster Middleware Unified image to user Single system image Single point of entry Single file hierarchy Single control point Single virtual networking Single memory space Single job management system Single user interface Single I/O space Single process space Checkpointing Process migration Clusters 18-61

62 Clusters vs SMP Both provide multiprocessor support to high demand applications Both available commercially SMP for longer The advantages of the SMP approach SMP is easier to manage and configure than a cluster SMP usually takes up less physical space and draws less power than a cluster The advantages of the cluster approach Likely to result in clusters dominating the high-performance server market Superior in terms of incremental and absolute scalability Superior in terms of availability All components of the system can readily be made highly redundant Clusters 18-62

63 Nonuniform Memory Access (NUMA) Nonuniform Memory Access Alternative to SMP & clustering Uniform memory access All processors have access to all parts of memory using load & store Access time to all regions of memory is the same Access time to memory for different processors same As used by SMP Nonuniform memory access All processors have access to all parts of memory using load & store Access time of processor differs depending on region of memory Different processors access different regions of memory at different speeds Cache coherent NUMA Cache coherence is maintained among the caches of the various processors Significantly different from SMP and clusters 18-63

64 Motivation Nonuniform Memory Access SMP has a practical limit to the number of processors Bus traffic limits it to between 16 and 64 processors In clusters each node has its own memory Apps do not see a large global memory Coherence maintained by software not hardware NUMA retains SMP flavour while giving large-scale multiprocessing e.g. Silicon Graphics Origin NUMA: 1024 MIPS R10000 processors The objective with NUMA is to maintain a transparent system-wide memory while permitting multiple multiprocessor nodes, each with its own bus or other internal interconnect system 18-64

65 CC-NUMA Organization Nonuniform Memory Access 18-65

66 CC-NUMA Operation Each processor has own L1 and L2 cache Each node has own main memory Nodes connected by some networking facility Each processor sees single addressable memory space Memory request order: L1 cache (local to processor) L2 cache (local to processor) Main memory (local to node) Remote memory Delivered to requesting (local to processor) cache Automatic and transparent Nonuniform Memory Access
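
The lookup order can be written down directly; the sketch below uses invented latencies purely to show why locality matters on a CC-NUMA machine.

# CC-NUMA read path: search local levels first, then fetch remotely.
LATENCY = {"L1": 1, "L2": 5, "local_mem": 50, "remote_mem": 300}   # made-up cycles

remote_mem = {0x798: "data"}    # memory that lives on another node

def load(addr, l1, l2, local_mem):
    """Return (value, cost in cycles) following the order above."""
    if addr in l1:
        return l1[addr], LATENCY["L1"]
    if addr in l2:
        return l2[addr], LATENCY["L2"]
    if addr in local_mem:
        return local_mem[addr], LATENCY["local_mem"]
    return remote_mem[addr], LATENCY["remote_mem"]   # transparent remote fetch

print(load(0x798, l1={}, l2={}, local_mem={}))   # ('data', 300): nonuniform cost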

67 Organization Nonuniform Memory Access Example Suppose that P2-3 requests memory location 798, which is in the memory of node 1: 1. P2-3 issues a read request on the snoopy bus of node 2 2. The directory on node 2 sees the request and recognizes that the location is in node 1 3. Node 2 s directory sends a request to node 1 4. Node 1 s directory requests the contents of 798 5. Node 1 s main memory responds by putting the requested data on the bus 6. Node 1 s directory picks up the data from the bus 7. The value is transferred back to node 2 s directory 8. Node 2 s directory places the data back on node 2 s bus 9. The value is picked up and placed in P2-3 s cache and delivered to P2-3 18-67

68 Cache Coherence Nonuniform Memory Access Node 1 directory keeps note that node 2 has a copy of the data If data is modified in a cache, this is broadcast to other nodes Local directories monitor and purge the local cache if necessary Local directory monitors changes to local data in remote caches and marks memory invalid until writeback Local directory forces writeback if the memory location is requested by another processor 18-68

69 NUMA Pros and Cons Nonuniform Memory Access The advantage of a CC-NUMA system Can deliver effective performance at higher levels of parallelism than SMP, without requiring major software changes The disadvantages of a CC-NUMA system Performance can break down if there is too much access to remote memory Can be avoided by: L1 & L2 cache design reducing all memory accesses Needs good temporal and spatial locality of software Virtual memory management moving pages to nodes that are using them most Does not transparently look like an SMP Software changes will be required to move an operating system and applications from an SMP to a CC-NUMA system Availability 18-69

70 Vector Computation Vector Computation Maths problems involving physical processes present different difficulties for computation Aerodynamics, seismology, meteorology Continuous field simulation High precision Repeated floating point calculations on large arrays of numbers 18-70

71 Vector Computation Vector Computation Supercomputers handle these types of problem Hundreds of millions of flops $10-15 million Optimised for calculation rather than multitasking and I/O Limited market Research, government agencies, meteorology Array processor Alternative to supercomputer Configured as peripherals to mainframe & mini Just run vector portion of problems 18-71

72 Example of Vector Addition Vector Computation 18-72

73 Matrix Multiplication (C = A x B) Vector Computation 18-73

74 Approaches Vector Computation General-purpose computers rely on iteration to do vector calculations In the example above this needs six calculations Vector processing Assumes it is possible to operate on a one-dimensional vector of data All elements in a particular row can be calculated in parallel Parallel processing Independent processors functioning in parallel Use FORK N to start an individual process at location N JOIN N causes N independent processes to join and merge following JOIN The O/S coordinates JOINs Execution is blocked until all N processes have reached JOIN 18-74
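
The three approaches side by side for C = A + B, sketched in Python: a plain loop, NumPy standing in for a vector instruction, and a process pool standing in for FORK/JOIN (the chunk boundaries are arbitrary).

from concurrent.futures import ProcessPoolExecutor
import numpy as np

A = np.arange(6, dtype=float)
B = np.ones(6)

def add_chunk(bounds):
    lo, hi = bounds
    return A[lo:hi] + B[lo:hi]

if __name__ == "__main__":
    # 1. General-purpose iteration: one element per step (six calculations)
    C = np.empty(6)
    for i in range(6):
        C[i] = A[i] + B[i]

    # 2. Vector processing: one operation over the whole one-dimensional vector
    C_vec = A + B

    # 3. Parallel processing: FORK one task per chunk, JOIN merges the results
    with ProcessPoolExecutor() as pool:                      # FORK
        parts = list(pool.map(add_chunk, [(0, 3), (3, 6)]))  # JOIN on completion
    C_par = np.concatenate(parts)

    assert (C == C_vec).all() and (C == C_par).all()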

75 Processor Designs Vector Computation Pipelined ALU Within operations Across operations Parallel ALUs Parallel processors 18-75

76 Approaches to Vector Computation (a) Pipelined ALU Vector Computation 18-76

77 Approaches to Vector Computation (b) Parallel ALUs Vector Computation 18-77

78 Pipelined Processing Vector Computation 18-78

79 Ex. Computing C=(s x A) + B Vector Computation 1. Vector load A->Vector Register(VR1) 2. Vector load B->VR2 3. Vector multiply s x VR1->VR3 4. Vector add VR3 + VR2 -> VR4 5. Vector store VR4->C 18-79
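
The same five steps, emulated with NumPy arrays standing in for the vector registers VR1-VR4 (illustrative data only):

import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 5.0, 6.0])
s = 2.0

VR1 = A.copy()     # 1. vector load  A -> VR1
VR2 = B.copy()     # 2. vector load  B -> VR2
VR3 = s * VR1      # 3. vector multiply  s x VR1 -> VR3
VR4 = VR3 + VR2    # 4. vector add  VR3 + VR2 -> VR4
C = VR4.copy()     # 5. vector store  VR4 -> C
print(C)           # [ 6.  9. 12.]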

80 Chaining Vector Computation Cray Supercomputers Vector operation may start as soon as first element of operand vector available and functional unit is free Result from one functional unit is fed immediately into another If vector registers used, intermediate results do not have to be stored in memory 18-80
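
A back-of-the-envelope cycle model of what chaining buys (the pipeline depth and vector lengths are assumed numbers, not Cray figures): two dependent pipelined operations either run back to back or overlap element by element.

def unchained(n, depth):
    # the second functional unit waits for the entire first result vector
    return (depth + n) + (depth + n)

def chained(n, depth):
    # the second unit starts as soon as the first element is available
    return depth + depth + n

for n in (64, 256):
    print(n, unchained(n, depth=6), chained(n, depth=6))
# chaining roughly halves the time for long dependent vector operations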

81 Taxonomy of Computer Organization Vector Computation 18-81

82 Taxonomy of Computer Organization Vector Computation Parallel processors The multiple processors can function cooperatively on a given task Vector processor Often equated with the pipelined ALU organization But may also be designed with a parallel ALU organization or as a parallel processor Array processing Usually refers to a parallel ALU, though any of the three organizations optimized for the processing of arrays Sometimes refers to an auxiliary processor attached to a general-purpose processor and used to perform vector computation The pipelined ALU organization dominates the marketplace Less complex 18-82

83 IBM 3090 Vector Facility Vector Computation The IBM facility makes use of a number of vector registers Each register is actually a bank of scalar registers To compute the vector sum C=A+B, the vectors A and B are loaded into two vector registers The data from these registers are passed through the ALU as fast as possible The overlap of computation, and the loading of the input data into the registers in a block, result in a significant speedup over an ordinary ALU operation 18-83

84 Organization Vector Computation The fixed and predetermined structure of vector data permits housekeeping instructions inside the loop to be replaced by faster internal machine operations Data-access and arithmetic operations on several successive vector elements can proceed concurrently by overlapping such operations in a pipelined design or by performing multiple-element operations in parallel The use of vector registers for intermediate results avoids additional storage references 18-84

85 IBM 3090 with Vector Facility Vector Computation 18-85

86 Registers Vector Computation The IBM organization is referred to as register-to-register, because the vector operands, both input and output, can be stored in vector registers The main disadvantage of the use of vector registers is that the programmer or compiler must take them into account 18-86

87 Alternative Programs Vector Computation 18-87

88 Alternative Programs (a) Storage to Storage Vector Computation 18-88

89 Alternative Programs Vector Computation (b) Register to Register 18-89

90 Alternative Programs (c) Storage to Register Vector Computation 18-90

91 Alternative Programs (d) Compound Instructions Vector Computation 18-91

92 Registers Vector Computation The vector registers can also be coupled in pairs to form eight 64-bit vector registers The architecture specifies that each register contains from 8 to 512 scalar elements Three additional registers are needed by the vector facility 18-92

93 Registers of the IBM 3090 Vector Computation 18-93

94 Compound Instruction Vector Computation Instruction execution can be overlapped using chaining to improve performance The designers of the IBM vector facility chose not to include this capability for several reasons Compound instructions do not require the use of additional registers for temporary storage of intermediate results, and they require one less register access 18-94

95 The Instruction Set Vector Computation There are memory-to-register load and register-to-memory store instructions Many of the instructions use a three-operand format 18-95

96 IBM 3090 Vector Facility Vector Computation 18-96
