PCS - Part Two: Multiprocessor Architectures



Institute of Computer Engineering, University of Lübeck, Germany
Baltic Summer School, Tartu 2008

Part 2 - Contents
- Multiprocessor Systems
- Symmetrical Multiprocessors
- MultiCore
- Distributed Shared Memory
- Cache Coherency
- Memory Consistency
- Programming Models for Multiprocessor Systems

Multiprocessor Systems
Structure (figure): processors P0, P1, P2, P3, ..., P(p-1), each with its own cache, connected through a communication network to the memory modules (MEM).
- Shared memory for all processors, i.e. a common address space
- Coordination and cooperation through shared variables in memory
- The computer runs a single instance of the operating system

Shared Memory Multiprocessors
- Processors potentially operate independently and asynchronously
- Cooperative operation is obtained by software
Figure: processors 1 to N, each with a control unit (instructions) and an arithmetic-logical unit (data), connected through a communication network to global memories 1 to M.

Symmetrical Multiprocessors (SMP)
- Symmetry means: same processor type and equal cost for memory access, independent of the originating processor and of the accessed physical address
- Example: Intel SMP servers using the front-side bus
- SMPs have up to 32 processors; bigger systems do not perform well because the common bus becomes a bottleneck

MultiCore
- Many processor cores on a single chip, used like an SMP system
- Processors with private L1 cache
- Private or shared L2 cache
- All processors share a single memory interface
Figure: CPU1 and CPU2, each with an L1 cache, above an L2 cache and the bus interface.

Distributed Shared Memory (1)
- Mixed concept: Distributed Shared Memory (DSM)
- The hardware structure looks like a distributed memory architecture
- OS techniques combined with hardware acceleration provide a virtual shared memory
Figure: processors 1 to N, each with a control unit (instructions), an arithmetic-logical unit (data) and a local memory, connected through a communication network.

Distributed Shared Memory (2)
- Processor-memory nodes are connected to form a multiprocessor system
- The communication network is mostly a hierarchical switched network
- Asymmetric structure: the memory access cost differs, depending on the referenced address and on the processor that originates the access
- Non-Uniform Memory Access (NUMA; ccNUMA when cache coherent)

Example: SUN SF15K (1)
- Sun SF15K: ccNUMA multiprocessor system, 72 Sun UltraSPARC III processors at 900 MHz
- 18 system boards with 4 processors and 4 memory modules each
- Within a system board: UMA/SMP, cache coherency by snooping
- Across different system boards: directory-based cache coherency, implemented by SSM agents

Example: SUN SF15K (2)
Memory access times (750 MHz UltraSPARC III):

  Memory location     Access time
  same CPU            216 ns
  same board          235 ns
  different board     375 ns

Communication network:
- 18x18 crossbar for addresses and cache coherency control signals
- 18x18 crossbar for data transfer
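
The access times listed above can be turned into a rough estimate of the average memory latency. The following is only a sketch: the three access fractions are hypothetical inputs, not measurements from the SF15K, and the function names are made up for illustration.

```c
/* Access times from the slide (750 MHz UltraSPARC III), in nanoseconds. */
#define T_SAME_CPU_NS    216.0
#define T_SAME_BOARD_NS  235.0
#define T_OTHER_BOARD_NS 375.0

/* Average access time for a hypothetical access mix. The three
 * fractions are assumptions of this sketch and should sum to 1. */
double avg_access_ns(double f_cpu, double f_board, double f_remote)
{
    return f_cpu * T_SAME_CPU_NS
         + f_board * T_SAME_BOARD_NS
         + f_remote * T_OTHER_BOARD_NS;
}

/* Ratio of the slowest to the fastest access time. */
double numa_factor(void)
{
    return T_OTHER_BOARD_NS / T_SAME_CPU_NS;
}
```

For instance, a mix in which half of the accesses go to the memory of the same CPU and half to a different board gives avg_access_ns(0.5, 0.0, 0.5) = 295.5 ns. The ratio 375/216 is about 1.74, which shows why data placement matters on a NUMA machine.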

Shared Memory and Caching
- Caches relieve the network and main memory from frequent data transfers
- Non-shared data can be kept in caches for a long time without interaction with main memory
- This improves the scalability of the system, but introduces a consistency problem
- The problem is solved by cache coherency protocols
Consistency problem (figure): the cached copies of P1 and P2 are consistent at first; after a write by P1 they are inconsistent until the "write back".

Cache Coherency
- Coherency ensures that no old copies of data are used
- It is weaker than consistency: inconsistencies are allowed, but they are tracked
Protocols:
- Invalidation: a copy is invalidated when another processor writes to its address (snooping); write-through is always required
- MESI: keeps track of the usage of data by snooping; write-back only when necessary
- Directory-based cache coherency: for systems without a shared address bus
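
The consistency problem and the invalidation protocol described above can be reproduced in a toy model with one memory word and two private caches. This is a minimal sketch, not a description of real hardware: the type toy_system and the functions toy_init, toy_read and toy_write are hypothetical names, and the snooping is modelled by a simple loop.

```c
#include <stdbool.h>

#define NCPU 2

/* One memory word plus one private cache copy per processor. */
struct toy_system {
    int  mem;
    int  cache[NCPU];
    bool valid[NCPU];
    bool invalidate_on_write;   /* true: write-through + invalidation */
};

void toy_init(struct toy_system *s, int value, bool invalidate_on_write)
{
    s->mem = value;
    for (int p = 0; p < NCPU; p++)
        s->valid[p] = false;
    s->invalidate_on_write = invalidate_on_write;
}

/* Read: served from the cache if a valid copy exists, else from memory. */
int toy_read(struct toy_system *s, int p)
{
    if (!s->valid[p]) {
        s->cache[p] = s->mem;
        s->valid[p] = true;
    }
    return s->cache[p];
}

/* Write by processor p. */
void toy_write(struct toy_system *s, int p, int value)
{
    s->cache[p] = value;
    s->valid[p] = true;
    if (s->invalidate_on_write) {
        s->mem = value;                    /* write-through */
        for (int q = 0; q < NCPU; q++)     /* snooping caches drop their copy */
            if (q != p)
                s->valid[q] = false;
    }
    /* otherwise the new value stays in p's cache only (no write-back yet) */
}
```

With invalidate_on_write set to false, a read by the second processor after a write by the first still returns the old value from its own cache. With the flag set to true, the stale copy is invalidated and the read is served from the updated memory.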

Cache Coherency: MESI (1)
- Motivation for MESI: allow the write-back strategy as long as no other processor accesses the cached address
- Protocols similar to MESI also exist for DSM systems without a shared snooping medium (directory-based caches)
- The term MESI comes from the 4 states M, E, S and I

Cache Coherency: MESI (2)

  State  Name                  Meaning
  M      Exclusive Modified    The line is exclusively in this cache and was modified (written)
  E      Exclusive Unmodified  The line is exclusively in this cache but was not modified, i.e. it was only accessed by read operations
  S      Shared Unmodified     The line is also present in another processor's cache, but was not modified
  I      Invalid               The line was modified by another processor; the cache entry may not be used
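
The four states can be sketched as a transition function for a single cache line. The sketch follows the standard MESI transitions; the enum and function names are hypothetical, and the parameter others_have_copy stands in for the bus signal that tells a cache on a read miss whether another cache already holds the line.

```c
#include <stdbool.h>

enum mesi_state { MESI_I, MESI_S, MESI_E, MESI_M };

enum mesi_event {
    EV_LOCAL_READ,   /* read by the local processor (hit or miss) */
    EV_LOCAL_WRITE,  /* write by the local processor (hit or miss) */
    EV_SNOOP_READ,   /* another processor reads the line */
    EV_SNOOP_WRITE   /* another processor writes the line */
};

/*
 * Returns the next state of one cache line.
 * others_have_copy is only used on a read miss.
 * *write_back is set to true when a modified line must be copied
 * back to memory before the transition completes.
 */
enum mesi_state mesi_next(enum mesi_state s, enum mesi_event ev,
                          bool others_have_copy, bool *write_back)
{
    *write_back = false;
    switch (ev) {
    case EV_LOCAL_READ:
        if (s == MESI_I)                       /* read miss: line fill */
            return others_have_copy ? MESI_S : MESI_E;
        return s;                              /* read hit: no change */
    case EV_LOCAL_WRITE:
        return MESI_M;                         /* other copies get invalidated */
    case EV_SNOOP_READ:
        if (s == MESI_M)
            *write_back = true;                /* dirty line copy back */
        return (s == MESI_I) ? MESI_I : MESI_S;
    case EV_SNOOP_WRITE:
        if (s == MESI_M)
            *write_back = true;                /* dirty line copy back */
        return MESI_I;
    }
    return s;
}
```

A line that is read while no other cache holds it goes from I to E, a later local write moves it to M without any bus traffic, and a read by another processor then forces a write-back and leaves the line in S.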

Cache Coherency: MESI (3)
States and transitions (state diagram; figure taken from: T. Ungerer, Parallelrechner und Parallele Programmierung)
Local events:
- RM: read miss
- RH: read hit
- WM: write miss
- WH: write hit
Distant events:
- SHR: shared read
- SHW: shared write
States in the diagram: Invalid, Shared unmodified, Exclusive unmodified, Exclusive modified. Actions annotated on the transitions: dirty line copy back, invalidate, read with intent to modify, cache line fill.

Cache Coherency: MESI-like protocols
- MESI requires an address bus that is visible to all caches, so MESI is appropriate only for MultiCore and SMP systems
- DSM systems have no shared address bus; instead a decentralized network carries address and data transfers
- Directory-based cache coherence protocols:
  - Each memory line is tagged with the information which caches hold a copy of the line
  - A distributed protocol is invoked each time a memory line or a cache line is accessed
  - The SSM agent runs the coherency protocol on behalf of the local caches

Example: SUN SF15K (3)
- Within a system board: cache coherency by snooping and MESI
- Additionally, each system board contains an SSM agent that works according to a directory-based cache coherency algorithm
- Principle: cache coherency interactions remain local within the board as long as no board-distant processor is recorded in the presence vector
- If a copy of a memory line is stored in a board-distant cache/processor, the SSM agent runs the distributed protocol

Example: SUN SF15K (4)
Example: invalidating cache copies after a memory line was altered
- The SSM agent initiates the transfer of the address across the 18x18 address crossbar with the control wires set to Invalidate; the destination board is encoded in a part of the address
- The SSM agent of the destination board receives the address and transfers it via the local address bus with the control signal set to Shared Write
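
The role of the presence vector can be illustrated with a small directory entry per memory line. This is a simplified sketch and not the actual SF15K implementation: the names dir_entry, dir_add_sharer and dir_write are hypothetical, and one bit per system board is an assumption made for illustration.

```c
#include <stdint.h>

#define NBOARDS 18   /* number of system boards, as in the SF15K example */

/* Directory entry for one memory line: bit b is set when a cache on
 * board b holds a copy of the line (the presence vector). */
struct dir_entry {
    uint32_t presence;
};

/* Record that a cache on `board` fetched a copy of the line. */
void dir_add_sharer(struct dir_entry *e, int board)
{
    e->presence |= (uint32_t)1 << board;
}

/*
 * A processor on board `writer` alters the line. Returns the number of
 * other boards that must receive an invalidate message; 0 means the
 * coherency interaction stays local within the board. Afterwards only
 * the writer's board is recorded as holding the line.
 */
int dir_write(struct dir_entry *e, int writer)
{
    uint32_t self   = (uint32_t)1 << writer;
    uint32_t remote = e->presence & ~self;
    int messages = 0;

    for (int b = 0; b < NBOARDS; b++)
        if (remote & ((uint32_t)1 << b))
            messages++;          /* one invalidate per board-distant sharer */

    e->presence = self;
    return messages;
}
```

A write to a line that is cached only on the writer's own board returns 0, so no traffic crosses the crossbar. If boards 5 and 17 also hold copies, the same write returns 2, one invalidate message per remote board.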

Example: SUN SF15K (5)

Memory Consistency Models (1)
- A memory consistency model determines in which order processes get notice of memory accesses by other processes
- Is this really necessary, is there any problem? Normally not. But yes, because we introduced some optimizations into memory access:
  - Speculative read operations and non-blocking caches
  - Delayed write operations
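
Why the order matters can be seen in a simple hand-over of data between two threads. With delayed writes, the write to a flag could become visible before the write to the data it announces. The sketch below uses C11 atomics with release and acquire ordering to rule this out; C11 atomics are an assumption of this example (the slides do not name a mechanism), and the names payload, ready and run_message_passing are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;        /* ordinary shared variable */
static atomic_int ready;   /* flag used for synchronization */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                          /* (1) write the data */
    /* (2) publish: the write (1) may not be delayed past this store */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    int *out = arg;
    /* wait until the flag is visible */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;
    *out = payload;                        /* sees the value written in (1) */
    return NULL;
}

/* Runs both threads once and returns the value the consumer observed. */
int run_message_passing(void)
{
    pthread_t p, c;
    int seen = 0;

    payload = 0;
    atomic_store(&ready, 0);
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

If both accesses to ready used memory_order_relaxed instead, the language would no longer guarantee that the consumer sees the new payload, which is exactly the kind of reordering a consistency model has to define.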

Memory Consistency Models (2)
- Sequential Consistency: the result is the same as that of a sequential execution of all operations in some order that preserves the local order of each processor. All processors see the same order.
- Processor Consistency: the order complies with the local order of each processor, but the operations of different processors may be interleaved arbitrarily. Different processors may see different orders.
- Weak Consistency: order is guaranteed only in relation to synchronization operations (memory barriers)
- Release Consistency: accesses are classified as concurrent or non-concurrent. Non-concurrent accesses are seen in a processor-consistent way; concurrent accesses are ordered in relation to lock and release operations.

Programming Models (1)
- The choice is mainly influenced by the aspects of shared memory and cache coherency
Options:
- Multiple processes (using fork) that communicate via shared memory (Shmem) segments
- Multithreading: threads run on different nodes and utilize the parallel machine; the threads run in a shared address space
- OpenMP: a set of compiler directives for controlling multi-threaded, space-divided execution
More general programming models work on shared memory computers as well:
- Explicit message passing among multiple processes: Unix pipelines/sockets, MPI

Programming Models: Multithreading
- Several threads run on several processors under control of the operating system
- OS-specific thread functions, e.g. Solaris threads
- Portability standard: POSIX threads, pthread library
Basic functions:

    int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                       void *(*start_routine)(void *), void *arg);
    void pthread_exit(void *value_ptr);
    int pthread_join(pthread_t thread, void **value_ptr);

Programming Models: OpenMP
OpenMP example for loop parallelization:

    for (i = 0; i < 256; i++) {
        /* distribute the iterations of the inner loop over the threads */
        #pragma omp parallel for
        for (j = 0; j < 256; j++) {
            img[i][j] = img[i][j] - minvalue;
            img[i][j] = (int) ((float) img[i][j] * (float) maxvalue
                               / (float) (maxvalue - minvalue));
        }
    }

The #pragma preprocessor directive tells the compiler that the following for loop is to be parallelized.
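
The basic pthread functions can be combined into a small data-parallel example. The following sketch sums an array with a fixed number of threads; the names slice, sum_slice and parallel_sum and the thread count of 4 are choices made for illustration only.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4   /* arbitrary choice for this sketch */

struct slice {
    const int *data;
    size_t     begin, end;   /* half-open range [begin, end) */
    long       sum;          /* result, written by the worker */
};

/* Thread start routine: sums one slice of the array. */
static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    s->sum = 0;
    for (size_t i = s->begin; i < s->end; i++)
        s->sum += s->data[i];
    return NULL;
}

/* Sums data[0..n) using up to NTHREADS threads. */
long parallel_sum(const int *data, size_t n)
{
    pthread_t    tid[NTHREADS];
    int          started[NTHREADS];
    struct slice part[NTHREADS];
    long         total = 0;

    for (int t = 0; t < NTHREADS; t++) {
        part[t].data  = data;
        part[t].begin = n * t / NTHREADS;
        part[t].end   = n * (t + 1) / NTHREADS;
        started[t] = (pthread_create(&tid[t], NULL, sum_slice, &part[t]) == 0);
        if (!started[t])
            sum_slice(&part[t]);          /* fall back to the calling thread */
    }
    for (int t = 0; t < NTHREADS; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);   /* wait until part[t].sum is final */
        total += part[t].sum;
    }
    return total;
}
```

Each thread writes only to its own slice structure, so no lock is needed; pthread_join guarantees that the partial sums are complete before they are added up. On most systems the program has to be compiled and linked with the -pthread option.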

Summary Part 2
- Multiprocessor systems consist of many processors connected by a shared memory
- SMP, MultiCore and DSM
- Such systems scale up to 32 processors (SMP) and a few hundred processors (DSM)
- Cache coherency makes it possible to use caches transparently
- Programming models are based on shared memory, e.g. multiple threads
