UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Departamento de Engenharia Informática

Architectures for Embedded Computing
MEIC-A, MEIC-T, MERC
Lecture Slides, Version 3.0 (English)

Lecture 11: Multiprocessors - Classification and Shared Memory Architectures
Summary: Multiprocessor classification; MIMD architectures (shared memory and distributed memory); memory coherency and consistency.
2010/2011
Nuno.Roma@ist.utl.pt
Architectures for Embedded Computing
Multiprocessors: Classification and Shared Memory Architectures
Prof. Nuno Roma, ACE 2010/11 - DEI-IST

In the previous class...
- Multiple-issue processors;
- Superscalar processors;
- Very Long Instruction Word (VLIW) processors;
- Code optimization for multiple-issue processors;
- Multi-threading.
Today:
- Multiprocessor classification;
- MIMD architectures:
  - Shared memory;
  - Distributed memory:
    - Distributed shared memory;
    - Multi-computers;
- Memory coherency and consistency.

Bibliography: Computer Architecture: A Quantitative Approach, Chapter 4.
Parallel Processing

Objectives:
- Greater performance;
- Efficient use of silicon resources;
- Reduction of power consumption.

Implementation:
- Better use of the silicon area, by integrating several processors (cores) in a single chip: Chip Multiprocessor (CMP);
- Interconnection of several independent processors (e.g., clusters, grids, etc.).

Difficulties:
- Parallelizing the software...

Example: Homogeneous multi-core processor
Classification of Multi-Processor Systems

Type                                 Architecture    Management            Examples
General Purpose Processor (GPP)      Homogeneous     Hardware              Intel, AMD, IBM Power, SUN, etc. multi-core families
Dedicated Processors / Accelerators  Heterogeneous   Hardware + Software   Cell (PS3); GPUs (NVidia); FPGA/ASIC dedicated accelerators
Parallelism Levels

- Simultaneous execution of several sequential instruction phases: Pipelining;
- Parallel execution of the instructions of a given application in a single processor: Superscalar and VLIW processors;
- Parallel execution in several processors in a single computer: Multiprocessors;
- Parallel execution in several computers: Clusters, Grids.
Multiprocessor Classes

- SISD (Single Instruction, Single Data): the uniprocessor case;
- SIMD (Single Instruction, Multiple Data): the same instruction is executed by the several processors, but each processor operates on an independent data set: vector architectures;
- MISD (Multiple Instruction, Single Data): each processor executes a different instruction, but all process the same data set: there isn't any commercial solution of this type;
- MIMD (Multiple Instruction, Multiple Data): each processor executes independent instructions over an independent data set.
MIMD Architectures

More popular due to:
- Greater flexibility;
- Same components as the uni-processors.

MIMD architectures can be divided into two classes:
- Shared memory (e.g., multi-core processors);
- Distributed memory (e.g., clusters, grids, etc.).

Shared Memory

Shared memory architecture, also known as:
- Uniform Memory Access (UMA), or
- Symmetric Shared-Memory Multiprocessors (SMP).
Distributed Memory

Distributed memory architecture, also known as Non-Uniform Memory Access (NUMA).

Shared vs Distributed Memory

In distributed architectures, most memory accesses are done in the local memory:
- Allows greater memory access bandwidth;
- Reduction of the access time.

However:
- Communication between processors is more complex;
- Increased access time to the data stored in other processors' local memories.
Shared Memory Architectures

Uniform Memory Access (UMA) or Symmetric Shared-Memory Multiprocessors (SMP).

Example: Homogeneous multi-core processor

Memory sharing:
- Level 1 caches (L1): private;
- Level 2 caches (L2): private (e.g., AMD) or shared (e.g., Intel);
- Level 3 cache (L3): shared;
- Main memory: shared.
Memory Coherency

Example, considering a write-through cache:

Time  Event                     Cache uP A   Cache uP B   Memory M[X]
0                                                         1
1     uP A reads M[X]           1                         1
2     uP B reads M[X]           1            1            1
3     uP A writes 0 to M[X]     0            1            0

After time 3, uP B still holds the stale value 1 in its cache. In multi-processors, the migration and the replication of data are normal and expected events.

A memory system is said to be coherent when a read from a given memory position returns the most recent value that was written to that position.

Coherency:
- Defines which values can be returned by a read;
- Read and write access behavior to a certain memory position by a given processor.

Consistency:
- Defines when a given written value is returned by a subsequent read;
- Read and write access behavior to a certain memory position by several different processors (synchronization).
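The incoherence in the table above can be reproduced with a minimal sketch in Python. This is a hypothetical model of a single cached location, not any real hardware: with write-through but no invalidation, a write by processor A updates memory yet leaves a stale copy in processor B's cache.

```python
# Toy model of the write-through example: no invalidation mechanism,
# so a write by one processor leaves stale copies in the other caches.

class WriteThroughCache:
    """Caches a single location X; write-through updates memory on every write."""
    def __init__(self, memory):
        self.memory = memory
        self.copy = None            # local cached value of X (None = not cached)

    def read(self):
        if self.copy is None:       # miss: fetch from memory
            self.copy = self.memory["X"]
        return self.copy            # hit: return the (possibly stale) copy

    def write(self, value):
        self.copy = value           # update the local copy...
        self.memory["X"] = value    # ...and memory, but NOT the other caches


def run_example():
    memory = {"X": 1}
    cache_a = WriteThroughCache(memory)
    cache_b = WriteThroughCache(memory)
    cache_a.read()                  # time 1: uP A reads M[X] -> 1
    cache_b.read()                  # time 2: uP B reads M[X] -> 1
    cache_a.write(0)                # time 3: uP A writes 0 to M[X]
    # B's cached copy is now incoherent with memory:
    return cache_a.read(), cache_b.read(), memory["X"]
```

Running `run_example()` returns `(0, 1, 0)`: cache A and memory agree on 0, while cache B still answers 1.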
Memory Coherency

A given memory system is said to be coherent if:
- A read of M[X] by processor P, after a write to M[X] by P, always returns the value written by P, provided that no other processor has written to M[X] between the write and the read;
- A read of M[X] by processor Pi, after a write to M[X] by Pj, always returns the value written by Pj, if the read and the write are sufficiently separated in time and no other writes to M[X] occur between the two accesses;
- Writes to the same location are serialized: two writes to the same location by any two processors are seen in the same order by all processors.
Memory Consistency

The consistency model of a memory system defines when a change to a memory position becomes visible to all processors.

P1: A = 0;              P2: B = 0;
    ...                     ...
    A = 1;                  B = 1;
L1: if (B == 0) ...     L2: if (A == 0) ...

What happens if a processor is allowed to proceed while the (slower) write operation is still taking place (e.g., by using write buffers)? It is possible that neither P1 nor P2 sees the most recent values of B and A before evaluating the test condition.

Sequential consistency: the program only proceeds after all processors have been informed about the write operation.
Coherency Protocols

The coherency protocols keep and check the status of the shared memory blocks:
- Snooping protocols: each cache has a copy of the shared block's data and of the corresponding sharing status; there isn't any centralized status. Each cache controller listens to the memory bus, to determine whether or not it has a copy of the block that is being requested on the bus:
  - Write-invalidate protocols;
  - Write-update (or broadcast) protocols;
- Directory-based protocols: the status of each shared block is kept in a centralized directory.

Snooping + Write-Invalidate Protocols

Example, considering a write-through cache:

uP Action              Bus Action          Cache uP A   Cache uP B   Memory M[X]
                                                                     0
uP A reads M[X]        Miss in X           0                         0
uP B reads M[X]        Miss in X           0            0            0
uP A writes 1 to M[X]  Invalidation of X   1                         1
uP B reads M[X]        Miss in X           1            1            1
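The event sequence in the table above can be replayed with a toy model (an assumed, heavily simplified sketch, with one bus object notifying every attached cache) of a snooping write-invalidate protocol over write-through caches:

```python
# Toy snooping write-invalidate protocol over write-through caches.

class Bus:
    def __init__(self):
        self.caches = []

    def invalidate(self, address, writer):
        # Every other cache snooping the bus drops its copy of the block.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(address, None)

class SnoopingCache:
    def __init__(self, bus, memory):
        self.bus, self.memory, self.lines = bus, memory, {}
        bus.caches.append(self)

    def read(self, address):
        if address not in self.lines:          # miss: fetch from memory
            self.lines[address] = self.memory[address]
        return self.lines[address]

    def write(self, address, value):
        self.bus.invalidate(address, self)     # broadcast an invalidation
        self.lines[address] = value
        self.memory[address] = value           # write-through

def replay():
    memory = {"X": 0}
    bus = Bus()
    a, b = SnoopingCache(bus, memory), SnoopingCache(bus, memory)
    trace = [a.read("X"), b.read("X")]         # both miss, both cache 0
    a.write("X", 1)                            # invalidates B's copy
    trace.append("X" in b.lines)               # B no longer holds the block
    trace.append(b.read("X"))                  # miss again: fetches 1
    return trace
```

`replay()` returns `[0, 0, False, 1]`, matching the table: after A's write, B's copy is gone and its next read misses, fetching the up-to-date value.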
Snooping + Write-Invalidate Protocols

- With write-back caches, snooping also has to be used on memory reads, since the cache that holds the most recent data of the block has to transfer it onto the bus;
- The access to the memory bus imposes a natural serialization of the simultaneous write operations;
- The invalidation can be optimized by using an extra bit in the cache that indicates whether that block's data is being shared or not:

Valid   Shared   Meaning
0       -        Invalid: the most recent value of that block is not present
1       1        Shared: that block is currently stored in several caches
1       0        Exclusive: currently, that block is only stored in this cache

Snooping + Write-Update Protocols

Also known as broadcast protocols:

uP Action              Bus Action       Cache uP A   Cache uP B   Memory M[X]
                                                                  0
uP A reads M[X]        Miss in X        0                         0
uP B reads M[X]        Miss in X        0            0            0
uP A writes 1 to M[X]  Broadcast of X   1            1            1
uP B reads M[X]                         1            1            1
Comparison of Protocols

- Multiple writes to a given address cause:
  - Broadcast protocol: multiple broadcasts;
  - Write-invalidate snooping protocol: only one invalidation.
- Each write to a given shared block causes:
  - Broadcast protocol: one broadcast;
  - Write-invalidate snooping protocol: only one invalidation, corresponding to the first word that is written in that block.
- The delay between a write and a subsequent read (by another processor) is smaller with the broadcast protocol.
- Invalidation protocols are by far the most used, since they require a much smaller bandwidth on the memory bus.
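The bandwidth argument behind the first comparison can be made concrete with an illustrative count (a simplified model that ignores reads and assumes the block starts out shared): for N successive writes by one processor to the same cached address, write-update broadcasts every write, while write-invalidate pays a single invalidation and then writes locally to its now-exclusive copy.

```python
# Count bus transactions for n successive writes by one processor
# to the same cached address, under the two snooping protocols.

def bus_transactions(n_writes, protocol):
    traffic = 0
    others_have_copy = True        # the block starts out shared
    for _ in range(n_writes):
        if protocol == "update":
            traffic += 1           # every write is broadcast to the sharers
        elif protocol == "invalidate":
            if others_have_copy:
                traffic += 1       # single invalidation on the first write
                others_have_copy = False
            # later writes hit the exclusive local copy: no bus traffic
    return traffic
```

For 10 writes, `bus_transactions(10, "update")` gives 10 while `bus_transactions(10, "invalidate")` gives 1, which is why invalidation protocols need far less bus bandwidth.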
Directory-Based Protocols

In a directory-based protocol, the status of each block is kept in a centralized directory.

Operations (just as before):
- Handle read misses;
- Handle writes to shared blocks;
- (Write misses correspond to these two, in sequence.)

Block status definition:
- Uncached: no processor has a copy of the cache block;
- Shared: one or more processors have the block cached, and the value in memory is up to date (as well as in all the caches);
- Exclusive: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date. That processor is called the owner of the block.
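The three block states and the two operations above can be sketched as a small state machine. This is a hypothetical, simplified model (the write-back of dirty data to memory is not modeled beyond the state change, and the message traffic is omitted):

```python
# Minimal directory entry: tracks the state, the sharers, and the owner
# of one memory block, with handlers for read misses and writes.

class DirectoryEntry:
    def __init__(self):
        self.state = "uncached"    # "uncached" | "shared" | "exclusive"
        self.sharers = set()       # processors holding a copy
        self.owner = None          # valid only in the exclusive state

    def read_miss(self, proc):
        if self.state == "exclusive":
            # The owner must supply the up-to-date block; the block
            # then becomes shared between the owner and the reader.
            self.sharers = {self.owner, proc}
            self.owner = None
        else:
            self.sharers.add(proc)
        self.state = "shared"

    def write(self, proc):
        # Invalidate all other copies; the writer becomes the owner.
        self.sharers = {proc}
        self.owner = proc
        self.state = "exclusive"
```

For instance, two read misses by P0 and P1 leave the entry shared with sharers {P0, P1}; a subsequent write by P1 invalidates P0's copy and makes P1 the exclusive owner.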
Performance of UMA Architectures

In multi-processors with a shared central memory, the contention to access the memory bus reduces the performance of each processor. In systems with write-invalidate snooping protocols:
- Increase of the number of invalidated cache positions;
- Greater miss-rate;
- Increase of the number of accesses to the central memory.
Performance of UMA Architectures

Cache misses:
- Compulsory;
- Capacity;
- Conflict;
- Coherency:
  - Real (true sharing): the word is really shared;
  - False (false sharing): miss due to simultaneous accesses by different processors to distinct words that belong to the same block.

Global performance depends on:
- Number of processors;
- Capacity of each cache;
- Cache block size.
(To be seen in the next classes.)
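False sharing can be made concrete with a simple model (an illustrative assumption: one block holds 2 words, and a write invalidates the block in every other cache). Two processors alternately writing distinct words of the same block miss on every write, while the same writes to different blocks only pay the two cold misses:

```python
# Count coherence misses when writes are invalidated at block granularity.

BLOCK_SIZE = 2                                 # words per cache block (assumed)

def count_coherence_misses(writes):
    """writes: list of (processor, word_address); returns misses per processor."""
    holder = {}                                # block -> processor holding it
    misses = {}
    for proc, word in writes:
        block = word // BLOCK_SIZE
        if holder.get(block) != proc:          # block was invalidated (or cold)
            misses[proc] = misses.get(proc, 0) + 1
        holder[block] = proc                   # writer invalidates other copies
    return misses

# Words 0 and 1 share block 0: every write misses (false sharing)...
false_sharing = count_coherence_misses([("A", 0), ("B", 1)] * 4)
# ...while words 0 and 2 live in different blocks: only the two cold misses.
no_sharing = count_coherence_misses([("A", 0), ("B", 2)] * 4)
```

With four write rounds, `false_sharing` is `{"A": 4, "B": 4}` while `no_sharing` is `{"A": 1, "B": 1}`, even though in both cases the processors never touch the same word.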
Distributed Memory Architectures

Distributed memory architecture, also known as Non-Uniform Memory Access (NUMA).
Examples: Clusters & Grids

In distributed memory architectures, it is necessary to transfer data between the several different memories. Two approaches are usually adopted to manage this transfer:
- Distributed Shared Memory (DSM): the memories are physically separated, but they are logically accessed in the same addressing space;
- Multi-computers: logically separated addressing spaces; each node is just like an independent computer, with its own resources, which are not accessed by the remaining processing nodes.
Distributed Shared Memory (DSM)

Processors share the same addressing space:
- A given physical address points to the same memory position in all the existing processors;
- Memory accesses use load and store instructions, independently of the target memory device (either local or remote);
- The access time depends on the target memory device: local or remote (NUMA).

Multi-Computers

Each processor has its own resources and addressing space, working just like an independent computer:
- It is not different from a cluster;
- Data transfer between processors requires a specific communication system to exchange messages between the processors (e.g., remote procedure call, RPC).
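The multi-computer model can be sketched in a few lines (a minimal illustration using only the standard library; the `Node` class and its message format are assumptions, not any real RPC system): each node owns a private memory, and remote data can only be obtained by exchanging explicit messages, here carried over `queue.Queue` links, rather than by ordinary load/store instructions.

```python
# Two "nodes" with private memories; a remote read is an explicit
# request/reply message exchange, not a load from a shared address space.

import queue
import threading

class Node:
    def __init__(self, name):
        self.name = name
        self.memory = {}               # private addressing space
        self.inbox = queue.Queue()     # incoming messages

    def serve_one_request(self):
        """Answer a single remote read request ("GET", address, reply_queue)."""
        tag, address, reply_to = self.inbox.get()
        assert tag == "GET"
        reply_to.put(self.memory[address])

    def remote_read(self, other, address):
        """Fetch a value from another node's memory via message passing."""
        reply = queue.Queue()
        other.inbox.put(("GET", address, reply))
        return reply.get()             # block until the answer arrives

n0, n1 = Node("n0"), Node("n1")
n1.memory["x"] = 42
server = threading.Thread(target=n1.serve_one_request)
server.start()
value = n0.remote_read(n1, "x")        # explicit send/receive, not load/store
server.join()
```

Node n0 obtains the value 42 from n1, yet n0's own memory never contains "x": the addressing spaces stay separate, which is exactly what distinguishes multi-computers from DSM.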
DSM vs Multi-Computers

Advantages of DSM:
- Ease of programming and simplification of the compiler;
- Lower communication cost when small data volumes are transferred;
- Natural use of the caches.

Advantages of Multi-Computers:
- Simpler hardware;
- Explicit communication;
- Ease of emulating DSM.

Memory Coherency in DSM

Snooping protocols are not viable! Solutions:
- Only private data is stored in cache;
- Directory-based protocols.
Memory Coherency in DSM

Implications of only caching private data:
- Reduction of the cache hit-rate;
- By software, it is possible to convert shared data into private data (by copying the block from the remote memory):
  - Simplified hardware;
  - There is little support in current compilers: left to the programmer's responsibility!
- However:
  - Very complex implementation;
  - Conservative approach: in case of doubt, the block is considered to be shared.
Memory Coherency in DSM

Implications of adopting directory-based protocols:
- Information about the whole set of shared blocks must be kept: where they are and whether they have been modified;
- Alternative: distribute the directory, in order to reduce the contention in accessing it: each processor keeps local information concerning the set of shared blocks that are stored in its memory;
- Optimization for massively parallel systems (>200 processors): only keep information about the blocks that are effectively in use.
Directory-Based Protocols

Block status definition:
- Uncached: no processor has a copy of the cache block;
- Shared: one or more processors have the block cached;
- Exclusive: exactly one processor has a copy of the cache block, and it has written the block.

Operations:
- Handle read misses;
- Handle writes to shared blocks;
- (Write misses correspond to these two, in sequence.)

New Problems in DSM Architectures

There is no common bus:
- The bus cannot be used to arbitrate (serialize) the accesses;
- The operations are no longer atomic.

The protocol is implemented with messages:
- All requests must have explicit answers.
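The last point can be illustrated with a brief sketch (the message tuples and the `home_node`/`requester` names are hypothetical, and a real protocol would also collect invalidation acknowledgments from the sharers): without a shared bus, every coherence request travels as a message, and the requester may only proceed once it receives the matching explicit answer.

```python
# Every request to the directory "home" node must get an explicit reply;
# the requester checks that the reply matches before proceeding.

def home_node(request):
    """Directory home node: every request type gets an explicit reply."""
    kind, block, requester_id = request
    if kind == "read_miss":
        return ("data_reply", block, requester_id)
    if kind == "write_request":
        # A full protocol would first send invalidations to the sharers
        # and collect their acknowledgments before answering.
        return ("write_ack", block, requester_id)
    return ("nack", block, requester_id)

def requester(kind, block, node="P3"):
    """Send a request and wait for the matching answer; None = protocol error."""
    reply = home_node((kind, block, node))
    expected = {"read_miss": "data_reply", "write_request": "write_ack"}
    return reply if reply[0] == expected.get(kind) else None
```

A read miss on block "B0" returns `("data_reply", "B0", "P3")` and a write request returns `("write_ack", "B0", "P3")`; anything unexpected yields `None`, standing in for the timeout/retry machinery a real messaging protocol needs because the operations are no longer atomic.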
Next Class

- Synchronization and Multi-Processor Systems;
- SIMD Architectures (examples):
  - Cell (STI: Sony, Toshiba, IBM);
  - GPUs (NVidia, ATI).