INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing


UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Departamento de Engenharia Informática
Architectures for Embedded Computing
MEIC-A, MEIC-T, MERC
Lecture Slides, Version 3.0 - English
Lecture 11: Multiprocessors - Classification and Shared Memory Architectures
Summary: Multiprocessor classification; MIMD architectures (shared memory and distributed memory); coherency and consistency.
2010/2011 Nuno.Roma@ist.utl.pt

Architectures for Embedded Computing: Multiprocessors - Classification and Shared Memory Architectures. Prof. Nuno Roma, ACE 2010/11 - DEI-IST.

In the previous class: multiple-issue processors; superscalar processors; Very Long Instruction Word (VLIW) processors; code optimization for multiple-issue processors; multi-threading.

Today: multiprocessor classification; MIMD architectures: shared memory; distributed memory (distributed shared memory; multi-computers); memory coherency and consistency. Bibliography: Computer Architecture: A Quantitative Approach, Chapter 4.


Parallel Processing. Objectives: greater performance; efficient use of silicon resources; reduction of power consumption. Implementation: better use of the silicon space, by integrating several processors (cores) in a single chip: Chip Multiprocessor (CMP); interconnection of several independent processors (e.g.: clusters, grids, etc.). Difficulty: parallelizing the software. Example: homogeneous multi-core processor.

Classification of Multi-Processor Systems:

Type | Architecture | Management | Examples
General Purpose Processor (GPP) | Homogeneous | Hardware | Intel, AMD, IBM Power, SUN multi-core families
Dedicated Processors / Accelerators | Heterogeneous | Hardware + Software | Cell (PS3); GPUs (NVidia); FPGA/ASIC dedicated accelerators

Parallelism Levels: simultaneous execution of several sequential instruction phases: pipelining; parallel execution of the instructions of a given application in a single processor: superscalar processors and VLIWs; parallel execution on several processors in a single computer: multiprocessors; parallel execution on several computers: clusters, grids.

Multiprocessor Classes. SISD (Single Instruction, Single Data): the uniprocessor case. SIMD (Single Instruction, Multiple Data): the same instruction is executed by the several processors, but each processor operates on an independent data set: vector architectures. MISD (Multiple Instruction, Single Data): each processor executes a different instruction, but all process the same data set: there is no commercial solution of this type. MIMD (Multiple Instruction, Multiple Data): each processor executes independent instructions over an independent data set.
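The SIMD/MIMD distinction above can be sketched in a few lines of Python (an illustrative analogy, not real hardware): SIMD applies one operation across a data set, while MIMD runs independent instruction streams on independent data.

```python
import threading

# SIMD-style: one instruction (add 1) applied to multiple data elements.
data = [10, 20, 30, 40]
simd_result = [x + 1 for x in data]  # same operation, independent data

# MIMD-style: each "processor" runs its own instruction stream on its own data.
results = {}

def worker(pid, func, value):
    results[pid] = func(value)  # independent instructions, independent data

threads = [
    threading.Thread(target=worker, args=(0, lambda v: v + 1, 10)),
    threading.Thread(target=worker, args=(1, lambda v: v * 2, 20)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the joins, `simd_result` holds the uniformly transformed data, while `results` holds one distinct outcome per "processor".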

MIMD Architectures. More popular due to: greater flexibility; same components as the uniprocessors. MIMD architectures can be divided into two classes: shared memory (e.g.: multi-core processors); distributed memory (e.g.: clusters, grids, etc.).

Shared Memory. The shared memory architecture is also known as Uniform Memory Access (UMA) or Symmetric Shared-Memory Multiprocessors (SMP).

Distributed Memory. The distributed memory architecture is also known as Non-Uniform Memory Access (NUMA).

Shared vs Distributed Memory. In distributed architectures, most memory accesses are done in the local memory: this allows greater memory access bandwidth and a reduction of the access time. However: communication between processors is more complex; there is an increased access time to the data stored in the other processors' local memories.

Shared Memory: Uniform Memory Access (UMA) or Symmetric Shared-Memory Multiprocessors (SMP). MIMD Processing Example: homogeneous multi-core processor. Memory sharing: Level 1 caches (L1) - private; Level 2 caches (L2) - private (e.g.: AMD) or shared (e.g.: Intel); Level 3 cache (L3) - shared; main memory - shared.

Memory Coherency. Example, considering a write-through cache:

Time | Event | Cache uP A | Cache uP B | Memory at address X
0 | - | - | - | 1
1 | uP A reads M[X] | 1 | - | 1
2 | uP B reads M[X] | 1 | 1 | 1
3 | uP A writes 0 to M[X] | 0 | 1 | 0

In multi-processors, the migration and the replication of data are normal and expected events.

A memory system is said to be coherent when a read operation from a given memory position returns the most recent value that was written into that memory position.
Coherency: defines which values can be returned by a read; read and write access behavior to a certain memory position by a given processor.
Consistency: defines when a given written value is returned by a subsequent read; read and write access behavior to a certain memory position by several different processors (synchronization).
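The write-through trace above can be replayed directly (a minimal sketch; the dictionaries standing in for caches and memory are illustrative): without any coherence protocol, uP B keeps a stale copy after uP A's write.

```python
# Replay of the slide's trace: two private caches over one shared memory,
# write-through, no coherence protocol.
memory = {"X": 1}
cache_a, cache_b = {}, {}

# t=1: uP A reads M[X] -> cache A holds 1
cache_a["X"] = memory["X"]
# t=2: uP B reads M[X] -> cache B holds 1
cache_b["X"] = memory["X"]
# t=3: uP A writes 0 to M[X] (write-through updates memory, not cache B)
cache_a["X"] = 0
memory["X"] = 0

print(cache_b["X"])  # stale copy: still 1, while memory already holds 0
```

This is exactly the situation the coherence protocols in the following slides are designed to prevent.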

Memory Coherency. A given memory system is said to be coherent if:
A read of M[X] by P, after a write on M[X] by P, always returns the value that was written by P, provided that no writes have been done by other processors between the write and the read;
A read of M[X] by P_i, after a write on M[X] by P_j, always returns the value written by P_j, if the read and write are sufficiently separated in time and no other writes to M[X] occur between the two accesses;
Writes to the same location are serialized: two writes to the same location by any two processors are seen in the same order by all processors.

Memory Consistency. The consistency model of a memory system defines when a change to a memory position will be seen by all processors.

P1: A = 0; ... A = 1; L1: if (B == 0) ...
P2: B = 0; ... B = 1; L2: if (A == 0) ...

What happens if a given processor is allowed to proceed while the (slower) write operation is taking place (e.g.: by using write buffers)? It is possible that both P1 and P2 fail to see the most recent values of B and A before evaluating the test condition.

Sequential consistency: the program only proceeds after all processors have been informed about the write operation.
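The P1/P2 example can be made concrete with a small model of write buffers (a sketch; the `Proc` class and its buffering discipline are illustrative, not a real memory system): each processor's store sits in a private buffer, so both loads can still observe the old value 0, an outcome sequential consistency forbids.

```python
# Model: stores go to a private write buffer; loads see the processor's own
# buffer (store forwarding) or shared memory, but never another buffer.
shared = {"A": 0, "B": 0}

class Proc:
    def __init__(self):
        self.write_buffer = []

    def store(self, var, val):
        self.write_buffer.append((var, val))  # buffered, not yet visible

    def load(self, var):
        for v, val in reversed(self.write_buffer):
            if v == var:
                return val        # forward own buffered store
        return shared[var]        # otherwise read shared memory

    def drain(self):
        for var, val in self.write_buffer:
            shared[var] = val     # writes finally become visible
        self.write_buffer.clear()

p1, p2 = Proc(), Proc()
p1.store("A", 1)              # P1: A = 1 (sits in P1's buffer)
p2.store("B", 1)              # P2: B = 1 (sits in P2's buffer)
b_seen_by_p1 = p1.load("B")   # L1: reads 0
a_seen_by_p2 = p2.load("A")   # L2: reads 0
```

Both condition tests succeed at once, which is impossible if every write completed before the next access, i.e. under sequential consistency.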

Coherency Protocols. The coherency protocols keep and check the status of the shared memory blocks:
Snooping protocols: each cache has a copy of the shared block's data and of the corresponding sharing status; there is no centralized status. Each cache controller listens to the memory bus, to determine whether or not it has a copy of the block that is being requested on the bus. Two variants: write-invalidate protocols; write-update (broadcast) protocols.
Directory based protocols: the status of each shared block is kept in a centralized directory.

Snooping + Write-Invalidate Protocols. Example, considering a write-through cache:

uP Action | Bus Action | Cache uP A | Cache uP B | Memory at address X
- | - | - | - | 0
uP A reads M[X] | Miss in X | 0 | - | 0
uP B reads M[X] | Miss in X | 0 | 0 | 0
uP A writes 1 to M[X] | Invalidation of X | 1 | - | 1
uP B reads M[X] | Miss in X | 1 | 1 | 1
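The write-invalidate trace above can be simulated with a toy bus and caches (an illustrative sketch of the mechanism, with a write-through policy as in the table; the class layout is an assumption, not a real controller):

```python
# Toy write-invalidate snooping: a write broadcasts an invalidation on the
# bus and every other cache drops its copy of the block.
class Bus:
    def __init__(self):
        self.caches = []

    def invalidate(self, writer, addr):
        for c in self.caches:
            if c is not writer and addr in c.lines:
                del c.lines[addr]       # snooped invalidation

class Cache:
    def __init__(self, bus, memory):
        self.lines, self.bus, self.memory = {}, bus, memory
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:      # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.invalidate(self, addr)  # invalidate other copies first
        self.lines[addr] = value
        self.memory[addr] = value        # write-through to memory

memory = {"X": 0}
bus = Bus()
a, b = Cache(bus, memory), Cache(bus, memory)
a.read("X"); b.read("X")   # both caches hold 0
a.write("X", 1)            # B's copy is invalidated
print(b.read("X"))         # B misses and refetches the new value: 1
```

The final read by B misses, exactly as in the last row of the table, because its stale copy was dropped at the invalidation.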

Snooping + Write-Invalidate Protocols. With write-back caches, snooping also has to be used on memory reads, since the cache that holds the most recent data of the block has to transfer it onto the bus. The access to the memory bus imposes a natural serialization of simultaneous write operations. The invalidation can be optimized by using an extra bit in the cache that indicates whether that block's data is being shared:

Valid | Shared | Meaning
0 | - | Invalid: the most recent value of the block is not present
1 | 1 | Shared: the block is currently stored in several caches
1 | 0 | Exclusive: the block is currently stored only in this cache

Snooping + Write-Update Protocols. Also known as broadcast protocols:

uP Action | Bus Action | Cache uP A | Cache uP B | Memory at address X
- | - | - | - | 0
uP A reads M[X] | Miss in X | 0 | - | 0
uP B reads M[X] | Miss in X | 0 | 0 | 0
uP A writes 1 to M[X] | Broadcast of X | 1 | 1 | 1
uP B reads M[X] | - | 1 | 1 | 1

Comparison of Protocols.
Multiple writes to a given address cause: broadcast protocol: multiple broadcasts; write-invalidate snooping protocol: only one invalidation.
Each write to a given shared block causes: broadcast protocol: one broadcast; write-invalidate snooping protocol: only one invalidation, corresponding to the first word that is written in that block.
The delay between a write and a subsequent read (by another processor) is smaller with the broadcast protocol.
Invalidation protocols are by far the most used, since they require much less bandwidth on the memory bus.
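The bandwidth argument above can be illustrated by counting bus messages for a run of writes by one processor to the same shared address (a simplified accounting sketch following the slide's reasoning, not a full protocol model):

```python
# Count bus transactions for N consecutive writes by one processor to a
# shared address. Write-update broadcasts every write; write-invalidate
# pays one invalidation, after which the line is exclusive to the writer.
def bus_messages(protocol, n_writes):
    messages = 0
    other_has_copy = True           # another cache starts with a copy
    for _ in range(n_writes):
        if protocol == "update":
            messages += 1           # every write is broadcast
        elif protocol == "invalidate" and other_has_copy:
            messages += 1           # first write invalidates...
            other_has_copy = False  # ...then no more bus traffic needed
    return messages

print(bus_messages("update", 10))      # 10 broadcasts
print(bus_messages("invalidate", 10))  # 1 invalidation
```

The gap grows linearly with the number of writes, which is why invalidation protocols dominate on bandwidth-limited buses.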

Directory Based Protocols. In a directory based protocol the status of each block is kept in a centralized directory. Operations (just as before): handle read misses; handle writes to shared blocks (write misses correspond to these two, in sequence).

Block status definition:
Uncached: no processor has a copy of the cache block;
Shared: one or more processors have the block cached, and the value in memory is up to date (as well as in all the caches);
Exclusive: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date; that processor is called the owner of the block.
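A single directory entry with the three states above and the two operations the slide names can be sketched as follows (an illustrative model; the class and message prints are assumptions, not a concrete machine's protocol):

```python
# One directory entry tracking a block's state and its set of sharers.
class Directory:
    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()          # processors holding the block

    def read_miss(self, proc):
        if self.state == "Exclusive":
            # the owner must supply (and write back) the dirty block
            (owner,) = self.sharers
            print(f"fetch block from owner {owner}")
        self.sharers.add(proc)
        self.state = "Shared"

    def write(self, proc):
        # invalidate every other sharer; the writer becomes the owner
        for p in self.sharers - {proc}:
            print(f"invalidate copy at {p}")
        self.sharers = {proc}
        self.state = "Exclusive"

d = Directory()
d.read_miss("P0")   # Uncached -> Shared, sharers {P0}
d.read_miss("P1")   # Shared, sharers {P0, P1}
d.write("P0")       # invalidates P1; Exclusive, owner P0
```

A write miss is exactly the composition of the two handlers: a `read_miss` to obtain the block, followed by a `write`.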

Performance of UMA Architectures. In multi-processors with shared central memory, contention to access the memory bus reduces the performance of each processor. In systems with write-invalidate snooping protocols: increase of the number of invalidated cache positions; greater miss rate; increase of the number of accesses to the central memory.

Cache misses: compulsory; capacity; conflict; coherency:
Real: the word is really shared;
False: miss due to simultaneous accesses by different processors to distinct words that belong to the same block.

Global performance depends on: number of processors; capacity of each cache; cache block size. (To be seen in the next classes.)
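False sharing in particular deserves a concrete illustration (a toy simulation; the block size, invalidate-on-write rule, and miss accounting are simplifying assumptions): two processors write different words of the same block, yet every access misses because each write invalidates the other's copy of the whole block.

```python
# Toy coherence-miss counter: caches track whole blocks, and a write
# invalidates the block in every other cache, even if the other processor
# only ever uses a *different* word of that block.
BLOCK_SIZE = 4                        # words per block

class Cache:
    def __init__(self, peers):
        self.blocks, self.peers, self.misses = set(), peers, 0

    def access(self, addr, write=False):
        block = addr // BLOCK_SIZE
        if block not in self.blocks:  # coherence (or cold) miss
            self.misses += 1
            self.blocks.add(block)
        if write:                     # invalidate the block elsewhere
            for c in self.peers:
                if c is not self:
                    c.blocks.discard(block)

caches = []
a, b = Cache(caches), Cache(caches)
caches.extend([a, b])

for _ in range(4):
    a.access(0, write=True)   # word 0 of block 0
    b.access(1, write=True)   # word 1 of the same block: false sharing

print(a.misses + b.misses)    # every one of the 8 accesses missed
```

Placing the two words in different blocks (e.g. addresses 0 and 4) would leave only the two compulsory misses, which is the usual padding fix for false sharing.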

Distributed Memory Architecture. The distributed memory architecture is also known as Non-Uniform Memory Access (NUMA).

MIMD Processing Examples: Clusters & Grids.

Distributed Memory Architecture. In distributed memory architectures, it is necessary to transfer data between the several different memories. Two approaches are usually adopted to manage this transfer:
Distributed Shared Memory (DSM): the memories are physically separated, but they are logically accessed in the same address space;
Multi-computers: logically separated address spaces: each node is just like an independent computer, with its own resources, which are not accessed by the remaining processing nodes.

Distributed Shared Memory (DSM). Processors share the same address space: a given physical address points to the same memory position across the several existing processors; memory is accessed with load and store instructions, independently of the target memory device (either local or remote); the access time depends on the target memory device (local or remote: NUMA).

Multi-Computers. Each processor has its own resources and address space, working just like an independent computer: it is not different from a cluster; data transfer between processors requires a specific communication system to exchange messages between the processors (e.g. remote procedure call, RPC).
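The multi-computer model can be sketched with two nodes that share no address space and communicate only by messages (a minimal remote-read "RPC" analogy; the `Node` class and queue layout are illustrative assumptions, not a real messaging system):

```python
# Two nodes with private data; one node reads the other's value by sending
# a request message and waiting for the reply, RPC-style.
import queue
import threading

class Node:
    def __init__(self):
        self.local = {}                  # private address space
        self.inbox = queue.Queue()

    def serve_one(self):
        # answer one remote read request: message is (key, reply_queue)
        key, reply = self.inbox.get()
        reply.put(self.local[key])

    def remote_read(self, other, key):
        reply = queue.Queue()
        other.inbox.put((key, reply))    # send the request message
        return reply.get()               # block until the answer arrives

n0, n1 = Node(), Node()
n1.local["temp"] = 42                    # data owned by node 1 only

server = threading.Thread(target=n1.serve_one)
server.start()
value = n0.remote_read(n1, "temp")       # node 0 obtains it via messages
server.join()
print(value)  # 42
```

Note that `n0` never touches `n1.local` directly; every remote access costs a round-trip message, which is the explicit-communication property the next slide lists among the trade-offs.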

DSM vs Multi-Computers.
Advantages of DSM: easier to program and simplifies the compiler; lower communication cost when small data volumes are transferred; natural use of the caches.
Advantages of Multi-Computers: simpler hardware; explicit communication; easy to emulate DSM.

Cache Coherency in DSM. Snooping protocols are not viable! Solutions: only private data is stored in cache; directory based protocols.

Cache Coherency in DSM. Implications of only storing private data in cache: reduction of the cache hit rate. By software, it is possible to convert shared data into private data (by copying the block from the remote memory): simplified hardware; but there is little support in current compilers, so it is left to the programmer's responsibility. However: very complex implementation; conservative approach: in case of doubt, the block is considered to be shared.

Implications of adopting directory based protocols: the directory keeps information about the whole set of shared blocks: where they are and whether they have been modified.
Alternative: distribute the directory in order to reduce contention in accessing it: each processor keeps local information concerning the set of shared blocks that are stored in its memory.
Optimization for massively parallel systems (>200 processors): only keep information about the blocks that are effectively in use.

Directory Based Protocols. Block status definition: Uncached: no processor has a copy of the cache block; Shared: one or more processors have the block cached; Exclusive: exactly one processor has a copy of the cache block, and it has written the block. Operations: handle read misses; handle writes to shared blocks (write misses correspond to these two, in sequence).

New Problems in DSM Architectures. There is no common bus: the bus cannot be used to arbitrate (serialize) the accesses; the operations are no longer atomic. The protocol is implemented with messages: all requests must have explicit answers.

Next class: synchronization and multi-processor systems; SIMD architectures (examples): Cell (STI - Sony, Toshiba, IBM); GPUs (NVidia, ATI).