Ministry of Education and Science of Ukraine, Odessa I.I. Mechnikov National University


2 Modern microprocessors have one or more levels of cache on the chip. This arrangement yields high system performance, but it creates the problem of data coherence across the caches of different processors: copies of the same memory data can be held in the caches of several processors, and if one processor updates a data element in its copy, the contents of main memory and of the copies in the other processors' caches become stale. There are several methods for maintaining cache coherence; in this chapter we consider the one known as MESI (the origin of the acronym will become clear below). It is a hardware technique, usually described as the implementation of a coherence protocol. The MESI protocol is very widely used in symmetric multiprocessing (SMP) systems, in which all processors play exactly the same role (hence the name symmetric).

3 An essential precondition for the MESI protocol is that all the other processors can observe (snoop on) the signals issued by the currently "active" processor. This happens naturally with a shared system bus, where at any given moment only one module drives signals onto the bus and all the remaining modules can receive them. If the modules are connected not by a bus but by some other switching fabric, that fabric must enforce the same strict ordering of transactions for the MESI algorithm to work. The MESI protocol requires that each cache line be in one of the following four states, recorded in two additional bits per line:
M (Modified). The line has been changed by a write instruction, and the change is not yet reflected in main memory, let alone in the caches of other processors. The line is valid only in the one cache where the change was made.
E (Exclusive). The line holds the same data as the corresponding block in main memory, and it is present only in this cache and in no other.
S (Shared). The line holds the same data as the corresponding block in main memory, and it is present not only in this cache but in at least one other.
I (Invalid). The line holds stale or invalid data.
Note that the MESI acronym is composed of the letters denoting these cache-line states. Let us examine how the MESI protocol operates in an SMP system with a shared bus. The next slide presents the state-transition diagram for a cache line of the processor that initiates a change (called the active processor), together with the same diagram for any other processor that snoops changes on the shared bus (called a tracking processor).
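
To make the four states concrete, here is a minimal sketch in C of how a cache controller might encode them per line (the type and field names are illustrative, not taken from any particular implementation):

    #include <stdbool.h>

    /* The four MESI states, stored in two additional bits per line. */
    typedef enum {
        MESI_I = 0,  /* Invalid: stale or missing data                 */
        MESI_S = 1,  /* Shared: clean copy, other caches may hold it   */
        MESI_E = 2,  /* Exclusive: clean copy, no other cache holds it */
        MESI_M = 3   /* Modified: dirty copy, valid only in this cache */
    } mesi_state_t;

    /* One cache line: its state plus the address tag it covers. */
    typedef struct {
        mesi_state_t state;  /* the two MESI bits           */
        unsigned long tag;   /* identifies the memory block */
    } cache_line_t;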

4 A state-transition diagram for a cache line of a processor using the MESI protocol. The diagram shows the possible states of a single cache line (M, E, S, I) and the possible transitions between them, together with the conditions under which each transition occurs. For the active processor, the state I (invalid line) and the situation m (line not present in the cache, i.e. a miss) are merged into the single state I/m. A cache line consists of words. When executing an instruction, the processor needs access to some word and initiates a read or a write operation. Each such operation begins by checking whether the word is in the cache: if it is, the event h (hit) is raised; otherwise, the event m (miss).

5 In this way, four situations must be considered when analyzing the algorithm: Rh (read hit), Rm (read miss), Wh (write hit), Wm (write miss). Consider the case where the line is present in the cache (a hit) but is in state I (Invalid). In that case, just as when the line is absent (a miss), the line must first be read from main memory and placed into the cache, so the two cases can be treated as the single combined state I/m. Let us follow how the signals Rm, Rh, Wm, Wh are generated. Suppose the processor initiates a read operation and places the corresponding address on its address bus. The cache is searched by this address and produces h (hit) or m (miss). On a miss (m), the signal Rm is produced. On a hit (h), it is checked whether the line is in state I or in one of the valid states M, E, S: if the line is invalid (I), the signal Rm (read miss) is produced; otherwise the signal Rh (read hit) is produced. The signals Wm (write miss) and Wh (write hit) are produced in exactly the same way for writes. We now consider what the MESI algorithm does on each of the four signals Rm, Rh, Wm, Wh.
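
Continuing the sketch above, this signal-generation logic reduces to a small function; note how a hit on a line in state I is folded into the miss case, exactly matching the combined I/m state:

    typedef enum { RH, RM, WH, WM } access_event_t;

    /* Classify an access. A line found in state I counts as a miss,
       which is the merged I/m state of the diagram. */
    access_event_t classify_access(bool line_present, mesi_state_t state,
                                   bool is_write)
    {
        bool hit = line_present && (state != MESI_I);
        if (is_write)
            return hit ? WH : WM;
        else
            return hit ? RH : RM;
    }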

6 On Rm (read miss), the line must be read from main memory into the active processor's cache. But the copy of the line in main memory may be invalid if a copy in the cache of some tracking processor has been modified and is in state M (Modified). In that case the line must first be written from the tracking processor's cache to main memory and to the active processor's cache. If no tracking processor holds a modified copy, the line can be read from memory at once; in either case the states of all the copies must be set correctly. These actions are implemented as follows. On the signal Rm, the word address is transferred from the active processor's bus to the shared address bus, and at the same time a "snoop on read" signal is placed on the shared control bus (the control bus contains a dedicated line for this purpose). All the tracking processors receive these signals, and each determines whether its cache holds a copy of the line and, if so, records its state. Each tracking processor whose cache holds a copy of the line in one of the states M, E, S asserts an "in cache" signal on the shared "in another cache" control line.

7 Depending on the state of the found copy, the operation proceeds as follows. If the copy is in state M (Modified), the tracking processor asserts the "in cache" signal and blocks the active processor's read from main memory by asserting the "block read" line. At the same time it starts transferring the line from its cache over the shared data bus to main memory and to the active processor, after which it changes the line's state in its own cache from M to S (Shared). Meanwhile the active processor takes the line from the shared data bus, places it into its cache, and sets the state S: starting from I/m, it moves to S because the "in another cache" signal is asserted. Note that a copy in state M can exist in the cache of only one processor. If the copy is in state E (Exclusive), the tracking processor asserts the "in cache" signal and changes E to S, while the active processor reads the line from main memory and changes its state from I/m (invalid) to S (Shared).

8 If the copy is in state S (Shared), there may be several tracking processors holding copies of the line in state S. All of them assert the "in cache" signal and keep their lines in state S, while the active processor reads the line from main memory and changes its state from I/m to S. If no tracking processor's cache holds a copy of the line, no response appears on the shared control bus lines, and the active processor reads the line from main memory and changes its state from I/m to E.
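
Slides 6-8 together define the tracking processor's reaction to a snooped read. A compact sketch in C, reusing mesi_state_t from above (the bus helpers assert_signal and write_back_line are hypothetical placeholders, not a real API):

    /* Hypothetical bus helpers: */
    enum { SIG_IN_CACHE, SIG_BLOCK_READ };
    extern void assert_signal(int sig);
    extern void write_back_line(void);

    /* Tracking processor's response when it snoops a read (Rm) on the
       shared bus. Returns the new state of its own copy of the line. */
    mesi_state_t snoop_read(mesi_state_t my_state)
    {
        switch (my_state) {
        case MESI_M:
            assert_signal(SIG_IN_CACHE);
            assert_signal(SIG_BLOCK_READ);  /* block the memory read      */
            write_back_line();              /* supply the line to memory  */
            return MESI_S;                  /* and the requester; M -> S  */
        case MESI_E:
            assert_signal(SIG_IN_CACHE);
            return MESI_S;                  /* E -> S: line is now shared */
        case MESI_S:
            assert_signal(SIG_IN_CACHE);
            return MESI_S;                  /* stays shared               */
        default:
            return my_state;                /* I: no copy, no response    */
        }
    }

On the active side, the requester ends in state S if any "in cache" signal was observed, and in state E otherwise.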

9 On Rh (read hit), the desired word is found by the processor in its own cache; it is delivered to the processor, the cache line does not change its state, and the processor does not access the shared bus at all. On Wm (write miss), the line into which a word must be written is either absent from the cache or invalid, so the line must first be read and only then can the word be modified. The active processor starts a read operation, placing the address of the word on the bus together with a "read with intent to modify" signal on the control line. Under these signals the tracking processors determine whether their caches hold a copy of the line and, if so, record its state.

10 If one of the tracking processors holds a modified copy of the line (state M), it asserts the "block read" signal, sends the copy of its line over the shared data bus to main memory and to the active processor, and changes its state from M to I, because the active processor is about to change the line. The active processor takes the line into its cache, modifies it (performs the write), and changes its state from I/m to M (Modified). If no tracking processor holds a modified copy of the line, the read operation started by the active processor is not blocked: the active processor takes the line into its cache, modifies it, and changes its state from I/m to M. At the same time, tracking processors holding copies of the line in state S (Shared) change them from S to I; if there are none, but some tracking processor holds a copy in state E (Exclusive), it changes E to I.
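
The tracking side of the write-miss case can be sketched the same way, using the same placeholder helpers as the read-snoop sketch:

    /* Tracking processor's response when it snoops a "read with intent
       to modify" (Wm). Every valid copy must end up Invalid. */
    mesi_state_t snoop_write(mesi_state_t my_state)
    {
        switch (my_state) {
        case MESI_M:
            assert_signal(SIG_BLOCK_READ);
            write_back_line();   /* valid copy goes to memory + requester */
            return MESI_I;       /* the requester will change the line    */
        case MESI_E:
        case MESI_S:
            return MESI_I;       /* clean copies are simply invalidated   */
        default:
            return MESI_I;       /* I stays I                             */
        }
    }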

11 On Wh (write hit), the word to be changed is already in one of the lines of the active processor's cache, and what happens next depends on the current state of that line. If the line is in state S (Shared), then one or more tracking processors hold copies of it in state S, and each of them must change the state of its copy from S to I. To achieve this, the active processor announces its intention to change the line by asserting a "modification" signal on the control line. At the same time the active processor modifies the contents of the line and changes its state from S to M. If the line is in state E (exclusive copy), no other processor holds a copy of it, so the active processor immediately updates the contents of the line and changes its state from E to M. If the line is in state M, again no other processor holds a copy, and the active processor immediately updates the word in the line, leaving its state M.
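
Putting slides 9-11 together, the active processor's whole write path reduces to one switch over the line's current state (a sketch; update_word_locally, broadcast_modification and read_line_for_modify are illustrative placeholders):

    /* Hypothetical placeholders for the local update and bus actions: */
    extern void update_word_locally(void);
    extern void broadcast_modification(void);  /* "modification" signal  */
    extern void read_line_for_modify(void);    /* read w/ intent to mod. */

    /* Active processor writing a word: the Wm and Wh cases combined. */
    mesi_state_t write_word(mesi_state_t state)
    {
        switch (state) {
        case MESI_M:                      /* Wh, line already dirty      */
            update_word_locally();        /* no bus traffic at all       */
            return MESI_M;
        case MESI_E:                      /* Wh, only copy in the system */
            update_word_locally();
            return MESI_M;
        case MESI_S:                      /* Wh, other S copies exist    */
            broadcast_modification();     /* other caches go S -> I      */
            update_word_locally();
            return MESI_M;
        default:                          /* I/m: write miss             */
            read_line_for_modify();       /* snoopers react as above     */
            update_word_locally();
            return MESI_M;
        }
    }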

12 One of the most successful SMP systems using the MESI protocol is the IBM ESA/390. The system proved so convenient for users, stable, and clearly documented that it is today regarded as open. It was a high-performance system with more than ten processors and four main memory units. The IBM ESA/390 was in production for 10 years, from 1990 to 2000. A block diagram of such a complex is shown on the next slide. Internal architecture of the IBM S/390 computer system.

13 PU: the processor (a CISC microprocessor). Each PU includes a 64 KB L1 cache for instructions and data.
L2: a 384 KB second-level cache. The L2 cache units are combined in pairs into clusters; each cluster serves three processors and provides access to the entire memory address space.
BSN: a switching network adapter that connects an L2 cache unit to one of the four main memory units. Each BSN includes a 2 MB L3 cache.
Memory Card: a single-board main memory unit of 8 GB. Each unit has its own controller capable of processing requests at high speed, so the total bandwidth of the memory access channel is quadrupled.
The connection between each processor (more precisely, its L2 cache) and a particular memory unit is made through the BSN switches. Each L2 cache stores data from only half of the memory address space, so a pair of L2 cache units is used to cover the entire address space, and every processor has access to its pair of L2 caches. A BSN switch merges the four communication channels to the L2 caches into one logical trunk: a signal arriving over any of the four channels connected to the L2 caches is duplicated on the other three, which is what lets the MESI protocol keep the data in the L2 caches consistent. The slide shows performance data for the IBM S/390. The access delay figure is the time the processor needs to fetch a data item when it resides in the given block of the memory subsystem. In 89% of cases the processor finds the requested data item in its own L1 cache.

14 In the remaining 11% of cases it is necessary to go to the next cache levels or to a main memory unit. In 5% of cases the required data item is found in an L2 cache (access delay of 5 cycles), and so on. Only in 3% of cases does a main memory unit have to be accessed (access delay of 32 cycles); this figure would be twice as large (6%) without the third-level cache. In multiprocessor systems built on the SMP model there is a limit on the number of processors: a large number of processors requires high communication-subsystem bandwidth, a shared bus cannot provide it, and a switching matrix of large dimension becomes too cumbersome and expensive. It is now believed that the number of processors in an SMP system should lie in the range of 16 to 64. The desire to increase the number of processors while retaining the attractive qualities of an SMP system led to the idea of the NUMA system.
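
As a quick sanity check on these figures, the average access delay can be estimated from the hit percentages. Only the 89%/5%/3% split, the 5-cycle L2 delay and the 32-cycle memory delay come from the slides; the L1 and L3 latencies below are assumed values, used purely for illustration:

    /* Back-of-envelope average access delay for the S/390 hierarchy.
       ASSUMPTIONS: L1 = 1 cycle and L3 = 14 cycles are not given in the
       text; the 3% share left over from 11% - 5% - 3% is attributed to
       the third-level (BSN) cache. */
    double avg_access_delay_cycles(void)
    {
        return 0.89 * 1.0     /* hit in own L1 (assumed 1 cycle)  */
             + 0.05 * 5.0     /* found in an L2 cache: 5 cycles   */
             + 0.03 * 14.0    /* found in L3 (assumed 14 cycles)  */
             + 0.03 * 32.0;   /* main memory unit: 32 cycles      */
    }   /* = 0.89 + 0.25 + 0.42 + 0.96, i.e. about 2.5 cycles */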

15 A NUMA system consists of a number of nodes, each of which contains several processors and a memory unit shared by all the processors of that node. The nodes are connected to one another by a communication subsystem. Although each node has its own memory unit, there is a single global address space that encompasses the memory units of all the nodes, and each cell of global memory has a unique address, the same for all processors. If the required data block lies in the part of global memory belonging to the processor's own node, it is fetched over the node's local trunk. If the data block lies in the memory unit of some other node, a request is automatically issued to the communication subsystem, which forwards it to the appropriate local bus. The whole mechanism works automatically and is transparent to the processor making the memory access; however, the access time to a remote memory unit is much longer than the access time to the local unit. If cache memory is not used, the system is called NC-NUMA (No-Caching NUMA); if the caches are kept coherent, it is called CC-NUMA (Coherent Cache NUMA). NUMA systems have three key characteristics that all of them share and that collectively distinguish them from other multiprocessor systems:

16 There is one address space visible to all processors;
access to a remote memory unit is performed with the same read and write operations as access to a local unit;
access to a remote memory unit is slower than access to a local unit.
Since cache memory gives a large performance benefit, CC-NUMA systems are the most commonly used. However, they require a mechanism for maintaining the coherence of the data in the caches. The typical structure of a CC-NUMA system is shown on the next slide. Each node is usually an SMP system with a shared bus, and each processor has L1 and L2 cache units. Although implementations of the cache coherence mechanism differ in their details, their main common feature is that every node must hold a directory. Organization of a CC-NUMA computing system.
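
As a purely hypothetical illustration of the first characteristic, the single global address space, a global address can be thought of as splitting into a home-node number and a local offset, assuming each node contributes an equal contiguous block of the address space (the layout and names are illustrative, not taken from any specific machine):

    #include <stdint.h>

    #define NODE_MEM_BYTES (4ULL << 30)   /* assume 4 GB per node */

    typedef struct { unsigned node; uint64_t offset; } numa_loc_t;

    /* Global address -> (home node, offset within that node's memory). */
    numa_loc_t decode_global_addr(uint64_t global_addr)
    {
        numa_loc_t loc;
        loc.node   = (unsigned)(global_addr / NODE_MEM_BYTES);
        loc.offset = global_addr % NODE_MEM_BYTES;
        return loc;   /* access is local iff loc.node == own node number */
    }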

17 The directory contains information about where exactly each cache line resides and what its state is. Whenever a cache line is referenced, the information about where the line is and whether or not it has been changed is extracted from the directory. Since the directory is consulted on every instruction that accesses memory, the directory data must be held in high-speed specialized hardware capable of answering a query within a fraction of a bus cycle. One might fear that the long access time to a remote node would reduce the performance of a system with a large number of remote accesses, but there is reason to believe that this fear is unjustified. The use of two levels of cache should minimize the number of accesses to the main memory units, including those of remote nodes, and if the data used by programs is largely localized, the flow of requests to remote memory units will not be intense. Studies have shown that such locality does exist in the majority of typical applications.

18 SCI (Scalable Coherent Interface) has been adopted as the IEEE 1596 standard and is supported by most of the leading manufacturers of computer equipment. The SCI standard defines an easy-to-implement, scalable, cost-effective communication environment for interconnecting processors and memory units, whether to build a distributed network of workstations or to organize the input/output of supercomputers and high-end servers. SCI supports configurations of up to 64K nodes, with an address space of up to 2^48 bytes. We will consider SCI using the Sequent NUMA-Q 2000 system as an example and give numerical characteristics for that system, although SCI allows much larger configurations: the Sequent NUMA-Q 2000 contains up to 63 nodes, while SCI allows expansion to 64K nodes. The basic unit of the Sequent NUMA-Q 2000 is a mainboard manufactured by Intel. The board carries four Pentium Pro processors and up to 4 GB of main memory; each processor has first- and second-level caches. The processors are combined into an SMP system with a shared bus, using the MESI cache coherence protocol; the data rate on the bus is 534 MB/s, and the cache line size is 64 bytes. The board has a socket for a network controller. The system includes 63 such boards, and with the help of these controllers they are connected into a single system whose block diagram is shown on the next slide. The network controller includes: a 32 MB cache; a directory that keeps track of what is in that cache; an interface to the processor board's bus; and a chip called the information core, which connects the controller to the other controllers.

19 NUMA-Q system architecture. The information core must send the packets formed within it, possibly while simultaneously receiving other packets addressed to the node, and must also pass transit packets through itself. For transit packets that arrive while the node's own packets are being transmitted, a bypass FIFO queue is provided; for sending and receiving packets the unit has input and output FIFO queues. An address decoder determines whether a packet is destined for this node: if so, the packet is routed to the input queue, otherwise to the bypass queue. Packets are transmitted over channels consisting of 18 lines (wire pairs): one clock line, one flag line, and 16 data lines over which signals are transmitted in parallel. The channels are clocked at 500 MHz, and one 16-bit data word is transmitted with each clock pulse, so the data transfer rate is 2 bytes x 500 MHz = 1 GB/s. Signals travel through a channel in one direction only, so the channels must be joined into a ring.

20 The first word of the packet is the address (number) of the recipient node; the second word contains the flow control and command fields; the third word is the address of the sender node. The fifth, sixth, and seventh words carry a 48-bit main memory address. A packet may contain 0, 16, or 64 bytes of data. At the end of the packet is a cyclic redundancy code, which is checked as the packet is transferred.
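
The packet header described above can be sketched as a C struct of 16-bit words; the fourth word is not described in the text and is left as reserved here (field names are illustrative):

    #include <stdint.h>

    /* Sketch of the SCI packet header: 16-bit words, one per clock pulse. */
    typedef struct {
        uint16_t target_node;   /* word 1: recipient node number         */
        uint16_t flow_cmd;      /* word 2: flow control + command fields */
        uint16_t source_node;   /* word 3: sender node number            */
        uint16_t reserved;      /* word 4: not described in the text     */
        uint16_t addr_hi;       /* words 5-7: the 48-bit memory address  */
        uint16_t addr_mid;
        uint16_t addr_lo;
        /* ...followed by 0, 16, or 64 bytes of data and a CRC word */
    } sci_header_t;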

21 The system has two levels of cache consistency protocol: the SCI protocol maintains consistency across the caches of all the network controllers, while the MESI protocol maintains consistency between the processors' caches and the controller cache within each node. To give an idea of the relative sizes of the memory components, here are the numerical characteristics of the Sequent NUMA-Q 2000. The main physical memory is distributed across the nodes; each node has up to 4 GB = 2^32 bytes. The cache line size is 64 bytes = 2^6 bytes, so each node's main memory block consists of 2^32 / 2^6 = 2^26 lines. While a line is not in use, it resides in only one place, the corresponding node's main memory; once it is used, copies of the line can appear in any cache. Here we are interested only in the caches of the network controllers. To keep track of the copies of the lines, a table must be maintained that lists, for each line, the numbers (addresses) of the nodes whose caches hold a copy. Note that this list has a variable length for each line, and in a system of maximum size it may run from zero entries up to one entry per node. In addition, the SCI interface was designed for easy expansion (scalability) of the system, and of course high performance is also needed. The SCI designers found the following compromise between these conflicting requirements. First, the table is distributed among the nodes: the table in each node tracks only copies of the lines belonging to that node's own main memory block, and is called the local memory table. For a local memory block of 4 GB this table contains 2^26 rows. Second, the list of nodes holding copies of a line is implemented as a doubly linked list, whose elements are entries in the nodes' directories and whose pointers are node numbers. The use of such a list makes the system easy to scale.

22 Each node holds a directory, with one directory entry per line of the node controller's cache. For a 32 MB = 2^25 byte cache with lines of 64 bytes = 2^6 bytes there are 2^25 / 2^6 = 2^19 lines, so each directory contains 2^19 entries. If a line is present in only one cache, the local memory table of the source node (the node in whose memory the line resides) records the node that holds that copy. If the line then appears in the cache of another node, the protocol makes the source node's table point to the new directory element, which in turn points to the older one; this forms a two-element list. Every new node that obtains the same line is added at the head of the list, so all the nodes holding the line are bound into an arbitrarily long list. The next slide illustrates this process (a cache line held in three nodes simultaneously: 4, 9 and 22). The SCI protocol thus connects all holders of a given line into a doubly linked list.
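
A minimal sketch of this head insertion, modeling the forward and back pointers of each node's directory entry as arrays indexed by node number (the convention that 0 marks the end of a chain comes from the next slide; everything else is illustrative):

    enum { MAX_NODES = 64 };          /* node numbers 1..63; 0 means "none" */

    static unsigned fwd[MAX_NODES];   /* forward pointer, per node */
    static unsigned back[MAX_NODES];  /* back pointer, per node    */

    /* Add node n to the head of the sharing list of one cache line.
       *head is the node number stored in the home node's local memory
       table (0 if no controller cache currently holds the line). */
    void sci_add_sharer(unsigned *head, unsigned n)
    {
        fwd[n]  = *head;        /* new entry points at the old head  */
        back[n] = 0;            /* nothing precedes the list head    */
        if (*head != 0)
            back[*head] = n;    /* old head now points back at n     */
        *head = n;              /* home node's table now points at n */
    }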

23 Each directory entry consists of 36 bits. Six bits indicate the node that holds the previous copy of the line in the chain (the back pointer), and the following six bits indicate the node that holds the next copy in the chain (the forward pointer); zero denotes the end of a chain, which is why the maximum system size is 63 nodes rather than 64. The next 7 bits record the line's state, and the last 13 bits are a tag needed to identify the line. The SCI protocol comes in three levels of complexity. The minimal protocol allows only one copy of each line in the caches; under the protocol of medium complexity, each line can be cached in an unlimited number of nodes; the full protocol has additional facilities for increasing performance. We will consider the protocol of medium complexity. Three list operations are defined in the SCI protocol: adding a node to the list, removing a node from the list, and purging (invalidating) all nodes. Although many line states are used, we will consider only three of them: UNCACHED, SHARED, MODIFIED.
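
The entry layout might be sketched with C bit-fields as below. Note that the four fields named on the slide sum to 32 bits, so the remaining bits of the stated 36 are not accounted for in the text:

    /* Sketch of a NUMA-Q directory entry (fields as named on the slide;
       real hardware would pack the bits differently). */
    typedef struct {
        unsigned back  : 6;   /* previous node in the chain (0 = end) */
        unsigned fwd   : 6;   /* next node in the chain (0 = end)     */
        unsigned state : 7;   /* line state (UNCACHED, SHARED, ...)   */
        unsigned tag   : 13;  /* tag identifying the line             */
    } sci_dir_entry_t;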

24 The UNCACHED state indicates that the line is held in none of the network controllers' caches. SHARED means that the line is held in at least one controller cache and coincides with the corresponding main memory line. MODIFIED means that the line has been changed, so main memory holds outdated data. A read operation is performed as follows. If the processor performing the operation cannot find the required line on its own board (the requesting node), its controller sends a packet to the node whose local memory holds the line (the source node), and the line's state is examined. If the state is UNCACHED, it is changed to SHARED, the required line is taken from the source node's local memory and placed into the cache of the requesting node's controller, and at the same time a directory entry of the requesting node is included in the list. If the state is SHARED, the line is sent from the source node into the requesting node's cache, and the requesting node's directory entry is likewise added at the head of the list. If the line is in the MODIFIED state, the source node cannot supply it from its own memory, since the memory holds an invalid copy of it. Instead, the number of the node whose controller cache holds the valid copy is taken from the local memory table, and the read is directed there. At the same time the state of the line's copies is set to SHARED and the list is updated.
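
A sketch of the source (home) node's side of the read under the medium-complexity protocol, reusing sci_add_sharer from the list sketch above (the state names and the helpers send_line_from_memory and forward_read_to are illustrative placeholders):

    typedef enum { UNCACHED, SHARED_ST, MODIFIED_ST } sci_state_t;

    extern void send_line_from_memory(unsigned requester);
    extern void forward_read_to(unsigned holder, unsigned requester);

    /* Home-node handling of a read request for one of its memory lines;
       *head is that line's entry in the local memory table. */
    void sci_home_read(sci_state_t *state, unsigned *head, unsigned requester)
    {
        switch (*state) {
        case UNCACHED:
            *state = SHARED_ST;               /* first cached copy       */
            send_line_from_memory(requester); /* memory copy is valid    */
            sci_add_sharer(head, requester);
            break;
        case SHARED_ST:
            send_line_from_memory(requester); /* memory copy still valid */
            sci_add_sharer(head, requester);  /* join the head of list   */
            break;
        case MODIFIED_ST:
            /* Memory is stale: redirect the read to the current holder,
               whose number is recorded in the local memory table. */
            forward_read_to(*head, requester);
            *state = SHARED_ST;               /* copies become shared    */
            sci_add_sharer(head, requester);
            break;
        }
    }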

25 The write operation is performed as follows: if the requesting node is already in the list, it removes all the other elements so as to remain the only one; if it is not in the list, it removes all the elements and then enters the list as its sole member. The state of the line is set to MODIFIED. It is possible for several nodes to attempt incompatible operations simultaneously; for the protocol to operate correctly in such cases, a considerable number of additional states is introduced.
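
A closing sketch of that write path: the writer purges every other sharer and leaves itself as the singleton list, with the line marked MODIFIED (invalidate_copy stands in for the purge packet exchange):

    extern void invalidate_copy(unsigned node);   /* hypothetical purge */

    /* Writer w claims exclusive ownership of a line whose sharing list
       starts at *head in the home node's local memory table. */
    void sci_write(sci_state_t *state, unsigned *head, unsigned w)
    {
        unsigned n = *head;
        while (n != 0) {                /* walk the doubly linked list */
            unsigned next = fwd[n];
            if (n != w)
                invalidate_copy(n);     /* purge every other sharer    */
            n = next;
        }
        *head = w;                      /* writer is the sole element  */
        fwd[w] = back[w] = 0;
        *state = MODIFIED_ST;           /* memory copy is now stale    */
    }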
