Ministry of Education and Science of Ukraine, Odessa I.I. Mechnikov National University


2 Modern microprocessors have one or more levels of cache on the chip. This arrangement yields high system performance, but it creates the problem of data coherence across the caches of different processors: copies of the same memory data can be held in the caches of several processors, and if one processor updates a data element in its copy, the contents of main memory and of the copies in the other processors' caches become stale. There are several methods for maintaining cache coherence; in this chapter we consider the one known as MESI (the origin of the acronym will become clear below). It is a hardware technique, usually described as the implementation of a coherence protocol. The MESI protocol is very widely used in symmetric multiprocessing (SMP) systems, in which all processors play exactly the same role (hence the name symmetric).

3 An essential precondition for the MESI protocol is that all the other processors can observe (snoop on) the signals issued by the currently "active" processor. This happens naturally with a shared system bus, where at any given moment only one module drives signals onto the bus and all the remaining modules can receive them. If the modules are connected not by a bus but by some other switching fabric, that fabric must enforce the same strict ordering of transactions for the MESI algorithm to work. The MESI protocol requires that each cache line be in one of the following four states, recorded in two additional bits per line:
M (Modified). The line has been changed by a write instruction, and the change is not yet reflected in main memory, let alone in the caches of other processors. The line is valid only in the one cache where the change was made.
E (Exclusive). The line holds the same data as the corresponding block in main memory, and it is present only in this cache and in no other.
S (Shared). The line holds the same data as the corresponding block in main memory, and it is present not only in this cache but in at least one other.
I (Invalid). The line holds stale or invalid data.
Note that the MESI acronym is composed of the letters denoting these cache-line states. Let us examine how the MESI protocol operates in an SMP system with a shared bus. The next slide presents the state-transition diagram for a cache line of the processor that initiates a change (called the active processor), together with the same diagram for any other processor that snoops changes on the shared bus (called a tracking processor).
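
To make the four states concrete, here is a minimal sketch in C of how a cache controller might encode them per line (the type and field names are illustrative, not taken from any particular implementation):

    #include <stdbool.h>

    /* The four MESI states, stored in two additional bits per line. */
    typedef enum {
        MESI_I = 0,  /* Invalid: stale or missing data                 */
        MESI_S = 1,  /* Shared: clean copy, other caches may hold it   */
        MESI_E = 2,  /* Exclusive: clean copy, no other cache holds it */
        MESI_M = 3   /* Modified: dirty copy, valid only in this cache */
    } mesi_state_t;

    /* One cache line: its state plus the address tag it covers. */
    typedef struct {
        mesi_state_t state;  /* the two MESI bits           */
        unsigned long tag;   /* identifies the memory block */
    } cache_line_t;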

4 A state-transition diagram for a cache line of a processor using the MESI protocol. The diagram shows the possible states of a single cache line (M, E, S, I) and the possible transitions between them, together with the conditions under which each transition occurs. For the active processor, the state I (invalid line) and the situation m (line not present in the cache, i.e. a miss) are merged into the single state I/m. A cache line consists of words. When executing an instruction, the processor needs access to some word and initiates a read or a write operation. Each such operation begins by checking whether the word is in the cache: if it is, the event h (hit) is raised; otherwise, the event m (miss).

5 In this way, four situations must be considered when analyzing the algorithm: Rh (read hit), Rm (read miss), Wh (write hit), Wm (write miss). Consider the case where the line is present in the cache (a hit) but is in state I (Invalid). In that case, just as when the line is absent (a miss), the line must first be read from main memory and placed into the cache, so the two cases can be treated as the single combined state I/m. Let us follow how the signals Rm, Rh, Wm, Wh are generated. Suppose the processor initiates a read operation and places the corresponding address on its address bus. The cache is searched by this address and produces h (hit) or m (miss). On a miss (m), the signal Rm is produced. On a hit (h), it is checked whether the line is in state I or in one of the valid states M, E, S: if the line is invalid (I), the signal Rm (read miss) is produced; otherwise the signal Rh (read hit) is produced. The signals Wm (write miss) and Wh (write hit) are produced in exactly the same way for writes. We now consider what the MESI algorithm does on each of the four signals Rm, Rh, Wm, Wh.
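
Continuing the sketch above, this signal-generation logic reduces to a small function; note how a hit on a line in state I is folded into the miss case, exactly matching the combined I/m state:

    typedef enum { RH, RM, WH, WM } access_event_t;

    /* Classify an access. A line found in state I counts as a miss,
       which is the merged I/m state of the diagram. */
    access_event_t classify_access(bool line_present, mesi_state_t state,
                                   bool is_write)
    {
        bool hit = line_present && (state != MESI_I);
        if (is_write)
            return hit ? WH : WM;
        else
            return hit ? RH : RM;
    }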

6 On Rm (read miss), the line must be read from main memory into the active processor's cache. But the copy of the line in main memory may be invalid if a copy in the cache of some tracking processor has been modified and is in state M (Modified). In that case the line must first be written from the tracking processor's cache to main memory and to the active processor's cache. If no tracking processor holds a modified copy, the line can be read from memory at once; in either case the states of all the copies must be set correctly. These actions are implemented as follows. On the signal Rm, the word address is transferred from the active processor's bus to the shared address bus, and at the same time a "snoop on read" signal is placed on the shared control bus (the control bus contains a dedicated line for this purpose). All the tracking processors receive these signals, and each determines whether its cache holds a copy of the line and, if so, records its state. Each tracking processor whose cache holds a copy of the line in one of the states M, E, S asserts an "in cache" signal on the shared "in another cache" control line.

7 Depending on the state of the found copy, the operation proceeds as follows. If the copy is in state M (Modified), the tracking processor asserts the "in cache" signal and blocks the active processor's read from main memory by asserting the "block read" line. At the same time it starts transferring the line from its cache over the shared data bus to main memory and to the active processor, after which it changes the line's state in its own cache from M to S (Shared). Meanwhile the active processor takes the line from the shared data bus, places it into its cache, and sets the state S: starting from I/m, it moves to S because the "in another cache" signal is asserted. Note that a copy in state M can exist in the cache of only one processor. If the copy is in state E (Exclusive), the tracking processor asserts the "in cache" signal and changes E to S, while the active processor reads the line from main memory and changes its state from I/m (invalid) to S (Shared).

8 If the copy is in state S (Shared), there may be several tracking processors holding copies of the line in state S. All of them assert the "in cache" signal and keep their lines in state S, while the active processor reads the line from main memory and changes its state from I/m to S. If no tracking processor's cache holds a copy of the line, no response appears on the shared control bus lines, and the active processor reads the line from main memory and changes its state from I/m to E.
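
Slides 6-8 together define the tracking processor's reaction to a snooped read. A compact sketch in C, reusing mesi_state_t from above (the bus helpers assert_signal and write_back_line are hypothetical placeholders, not a real API):

    /* Hypothetical bus helpers: */
    enum { SIG_IN_CACHE, SIG_BLOCK_READ };
    extern void assert_signal(int sig);
    extern void write_back_line(void);

    /* Tracking processor's response when it snoops a read (Rm) on the
       shared bus. Returns the new state of its own copy of the line. */
    mesi_state_t snoop_read(mesi_state_t my_state)
    {
        switch (my_state) {
        case MESI_M:
            assert_signal(SIG_IN_CACHE);
            assert_signal(SIG_BLOCK_READ);  /* block the memory read      */
            write_back_line();              /* supply the line to memory  */
            return MESI_S;                  /* and the requester; M -> S  */
        case MESI_E:
            assert_signal(SIG_IN_CACHE);
            return MESI_S;                  /* E -> S: line is now shared */
        case MESI_S:
            assert_signal(SIG_IN_CACHE);
            return MESI_S;                  /* stays shared               */
        default:
            return my_state;                /* I: no copy, no response    */
        }
    }

On the active side, the requester ends in state S if any "in cache" signal was observed, and in state E otherwise.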

9 On Rh (read hit), the desired word is found by the processor in its own cache; it is delivered to the processor, the cache line does not change its state, and the processor does not access the shared bus at all. On Wm (write miss), the line into which a word must be written is either absent from the cache or invalid, so the line must first be read and only then can the word be modified. The active processor starts a read operation, placing the address of the word on the bus together with a "read with intent to modify" signal on the control line. Under these signals the tracking processors determine whether their caches hold a copy of the line and, if so, record its state.

10 If one of the tracking processors holds a modified copy of the line (state M), it asserts the "block read" signal, sends the copy of its line over the shared data bus to main memory and to the active processor, and changes its state from M to I, because the active processor is about to change the line. The active processor takes the line into its cache, modifies it (performs the write), and changes its state from I/m to M (Modified). If no tracking processor holds a modified copy of the line, the read operation started by the active processor is not blocked: the active processor takes the line into its cache, modifies it, and changes its state from I/m to M. At the same time, tracking processors holding copies of the line in state S (Shared) change them from S to I; if there are none, but some tracking processor holds a copy in state E (Exclusive), it changes E to I.
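
The tracking side of the write-miss case can be sketched the same way, using the same placeholder helpers as the read-snoop sketch:

    /* Tracking processor's response when it snoops a "read with intent
       to modify" (Wm). Every valid copy must end up Invalid. */
    mesi_state_t snoop_write(mesi_state_t my_state)
    {
        switch (my_state) {
        case MESI_M:
            assert_signal(SIG_BLOCK_READ);
            write_back_line();   /* valid copy goes to memory + requester */
            return MESI_I;       /* the requester will change the line    */
        case MESI_E:
        case MESI_S:
            return MESI_I;       /* clean copies are simply invalidated   */
        default:
            return MESI_I;       /* I stays I                             */
        }
    }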

11 On Wh (write hit), the word to be changed is already in one of the lines of the active processor's cache, and what happens next depends on the current state of that line. If the line is in state S (Shared), then one or more tracking processors hold copies of it in state S, and each of them must change the state of its copy from S to I. To achieve this, the active processor announces its intention to change the line by asserting a "modification" signal on the control line. At the same time the active processor modifies the contents of the line and changes its state from S to M. If the line is in state E (exclusive copy), no other processor holds a copy of it, so the active processor immediately updates the contents of the line and changes its state from E to M. If the line is in state M, again no other processor holds a copy, and the active processor immediately updates the word in the line, leaving its state M.
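
Putting slides 9-11 together, the active processor's whole write path reduces to one switch over the line's current state (a sketch; update_word_locally, broadcast_modification and read_line_for_modify are illustrative placeholders):

    /* Hypothetical placeholders for the local update and bus actions: */
    extern void update_word_locally(void);
    extern void broadcast_modification(void);  /* "modification" signal  */
    extern void read_line_for_modify(void);    /* read w/ intent to mod. */

    /* Active processor writing a word: the Wm and Wh cases combined. */
    mesi_state_t write_word(mesi_state_t state)
    {
        switch (state) {
        case MESI_M:                      /* Wh, line already dirty      */
            update_word_locally();        /* no bus traffic at all       */
            return MESI_M;
        case MESI_E:                      /* Wh, only copy in the system */
            update_word_locally();
            return MESI_M;
        case MESI_S:                      /* Wh, other S copies exist    */
            broadcast_modification();     /* other caches go S -> I      */
            update_word_locally();
            return MESI_M;
        default:                          /* I/m: write miss             */
            read_line_for_modify();       /* snoopers react as above     */
            update_word_locally();
            return MESI_M;
        }
    }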

12 One of the most successful SMP systems using the MESI protocol is the IBM ESA/390. The system proved so convenient for users, stable, and clearly documented that it is today regarded as open. It was a high-performance system with more than ten processors and four main memory units. The IBM ESA/390 was in production for 10 years, from 1990 to 2000. A block diagram of such a complex is shown on the next slide. Internal architecture of the IBM S/390 computer system.

13 PU: the processor (a CISC microprocessor). Each PU includes a 64 KB L1 cache for instructions and data.
L2: a 384 KB second-level cache. The L2 cache units are combined in pairs into clusters; each cluster serves three processors and provides access to the entire memory address space.
BSN: a switching network adapter that connects an L2 cache unit to one of the four main memory units. Each BSN includes a 2 MB L3 cache.
Memory Card: a single-board main memory unit of 8 GB. Each unit has its own controller capable of processing requests at high speed, so the total bandwidth of the memory access channel is quadrupled.
The connection between each processor (more precisely, its L2 cache) and a particular memory unit is made through the BSN switches. Each L2 cache stores data from only half of the memory address space, so a pair of L2 cache units is used to cover the entire address space, and every processor has access to its pair of L2 caches. A BSN switch merges the four communication channels to the L2 caches into one logical trunk: a signal arriving over any of the four channels connected to the L2 caches is duplicated on the other three, which is what lets the MESI protocol keep the data in the L2 caches consistent. The slide shows performance data for the IBM S/390. The access delay figure is the time the processor needs to fetch a data item when it resides in the given block of the memory subsystem. In 89% of cases the processor finds the requested data item in its own L1 cache.

14 In the remaining 11% of cases it is necessary to go to the next cache levels or to a main memory unit. In 5% of cases the required data item is found in an L2 cache (access delay of 5 cycles), and so on. Only in 3% of cases does a main memory unit have to be accessed (access delay of 32 cycles); this figure would be twice as large (6%) without the third-level cache. In multiprocessor systems built on the SMP model there is a limit on the number of processors: a large number of processors requires high communication-subsystem bandwidth, a shared bus cannot provide it, and a switching matrix of large dimension becomes too cumbersome and expensive. It is now believed that the number of processors in an SMP system should lie in the range of 16 to 64. The desire to increase the number of processors while retaining the attractive qualities of an SMP system led to the idea of the NUMA system.
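
As a quick sanity check on these figures, the average access delay can be estimated from the hit percentages. Only the 89%/5%/3% split, the 5-cycle L2 delay and the 32-cycle memory delay come from the slides; the L1 and L3 latencies below are assumed values, used purely for illustration:

    /* Back-of-envelope average access delay for the S/390 hierarchy.
       ASSUMPTIONS: L1 = 1 cycle and L3 = 14 cycles are not given in the
       text; the 3% share left over from 11% - 5% - 3% is attributed to
       the third-level (BSN) cache. */
    double avg_access_delay_cycles(void)
    {
        return 0.89 * 1.0     /* hit in own L1 (assumed 1 cycle)  */
             + 0.05 * 5.0     /* found in an L2 cache: 5 cycles   */
             + 0.03 * 14.0    /* found in L3 (assumed 14 cycles)  */
             + 0.03 * 32.0;   /* main memory unit: 32 cycles      */
    }   /* = 0.89 + 0.25 + 0.42 + 0.96, i.e. about 2.5 cycles */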

15 A NUMA system consists of a number of nodes, each of which contains several processors and a memory unit shared by all the processors of that node. The nodes are connected to one another by a communication subsystem. Although each node has its own memory unit, there is a single global address space that encompasses the memory units of all the nodes, and each cell of global memory has a unique address, the same for all processors. If the required data block lies in the part of global memory belonging to the processor's own node, it is fetched over the node's local trunk. If the data block lies in the memory unit of some other node, a request is automatically issued to the communication subsystem, which forwards it to the appropriate local bus. The whole mechanism works automatically and is transparent to the processor making the memory access; however, the access time to a remote memory unit is much longer than the access time to the local unit. If cache memory is not used, the system is called NC-NUMA (No-Caching NUMA); if the caches are kept coherent, it is called CC-NUMA (Coherent Cache NUMA). NUMA systems have three key characteristics that all of them share and that collectively distinguish them from other multiprocessor systems:

16 There is one address space visible to all processors;
access to a remote memory unit is performed with the same read and write operations as access to a local unit;
access to a remote memory unit is slower than access to a local unit.
Since cache memory gives a large performance benefit, CC-NUMA systems are the most commonly used. However, they require a mechanism for maintaining the coherence of the data in the caches. The typical structure of a CC-NUMA system is shown on the next slide. Each node is usually an SMP system with a shared bus, and each processor has L1 and L2 cache units. Although implementations of the cache coherence mechanism differ in their details, their main common feature is that every node must hold a directory. Organization of a CC-NUMA computing system.
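
As a purely hypothetical illustration of the first characteristic, the single global address space, a global address can be thought of as splitting into a home-node number and a local offset, assuming each node contributes an equal contiguous block of the address space (the layout and names are illustrative, not taken from any specific machine):

    #include <stdint.h>

    #define NODE_MEM_BYTES (4ULL << 30)   /* assume 4 GB per node */

    typedef struct { unsigned node; uint64_t offset; } numa_loc_t;

    /* Global address -> (home node, offset within that node's memory). */
    numa_loc_t decode_global_addr(uint64_t global_addr)
    {
        numa_loc_t loc;
        loc.node   = (unsigned)(global_addr / NODE_MEM_BYTES);
        loc.offset = global_addr % NODE_MEM_BYTES;
        return loc;   /* access is local iff loc.node == own node number */
    }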

17 The directory contains information about where exactly each cache line resides and what its state is. Whenever a cache line is referenced, the information about where the line is and whether or not it has been changed is extracted from the directory. Since the directory is consulted on every instruction that accesses memory, the directory data must be held in high-speed specialized hardware capable of answering a query within a fraction of a bus cycle. One might fear that the long access time to a remote node would reduce the performance of a system with a large number of remote accesses, but there is reason to believe that this fear is unjustified. The use of two levels of cache should minimize the number of accesses to the main memory units, including those of remote nodes, and if the data used by programs is largely localized, the flow of requests to remote memory units will not be intense. Studies have shown that such locality does exist in the majority of typical applications.

18 SCI (Scalable Coherent Interface) has been adopted as the IEEE 1596 standard and is supported by most of the leading manufacturers of computer equipment. The SCI standard defines an easy-to-implement, scalable, cost-effective communication environment for interconnecting processors and memory units, whether to build a distributed network of workstations or to organize the input/output of supercomputers and high-end servers. SCI supports configurations of up to 64K nodes, with an address space of up to 2^48 bytes. We will consider SCI using the Sequent NUMA-Q 2000 system as an example and give numerical characteristics for that system, although SCI allows much larger configurations: the Sequent NUMA-Q 2000 contains up to 63 nodes, while SCI allows expansion to 64K nodes. The basic unit of the Sequent NUMA-Q 2000 is a mainboard manufactured by Intel. The board carries four Pentium Pro processors and up to 4 GB of main memory; each processor has first- and second-level caches. The processors are combined into an SMP system with a shared bus, using the MESI cache coherence protocol; the data rate on the bus is 534 MB/s, and the cache line size is 64 bytes. The board has a socket for a network controller. The system includes 63 such boards, and with the help of these controllers they are connected into a single system whose block diagram is shown on the next slide. The network controller includes: a 32 MB cache; a directory that keeps track of what is in that cache; an interface to the processor board's bus; and a chip called the information core, which connects the controller to the other controllers.

19 NUMA-Q system architecture. The information core must send the packets formed within it, possibly while simultaneously receiving other packets addressed to the node, and must also pass transit packets through itself. For transit packets that arrive while the node's own packets are being transmitted, a bypass FIFO queue is provided; for sending and receiving packets the unit has input and output FIFO queues. An address decoder determines whether a packet is destined for this node: if so, the packet is routed to the input queue, otherwise to the bypass queue. Packets are transmitted over channels consisting of 18 lines (wire pairs): one clock line, one flag line, and 16 data lines over which signals are transmitted in parallel. The channels are clocked at 500 MHz, and one 16-bit data word is transmitted with each clock pulse, so the data transfer rate is 2 bytes x 500 MHz = 1 GB/s. Signals travel through a channel in one direction only, so the channels must be joined into a ring.

20 The first word of the packet is the address (number) of the recipient node; the second word contains the flow control and command fields; the third word is the address of the sender node. The fifth, sixth, and seventh words carry a 48-bit main memory address. A packet may contain 0, 16, or 64 bytes of data. At the end of the packet is a cyclic redundancy code, which is checked as the packet is transferred.
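
The packet header described above can be sketched as a C struct of 16-bit words; the fourth word is not described in the text and is left as reserved here (field names are illustrative):

    #include <stdint.h>

    /* Sketch of the SCI packet header: 16-bit words, one per clock pulse. */
    typedef struct {
        uint16_t target_node;   /* word 1: recipient node number         */
        uint16_t flow_cmd;      /* word 2: flow control + command fields */
        uint16_t source_node;   /* word 3: sender node number            */
        uint16_t reserved;      /* word 4: not described in the text     */
        uint16_t addr_hi;       /* words 5-7: the 48-bit memory address  */
        uint16_t addr_mid;
        uint16_t addr_lo;
        /* ...followed by 0, 16, or 64 bytes of data and a CRC word */
    } sci_header_t;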

21 The system has two levels of cache consistency protocol: the SCI protocol maintains consistency across the caches of all the network controllers, while the MESI protocol maintains consistency between the processors' caches and the controller cache within each node. To give an idea of the relative sizes of the memory components, here are the numerical characteristics of the Sequent NUMA-Q 2000. The main physical memory is distributed across the nodes; each node has up to 4 GB = 2^32 bytes. The cache line size is 64 bytes = 2^6 bytes, so each node's main memory block consists of 2^32 / 2^6 = 2^26 lines. While a line is not in use, it resides in only one place, the corresponding node's main memory; once it is used, copies of the line can appear in any cache. Here we are interested only in the caches of the network controllers. To keep track of the copies of the lines, a table must be maintained that lists, for each line, the numbers (addresses) of the nodes whose caches hold a copy. Note that this list has a variable length for each line, and in a system of maximum size it may run from zero entries up to one entry per node. In addition, the SCI interface was designed for easy expansion (scalability) of the system, and of course high performance is also needed. The SCI designers found the following compromise between these conflicting requirements. First, the table is distributed among the nodes: the table in each node tracks only copies of the lines belonging to that node's own main memory block, and is called the local memory table. For a local memory block of 4 GB this table contains 2^26 rows. Second, the list of nodes holding copies of a line is implemented as a doubly linked list, whose elements are entries in the nodes' directories and whose pointers are node numbers. The use of such a list makes the system easy to scale.

22 Each node holds a directory, with one directory entry per line of the node controller's cache. For a 32 MB = 2^25 byte cache with lines of 64 bytes = 2^6 bytes there are 2^25 / 2^6 = 2^19 lines, so each directory contains 2^19 entries. If a line is present in only one cache, the local memory table of the source node (the node in whose memory the line resides) records the node that holds that copy. If the line then appears in the cache of another node, the protocol makes the source node's table point to the new directory element, which in turn points to the older one; this forms a two-element list. Every new node that obtains the same line is added at the head of the list, so all the nodes holding the line are bound into an arbitrarily long list. The next slide illustrates this process (a cache line held in three nodes simultaneously: 4, 9 and 22). The SCI protocol thus connects all holders of a given line into a doubly linked list.
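
A minimal sketch of this head insertion, modeling the forward and back pointers of each node's directory entry as arrays indexed by node number (the convention that 0 marks the end of a chain comes from the next slide; everything else is illustrative):

    enum { MAX_NODES = 64 };          /* node numbers 1..63; 0 means "none" */

    static unsigned fwd[MAX_NODES];   /* forward pointer, per node */
    static unsigned back[MAX_NODES];  /* back pointer, per node    */

    /* Add node n to the head of the sharing list of one cache line.
       *head is the node number stored in the home node's local memory
       table (0 if no controller cache currently holds the line). */
    void sci_add_sharer(unsigned *head, unsigned n)
    {
        fwd[n]  = *head;        /* new entry points at the old head  */
        back[n] = 0;            /* nothing precedes the list head    */
        if (*head != 0)
            back[*head] = n;    /* old head now points back at n     */
        *head = n;              /* home node's table now points at n */
    }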

23 Each directory entry consists of 36 bits. Six bits indicate the node that holds the previous copy of the line in the chain (the back pointer), and the following six bits indicate the node that holds the next copy in the chain (the forward pointer); zero denotes the end of a chain, which is why the maximum system size is 63 nodes rather than 64. The next 7 bits record the line's state, and the last 13 bits are a tag needed to identify the line. The SCI protocol comes in three levels of complexity. The minimal protocol allows only one copy of each line in the caches; under the protocol of medium complexity, each line can be cached in an unlimited number of nodes; the full protocol has additional facilities for increasing performance. We will consider the protocol of medium complexity. Three list operations are defined in the SCI protocol: adding a node to the list, removing a node from the list, and purging (invalidating) all nodes. Although many line states are used, we will consider only three of them: UNCACHED, SHARED, MODIFIED.
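
The entry layout might be sketched with C bit-fields as below. Note that the four fields named on the slide sum to 32 bits, so the remaining bits of the stated 36 are not accounted for in the text:

    /* Sketch of a NUMA-Q directory entry (fields as named on the slide;
       real hardware would pack the bits differently). */
    typedef struct {
        unsigned back  : 6;   /* previous node in the chain (0 = end) */
        unsigned fwd   : 6;   /* next node in the chain (0 = end)     */
        unsigned state : 7;   /* line state (UNCACHED, SHARED, ...)   */
        unsigned tag   : 13;  /* tag identifying the line             */
    } sci_dir_entry_t;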

24 The UNCACHED state indicates that the line is held in none of the network controllers' caches. SHARED means that the line is held in at least one controller cache and coincides with the corresponding main memory line. MODIFIED means that the line has been changed, so main memory holds outdated data. A read operation is performed as follows. If the processor performing the operation cannot find the required line on its own board (the requesting node), its controller sends a packet to the node whose local memory holds the line (the source node), and the line's state is examined. If the state is UNCACHED, it is changed to SHARED, the required line is taken from the source node's local memory and placed into the cache of the requesting node's controller, and at the same time a directory entry of the requesting node is included in the list. If the state is SHARED, the line is sent from the source node into the requesting node's cache, and the requesting node's directory entry is likewise added at the head of the list. If the line is in the MODIFIED state, the source node cannot supply it from its own memory, since the memory holds an invalid copy of it. Instead, the number of the node whose controller cache holds the valid copy is taken from the local memory table, and the read is directed there. At the same time the state of the line's copies is set to SHARED and the list is updated.
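
A sketch of the source (home) node's side of the read under the medium-complexity protocol, reusing sci_add_sharer from the list sketch above (the state names and the helpers send_line_from_memory and forward_read_to are illustrative placeholders):

    typedef enum { UNCACHED, SHARED_ST, MODIFIED_ST } sci_state_t;

    extern void send_line_from_memory(unsigned requester);
    extern void forward_read_to(unsigned holder, unsigned requester);

    /* Home-node handling of a read request for one of its memory lines;
       *head is that line's entry in the local memory table. */
    void sci_home_read(sci_state_t *state, unsigned *head, unsigned requester)
    {
        switch (*state) {
        case UNCACHED:
            *state = SHARED_ST;               /* first cached copy       */
            send_line_from_memory(requester); /* memory copy is valid    */
            sci_add_sharer(head, requester);
            break;
        case SHARED_ST:
            send_line_from_memory(requester); /* memory copy still valid */
            sci_add_sharer(head, requester);  /* join the head of list   */
            break;
        case MODIFIED_ST:
            /* Memory is stale: redirect the read to the current holder,
               whose number is recorded in the local memory table. */
            forward_read_to(*head, requester);
            *state = SHARED_ST;               /* copies become shared    */
            sci_add_sharer(head, requester);
            break;
        }
    }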

25 The write operation is performed as follows: if the requesting node is already in the list, it removes all the other elements so as to remain the only one; if it is not in the list, it removes all the elements and then enters the list as its sole member. The state of the line is set to MODIFIED. It is possible for several nodes to attempt incompatible operations simultaneously; for the protocol to operate correctly in such cases, a considerable number of additional states is introduced.
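
A closing sketch of that write path: the writer purges every other sharer and leaves itself as the singleton list, with the line marked MODIFIED (invalidate_copy stands in for the purge packet exchange):

    extern void invalidate_copy(unsigned node);   /* hypothetical purge */

    /* Writer w claims exclusive ownership of a line whose sharing list
       starts at *head in the home node's local memory table. */
    void sci_write(sci_state_t *state, unsigned *head, unsigned w)
    {
        unsigned n = *head;
        while (n != 0) {                /* walk the doubly linked list */
            unsigned next = fwd[n];
            if (n != w)
                invalidate_copy(n);     /* purge every other sharer    */
            n = next;
        }
        *head = w;                      /* writer is the sole element  */
        fwd[w] = back[w] = 0;
        *state = MODIFIED_ST;           /* memory copy is now stale    */
    }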
