Advances in Computer Architecture Lab Report


H.M.F. Lohstroh, February 26, 2011

Abstract

As a lab exercise of the Advances in Computer Architecture course, the task was given to build a discrete event simulation using SystemC that models the transactions between a number of processors, their private L1 D-caches and a shared main memory. A simple VALID-INVALID bus-snooping cache coherency protocol and the slightly more sophisticated optimized MOESI protocol found in AMD Athlon processors were implemented, and their performance was compared. The different implementations are discussed in detail, as are the obtained results.

1 Introduction

In computer architecture, memory is organized hierarchically. The nearer to the processor, the smaller and faster the components. Together, these components allow data to flow through the system. Most essential are the registers in the processor, which store the operands and results of all computations. Although a program is typically stored on non-volatile storage such as a hard disk, before execution it is first loaded into main memory, which is much faster, but still significantly slower than the registers. As an intermediate step in the hierarchy, caches were invented to avoid the penalties of memory access by keeping the most recently used data and delivering it much faster to the processor. The first processor to place a cache directly onto the processor die was Intel's 486. Released in 1989, it contained 8 KB of cache. Since then, both processor and memory cycle times have been decreasing exponentially over time, but processor clock frequency has scaled much faster than memory latency. Due to this growing disparity between processor and memory speed, memory tends to be an overwhelming bottleneck in performance. Hence, in subsequent designs more and more area on the processor die has been dedicated to the purpose of caching.
In fact, modern designs often feature a multitude of cache levels, where one part (L1, L2) resides on the die and another part (L3) is off-chip. Together with an ongoing miniaturization that allows for a denser storage capacity, modern processors typically comprise several megabytes of cache. Given regular access patterns, which are typical in most computer programs, caches exploit the benefits of spatial and temporal locality of the data in memory. The idea is simple and effective: each memory request goes through the cache

and is only forwarded to the main memory (or a higher level cache) if the requested data is not present in that cache. Matters become more difficult when multiple processors with local caches are incorporated in a shared memory system. If caches keep private copies of shared data while being unaware of the state of the copies in other caches, undefined behavior results. Therefore, a cache coherence protocol is required to maintain consistency of the data stored in local caches. These protocols consist of a set of cache line state transitions that can be elegantly captured in a diagram. Nevertheless, it is difficult to get a grasp of the actual behavior they describe, let alone prove their correctness. The aim of this project was to implement two different cache coherency protocols in an event driven simulation using SystemC, which is a system design language based on C++. SystemC uses processes to model concurrency in a cooperative multi-tasking fashion, i.e., the concurrency is not pre-emptive. Each process in the simulator executes a small chunk of code and then voluntarily releases control to let another process, chosen at random, execute in the same simulated time span. Such a time slice is referred to as a delta cycle. Events are propagated through the system by means of ports and signals, which can be used to transfer data or cause interrupts. All modules are sensitive to a shared clock, which means that each module is woken up once every clock cycle to do a share of the work. After implementing a single processor simulation, a multiprocessor simulation was implemented, first with a snoopy write-invalidate protocol and then with a slightly more sophisticated protocol known as MOESI, which features additional states that allow sharing of cache lines among multiple caches even if they have already been written to. All three implementations are explained in detail, after which the obtained results are discussed.
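The cooperative delta-cycle model described above can be illustrated with a plain C++ sketch (no SystemC dependency; the function and type names are purely illustrative, not part of SystemC or the report's code):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <random>
#include <vector>

// Cooperative (non-preemptive) scheduling: each process runs one chunk
// of work and voluntarily returns. Within one delta cycle, all runnable
// processes execute exactly once, in an arbitrary (here: random) order.
using Process = std::function<void()>;

void delta_cycle(std::vector<Process>& procs, std::mt19937& rng) {
    std::vector<size_t> order(procs.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::shuffle(order.begin(), order.end(), rng);  // order is up to chance
    for (size_t i : order) procs[i]();              // each runs one chunk
}
```

Note that regardless of the shuffled order, every process runs exactly once per delta cycle, which is what makes the simulated concurrency repeatable in aggregate.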
2 Single processor system

The single core simulation uses a 32 KB 8-way set-associative L1 D-cache with a 32-byte line size. The associativity of the cache means that the address space is mapped onto sets of 8 slots based on only part of the full 32-bit address. The given parameters determine how the address should be interpreted. Assuming that reads and writes are byte-addressed, a 32-byte cache line requires log(32)/log(2) = 5 bits to determine the offset. Having 32 KB of cache in total, it must consist of 32768 / 32 = 1024 slots, subdivided over 1024 / 8 = 128 sets. Hence, to determine the set, another log(128)/log(2) = 7 bits are needed. Finally, the remaining 32 - (5+7) = 20 bits denote the tag. In the implementation these numbers are calculated using defines, i.e., the properties of the cache can be conveniently changed at compile time.

Figure 1: Interpretation of a 32-bit address (tag, set, offset).

The actual cache lines are given shape in a C-struct holding two Booleans to keep track of whether a line is dirty and/or valid, along with a timestamp

that is updated each time a line is read or written, and finally, of course, the tag that is needed to calculate the original address in memory. A least-recently-used, write-back replacement strategy was implemented, which in practice means that upon a read or write, after looking up the set, an iteration starts over the set members to find a matching tag. In case of a cache miss, a new line has to be allocated. If all lines in the designated set are occupied, the least recently used cache line is evicted. If the evicted line is dirty, it has to be written to memory first, as the write-back scheme defers memory access until eviction. Less obvious, but of equal importance, is that upon a write miss it does not suffice to simply evict a cache line and place the newly written value into the cache. Since the processor reads and writes 32-bit words and the cache stores 32-byte lines, the remaining 28 bytes must also be fetched from memory to properly initialize the new cache line.

3 Multi processor system

3.1 Scope and limitations

The basic idea of the simulation is to evaluate the performance of the two cache coherency protocols using a set of cooked-up trace files that allow for a fair and repeatable experiment. However, the simulation might seem to exhibit the desired or expected behavior, but this is by no means a guarantee of the correctness of the actual implementation. And of course, benchmarking a coherency protocol that doesn't preserve coherency is completely useless. At first sight, the weak spot seemed to be the fact that the traces solely govern read and write operations and their associated addresses. Therefore, the simulation does not actually move around data that could potentially be used to verify whether a processor indeed reads the values it is supposed to read.
Then again, had the trace files also included data, correct output for one trace does not ensure correct output for another, i.e., this approach does not provide a general proof of correctness either. Moreover, an element of non-determinism is incorporated in the fashion in which caches concurrently attempt to access the bus. In case two caches should write to the very same address, of course only one of the two succeeds in writing its value back to memory; which one is up to chance. This, however, is not an issue the hardware should or even could deal with; instead, if undesired, it should be resolved programmatically. Cache coherency is a requisite for preserving data consistency, but not vice versa. So the problem of verification boils down to ensuring the cache line states honor the prescribed state transition diagram. This could be deduced from the source code of the simulation itself, but needless to say, this is rather error prone. On the other hand, what the protocol should achieve is that no two copies of a single cache line end up in a mutually exclusive state. This is something that can be checked at each time step of the simulation using an active module that has access to the cache line states of each cache. The designated point for such activity is of course the bus; therefore, an optional coherency check was built into the bus module to verify the correctness of the simulated protocols, given a certain trace file.
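For the VALID-INVALID case, the check described above amounts to: no (set, tag) pair may be valid in more than one cache at once. A minimal sketch of such a verifier follows (illustrative only; the function and the Line fields here are assumptions, not the report's source):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Simplified view of a cache line as seen by the verifier.
struct Line {
    bool     valid;
    uint32_t set;
    uint32_t tag;
};

// Returns true iff no (set, tag) pair is held valid by more than one
// cache, i.e. the VALID-INVALID invariant holds across all caches.
bool vi_coherent(const std::vector<std::vector<Line>>& caches) {
    std::map<std::pair<uint32_t, uint32_t>, int> owners;  // (set, tag) -> count
    for (const auto& cache : caches)
        for (const Line& l : cache)
            if (l.valid && ++owners[{l.set, l.tag}] > 1)
                return false;  // two valid copies of the same line
    return true;
}
```

In the report's setup, such a function would be called by the bus once per delta cycle, right after probes have been delivered to all caches.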

3.2 General design

In contrast with the single processor implementation, which communicated with the memory module directly, all communication now had to be handled through a bus module that had to be designed from scratch. In order to reduce the complexity of the cache modules, which already needed to take care of all cache line state transitions, it seemed reasonable to centralize the logic involved with arbitration and communication inside the bus module. By describing interfaces for the modules connected to the bus and for the bus itself, it was possible to reduce the number of ports and signals, which enhances the structure of the code. Another benefit of the design is that it indeed allows the verification code to work effectively, as proposed in section 3.1. Namely, it ensures that each cache takes notice of bus probes in the exact same delta cycle, and just before every cache line is checked for consistency. In case caches would have to wait for the simulator to gain control and take notice of an event that occurred in a previous delta cycle, the sanity check might wrongly report an unpermitted combination of states, as a read or write probe might not have reached all caches at the time of checking. This example exposes a non-trivial but important artifact of the SystemC concurrency model.

class cache_if : virtual public sc_interface {
public:
    virtual Line* get_lines() = 0;
    virtual bool probe(Request r) = 0;
    virtual void receive(Response r) = 0;
    virtual void wake_up() = 0;
};

class memory_if : virtual public sc_interface {
public:
    virtual void issue(Request r) = 0;
};

class bus_if : virtual public sc_interface {
public:
    virtual Hazard queue_request(Request r) = 0;
    virtual void cancel_request(Request r) = 0;
    virtual void queue_response(Response r) = 0;
};

Listing 1: Interfaces as found in task_3.h.
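The Request and Response types referenced by these interfaces are not shown in the listing. A plausible minimal shape is sketched below; every field here is an assumption for illustration, not the report's actual definition:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical payload types for the interfaces of Listing 1.
// The report does not show their definitions; fields are assumptions.
enum class Op : uint8_t { Read, Write, Sync };

struct Request {
    int      cache_id;   // issuing cache (hypothetical convention)
    Op       op;         // read, write, or synchronization token
    uint32_t addr;       // address; the bus operates per cache line
};

struct Response {
    Request req;         // the request being answered
    bool    from_cache;  // true for a direct cache-to-cache line transfer
};
```

Whatever their exact shape, the key point is that requests carry enough information for snooping caches to match probes against their own lines, and responses identify the request they answer.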
The arbitration of bus access is implemented as follows. The bus module implements a method queue_request() to be called by caches to queue a request, cancel_request() to withdraw an earlier filed request, and queue_response() for the memory module to queue a response for the bus to handle. All methods described in the bus interface are non-blocking; instead of letting the caches wait until they acquire bus access, the bus internally maintains one queue for requests and one for responses. This way, through a return value, it can also immediately notify a cache about a possible hazard with regard to the address a newly filed request is concerned with. Depending on the return value, a cache can decide, e.g., whether it had a successful write hit, or in fact suffered a write miss due to a write-after-write (WAW) dependency, i.e., another cache already had a write request pending in the request queue. In that case it has to cancel its write request and instead file a new read request, after which it

can retry to put a write probe on the bus. The receive() method implemented by the cache module (and thus invoked by the bus) also takes care of signalling the processor module to resume its computation in the event that an awaited response is received, either from memory or from another cache holding a valid copy of the requested line. In each delta cycle the bus checks whether the response queue is empty, and if not, it propagates the first response in line to the caches using their receive() method. If no responses are pending, the first request in line is handled by probing the caches using their probe() method. In the event of a write request, this method allows the caches to enforce the prescribed state transitions for their local copies, and in the event of a read request, the bus determines whether it can issue a direct cache line transfer or needs to forward the request to memory. Finally, the memory module was given shape. As it was assumed to be pipelined and therefore able to handle one request in each cycle, it is easily argued that it could be omitted from the simulation entirely, as caches could simply wait for the assumed latency of 100 cycles and continue operating as if they had received the requested value. But as a side effect of this approach, the contention measured on the bus would basically halve. Moreover, due to the latency of 100 cycles, it can generally not be assumed that a read issued only a few cycles after a write to the same address would yield a correct response. More importantly, in the event of memory answering a read request while, within the span of its 100-cycle latency, a write request was issued for the same address, it would certainly not be acceptable to send the stale value back to the bus. Similar to the cache coherency protocols, two different schemes could be used to resolve this problem: write-invalidate or write-update.
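The non-blocking queue_request() with immediate WAW hazard notification described above can be sketched as follows (a simplified model under assumed types, not the report's implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

enum class Op { Read, Write };
enum class Hazard { None, WAW };

struct Request {
    int      cache_id;  // issuing cache
    Op       op;
    uint32_t addr;      // line-aligned address (assumption)
};

struct Bus {
    std::deque<Request> requests;  // pending requests, served in FIFO order

    // Non-blocking: queues the request and immediately reports whether
    // another cache already has a write pending on the same line, in
    // which case the caller suffered a write-after-write (WAW) hazard.
    Hazard queue_request(const Request& r) {
        Hazard h = Hazard::None;
        for (const Request& q : requests)
            if (r.op == Op::Write && q.op == Op::Write &&
                q.addr == r.addr && q.cache_id != r.cache_id)
                h = Hazard::WAW;
        requests.push_back(r);
        return h;
    }
};
```

On a WAW hazard the cache would then call cancel_request(), file a read request instead, and retry the write probe once the line arrives, as described in the text.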
In practice, write-invalidation would require an administration of active requests and logic to reschedule requests if needed. On the other hand, write-update would require a fast buffer containing not only the addresses associated with the requests, but also the concerned data. In a sense, the write-update scheme forces memory to act much like a cache itself. For the implementation of the memory module, the write-invalidate scheme was chosen, as it seemed most conservative and more in line with the used cache coherency protocols. In addition, it also allowed for small optimizations like the elimination of duplicate read requests.

3.3 VALID-INVALID protocol

The first protocol to be implemented was a VALID-INVALID protocol, the simplest of bus-snooping cache coherency protocols. Bus-snooping means that all caches monitor the activity on the bus, to which each cache can only write exclusively. Upon a write probe hit, i.e., once a cache notices that another processor just wrote to an address of which it owns a copy, the cache line containing the stale copy of the associated memory segment is invalidated. Using the get_lines() method implemented by the cache module, the bus can retrieve the contents of the caches and compare every cache line with every other cache line present in the same set of every other cache. Because of the significant performance drawbacks of running this computation at every time step in the simulation, it can be enabled or disabled using a single define. The logic in this function is not earth-shattering, as the protocol simply forbids two cache lines to be valid in different caches provided they are mapped to the same set

and share the same tag. After serving as a useful problem indicator while debugging, in the end the coherency check showed no errors using any of the traces.

3.4 MOESI protocol

The MOESI protocol is an extension of the MESI protocol, which is the most common protocol that supports a write-back policy. The MESI protocol has four states (modified, exclusive, shared, invalid) and in a sense implements the same invalidation scheme as our simple VALID-INVALID protocol, only it keeps track of whether a cache line is shared or not. It only allows caches to dirty a cache line if it is in the modified or exclusive state. Only if it is shared do the other copies need to be invalidated first. The MOESI protocol adds a fifth, owned, state that is the intersection of the otherwise mutually exclusive modified and shared states. By allowing modified cache lines to be shared, data need not be written to memory before sharing it. Instead, the write-back operation is deferred. Proper insight into the workings of the MOESI protocol allowed the framework used for the VALID-INVALID protocol to be designed such that it was well suited for extension to this slightly more complex protocol. Only minor changes were needed, e.g., letting the cache line C-struct hold an enum of MOESI states instead of two Booleans. And of course, the cache modules were adapted to carry out the additional state transitions. As part of the assignment, a barrier synchronization was also implemented. The processors read synchronization requests from the trace file and forward them to their caches like an ordinary read or write request. The caches in turn forward the synchronization requests to the bus as if they were tokens, after which they enter a sleeping mode. The cache interface was extended with a wake_up() method for the bus to call whenever all processors have reached a checkpoint.
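The barrier bookkeeping the bus needs for this is small; a sketch follows (illustrative, not the report's code; the class and method names are assumptions):

```cpp
#include <cassert>

// Counts synchronization tokens forwarded by the caches. Each cache
// forwards a token and goes to sleep; when the last token arrives, the
// bus should call wake_up() on every cache and the barrier re-arms.
class SyncBarrier {
    int arrived_ = 0;
    int total_;
public:
    explicit SyncBarrier(int total) : total_(total) {}

    // Returns true exactly when the last processor reaches the checkpoint.
    bool arrive() {
        if (++arrived_ < total_) return false;
        arrived_ = 0;  // re-arm for the next checkpoint
        return true;
    }
};
```

The re-arming is what lets a single barrier object serve the repeated checkpoints found in the trace files.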
As an option, again to be enabled or disabled using a single define, synchronization was also implemented for the VALID-INVALID protocol to make for a fair comparison with the MOESI protocol.

Table 1: States permitted for two copies of the same cache line held in different caches (M = Modified, O = Owned, E = Exclusive, S = Shared, I = Invalid).

Similar to the implementation of the VALID-INVALID protocol, if desired, the bus can verify coherency by comparing cache line states in each time step of the simulation. But now, instead of checking for a single disallowed combination of states, several combinations of states exist that are prohibited by the protocol, as shown in Table 1. Again, after providing useful debugging information, the verifier fell silent using any of the available traces.
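The pairwise admissibility that Table 1 captures can be encoded directly. The sketch below uses the standard MOESI compatibility rules (M and E demand exclusivity, at most one owner, shared copies may coexist); it is an illustration, not the report's verifier:

```cpp
#include <cassert>

enum State { M, O, E, S, I };

// True iff two caches may simultaneously hold copies of the same cache
// line in states a and b under MOESI (standard compatibility rules).
bool compatible(State a, State b) {
    if (a == I || b == I) return true;   // an invalid copy clashes with nothing
    if (a == S && b == S) return true;   // multiple shared copies are fine
    if ((a == O && b == S) || (a == S && b == O))
        return true;                     // one owner, the rest shared
    return false;  // M and E are exclusive; two owners are forbidden
}
```

A MOESI verifier then simply applies compatible() to every pair of copies of the same (set, tag) across caches, just as the VALID-INVALID check compares valid bits.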

4 Results

Aside from a set of debugging trace files, two sets of trace files were available for benchmarking: one set that is supposed to mimic the read and write operations of a Fast Fourier Transform (FFT), and another one that should feature reads and writes to random memory locations. Both sets offer files that contain traces for 1, 2, 4, or 8 processors. A number of experiments were conducted, each time using both sets of benchmarking traces, but every time using different configuration options. The first experiment ignored the barrier synchronization events read from the trace files. The second experiment performed a barrier synchronization between all processors each time such an event was read from the trace files. A last experiment was done with synchronization enabled but snooping disabled, as was suggested in the assignment. It must be noted that the latter would force the system into an inconsistent state, which is in practice of course not desirable.

Figure 2: Average hit rate, (a) using the FFT trace, (b) using the random trace.

Figure 3: Average probe read hit rate, (a) using the FFT trace, (b) using the random trace.

The first thing that became evident is that, looking at the hit rates as displayed in Figure 2, there were no major differences between the two protocols, nor could any of the different configurations be distinguished in terms of hit rate performance. That is, given the used trace files. This was rather unexpected, as the MOESI protocol is supposed to be more efficient. The advantage of deferring writes to memory, however, does not necessarily increase the likelihood of finding a needed cache line locally, as a write would otherwise only invalidate cache lines in remote caches. Also, for the neighboring caches, the impact of such

invalidation is reasonably small, as a future request for a write-invalidated line can be handled using a direct cache line transfer. Unless there is much contention on the bus, not many cycles are wasted. More striking was the fact that the synchronized run with the VALID-INVALID protocol, as displayed in Figure 3(b), outperformed MOESI in the number of probe read hits given random reads and writes. Then again, randomness is exactly what caches cannot anticipate, nor could a coherency protocol improve a cache's ability to do so.

Figure 4: Bus contention, (a) using the FFT trace, (b) using the random trace.

Contention was measured in the bus module by counting the number of delta cycles in which there was more than a single request to be handled by the bus, i.e., when the sum of the elements in the request and response queues was greater than one. The ratio of cycles the bus was in contention to the number of cycles that passed in the simulation expresses the contention as displayed in Figure 4. A first observation regards the decrease in contention for the synchronized runs. Obviously, synchronization relieves the bus, as it forces processors to wait until they reach the barrier instead of continuing to race towards the end of their trace. Also, the MOESI protocol seems to have a smaller footprint on the bus, which might be attributed to deferred writes. Unexpectedly, however, MOESI used more memory writes than VALID-INVALID did. Most likely this is due to a bug in the accounting of memory writes or in the simulation itself.

Figure 5: Simulation time, (a) using the FFT trace, (b) using the random trace.

After studying the preceding graphs, it is not very surprising to see that in terms of simulation time, as displayed in Figure 5, both protocols are again

comparable. In fact, the VALID-INVALID protocol is in most cases outperforming MOESI, which is rather disappointing. Also, the overhead induced by the barrier synchronization clearly shows in both graphs. For fewer than four processors, the asynchronous run with the MOESI protocol using the random trace is slightly faster than VALID-INVALID using the same configuration. In order to get a better idea of the numbers behind the conducted experiments, some raw program output is available in Appendix A.

5 Conclusions

All parts of the assignment were completed successfully, and the correctness of the implementations with regard to their preservation of coherency is verified in the simulations themselves. Although this cannot be accepted as a general proof of correctness, at least the runs whose results were discussed in this report never ended up in an inconsistent state. The results of the experiments were a bit disappointing, as the comparison between the two protocols showed no clear winner in terms of performance. It appears that the provided trace files do not exhibit the behavior that lets the advantage of the MOESI protocol shine through. A more general but nevertheless important lesson could be learned from this, which is that proper cache utilization largely depends on the memory access pattern at hand. It signifies the importance of understanding the dynamics of caching, especially in multi-core shared memory systems where the issue of preserving cache coherency comes into play.

A Program output

The following listings cover the output of 8-core experiments with barrier synchronization using the two different benchmark traces.

Output listing 1: task_2.bin trace_files-v3/fft_16_p8.trf
CPU Reads RHit RMiss Writes WHit WMiss Hitrate
Sim Time = ns

Output listing 2: task_3.bin trace_files-v3/fft_16_p8.trf
CPU Reads RHit RMiss Writes WHit WMiss Hitrate
Sim Time = ns

Output listing 3: task_2.bin trace_files-v3/rnd_p8.trf
CPU Reads RHit RMiss Writes WHit WMiss Hitrate
Sim Time = ns

Output listing 4: task_3.bin trace_files-v3/rnd_p8.trf
CPU Reads RHit RMiss Writes WHit WMiss Hitrate
Sim Time = ns


Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this

More information

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and

More information

Overview: Shared Memory Hardware

Overview: Shared Memory Hardware Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing

More information

Computer Systems C S Cynthia Lee Today s materials adapted from Kevin Webb at Swarthmore College

Computer Systems C S Cynthia Lee Today s materials adapted from Kevin Webb at Swarthmore College Computer Systems C S 0 7 Cynthia Lee Today s materials adapted from Kevin Webb at Swarthmore College 2 Today s Topics TODAY S LECTURE: Caching ANNOUNCEMENTS: Assign6 & Assign7 due Friday! 6 & 7 NO late

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 15 LAST TIME: CACHE ORGANIZATION Caches have several important parameters B = 2 b bytes to store the block in each cache line S = 2 s cache sets

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core

More information

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

EEC 170 Computer Architecture Fall Cache Introduction Review. Review: The Memory Hierarchy. The Memory Hierarchy: Why Does it Work?

EEC 170 Computer Architecture Fall Cache Introduction Review. Review: The Memory Hierarchy. The Memory Hierarchy: Why Does it Work? EEC 17 Computer Architecture Fall 25 Introduction Review Review: The Hierarchy Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology

More information

CMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 27, SPRING 2013

CMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 27, SPRING 2013 CMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 27, SPRING 2013 CACHING Why: bridge speed difference between CPU and RAM Modern RAM allows blocks of memory to be read quickly Principle

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Scalable Cache Coherence

Scalable Cache Coherence Scalable Cache Coherence [ 8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as convenient

More information

Computer Architecture

Computer Architecture 18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III Subject Name: Operating System (OS) Subject Code: 630004 Unit-1: Computer System Overview, Operating System Overview, Processes

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Lab 5: A Non-Blocking Instruction Cache

Lab 5: A Non-Blocking Instruction Cache Lab 5: A Non-Blocking Instruction Cache 4541.763 Laboratory 5 Assigned: November 3, 2009 Due: November 17, 2009 1 Introduction In Lab 4, you were given a multi-cycle SMIPSv2 implementation, which you then

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Introduction to Computing and Systems Architecture

Introduction to Computing and Systems Architecture Introduction to Computing and Systems Architecture 1. Computability A task is computable if a sequence of instructions can be described which, when followed, will complete such a task. This says little

More information

Scientific Applications. Chao Sun

Scientific Applications. Chao Sun Large Scale Multiprocessors And Scientific Applications Zhou Li Chao Sun Contents Introduction Interprocessor Communication: The Critical Performance Issue Characteristics of Scientific Applications Synchronization:

More information

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

12 Cache-Organization 1

12 Cache-Organization 1 12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty

More information

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations? Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

An Overview of MIPS Multi-Threading. White Paper

An Overview of MIPS Multi-Threading. White Paper Public Imagination Technologies An Overview of MIPS Multi-Threading White Paper Copyright Imagination Technologies Limited. All Rights Reserved. This document is Public. This publication contains proprietary

More information

Chapter 18 Parallel Processing

Chapter 18 Parallel Processing Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD

More information

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major

More information

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required

More information

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 233 6.2 Types of Memory 233 6.3 The Memory Hierarchy 235 6.3.1 Locality of Reference 237 6.4 Cache Memory 237 6.4.1 Cache Mapping Schemes 239 6.4.2 Replacement Policies 247

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors

More information

Module 2: Virtual Memory and Caches Lecture 3: Virtual Memory and Caches. The Lecture Contains:

Module 2: Virtual Memory and Caches Lecture 3: Virtual Memory and Caches. The Lecture Contains: The Lecture Contains: Program Optimization for Multi-core: Hardware Side of It Contents RECAP: VIRTUAL MEMORY AND CACHE Why Virtual Memory? Virtual Memory Addressing VM VA to PA Translation Page Fault

More information

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook CS356: Discussion #9 Memory Hierarchy and Caches Marco Paolieri (paolieri@usc.edu) Illustrations from CS:APP3e textbook The Memory Hierarchy So far... We modeled the memory system as an abstract array

More information

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches CS 61C: Great Ideas in Computer Architecture The Memory Hierarchy, Fully Associative Caches Instructor: Alan Christopher 7/09/2014 Summer 2014 -- Lecture #10 1 Review of Last Lecture Floating point (single

More information

The Cache Write Problem

The Cache Write Problem Cache Coherency A multiprocessor and a multicomputer each comprise a number of independent processors connected by a communications medium, either a bus or more advanced switching system, such as a crossbar

More information

Memory Hierarchy. Goal: Fast, unlimited storage at a reasonable cost per bit.

Memory Hierarchy. Goal: Fast, unlimited storage at a reasonable cost per bit. Memory Hierarchy Goal: Fast, unlimited storage at a reasonable cost per bit. Recall the von Neumann bottleneck - single, relatively slow path between the CPU and main memory. Fast: When you need something

More information

Cache Coherence and Atomic Operations in Hardware

Cache Coherence and Atomic Operations in Hardware Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some

More information

Written Exam / Tentamen

Written Exam / Tentamen Written Exam / Tentamen Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp KTH Royal Institute of

More information

CS Computer Architecture

CS Computer Architecture CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 Computer Systems Organization The CPU (Central Processing Unit) is the brain of the computer. Fetches instructions from main memory.

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Q1. What is Deadlock? Explain essential conditions for deadlock to occur?

Q1. What is Deadlock? Explain essential conditions for deadlock to occur? II nd Midterm session 2017-18 Subject: Operating System ( V CSE-B ) Q1. What is Deadlock? Explain essential conditions for deadlock to occur? In a multiprogramming environment, several processes may compete

More information

Memory Hierarchy Design (Appendix B and Chapter 2)

Memory Hierarchy Design (Appendix B and Chapter 2) CS359: Computer Architecture Memory Hierarchy Design (Appendix B and Chapter 2) Yanyan Shen Department of Computer Science and Engineering 1 Four Memory Hierarchy Questions Q1 (block placement): where

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring 2019 Caches and the Memory Hierarchy Assigned February 13 Problem Set #2 Due Wed, February 27 http://inst.eecs.berkeley.edu/~cs152/sp19

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Objectives of Chapter To provide a grand tour of the major computer system components:

More information

Basic Memory Management

Basic Memory Management Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester 10/15/14 CSC 2/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it

More information

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp. Cache associativity Cache and performance 12 1 CMPE110 Spring 2005 A. Di Blas 110 Spring 2005 CMPE Cache Direct-mapped cache Reads and writes Textbook Edition: 7.1 to 7.3 Second Third Edition: 7.1 to 7.3

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016 Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss

More information

CS 136: Advanced Architecture. Review of Caches

CS 136: Advanced Architecture. Review of Caches 1 / 30 CS 136: Advanced Architecture Review of Caches 2 / 30 Why Caches? Introduction Basic goal: Size of cheapest memory... At speed of most expensive Locality makes it work Temporal locality: If you

More information

Top-Level View of Computer Organization

Top-Level View of Computer Organization Top-Level View of Computer Organization Bởi: Hoang Lan Nguyen Computer Component Contemporary computer designs are based on concepts developed by John von Neumann at the Institute for Advanced Studies

More information

PROJECT 6: PINTOS FILE SYSTEM. CS124 Operating Systems Winter , Lecture 25

PROJECT 6: PINTOS FILE SYSTEM. CS124 Operating Systems Winter , Lecture 25 PROJECT 6: PINTOS FILE SYSTEM CS124 Operating Systems Winter 2015-2016, Lecture 25 2 Project 6: Pintos File System Last project is to improve the Pintos file system Note: Please ask before using late tokens

More information

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

Caches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first

Caches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first Cache Memory memory hierarchy CPU memory request presented to first-level cache first if data NOT in cache, request sent to next level in hierarchy and so on CS3021/3421 2017 jones@tcd.ie School of Computer

More information

Shared Symmetric Memory Systems

Shared Symmetric Memory Systems Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information