Advances in Computer Architecture Lab Report


H.M.F. Lohstroh, February 26, 2011

Abstract

As a lab exercise of the Advances in Computer Architecture course, the task was given to build a discrete event simulation using SystemC that models the transactions between a number of processors, their private L1 D-caches and a shared main memory. A simple VALID-INVALID bus-snooping cache coherency protocol and the slightly more sophisticated optimized MOESI protocol found in AMD Athlon processors were implemented, and their performance was compared. The different implementations are discussed in detail, as are the obtained results.

1 Introduction

In computer architecture, memory is organized hierarchically. The nearer to the processor, the smaller and faster the components. Together, these components allow data to flow through the system. Most essential are the registers in the processor, which store the operands and results of all computations. Although a program is typically stored on non-volatile storage such as a hard disk, before execution it is first loaded into main memory, which is much faster, but still significantly slower than the registers. As an intermediate step in the hierarchy, caches were invented to avoid the penalties of memory access by keeping the most recently used data and delivering it much faster to the processor. The first processor to place a cache directly onto the processor die was Intel's 486. Released in 1989, it contained 8 KB of cache. Since then, both processor and memory cycle times have been decreasing exponentially over time, but processor clock frequency has scaled much faster than memory latency. Due to this growing disparity between processor and memory speed, memory tends to be an overwhelming bottleneck in performance. Hence, in subsequent designs more and more area on the processor die has been dedicated to the purpose of caching.
In fact, modern designs often feature a multitude of cache levels, where one part (L1, L2) resides on the die and another part (L3) is off-chip. Together with an ongoing miniaturization that allows for a denser storage capacity, modern processors typically comprise several megabytes of cache. Given regular access patterns, which are typical in most computer programs, caches exploit the benefits of spatial and temporal locality of the data in memory. The idea is simple and effective: each memory request goes through the cache

and is only forwarded to the main memory (or a higher level cache) if the requested data is not present in that cache. Matters become more difficult when multiple processors with local caches are incorporated in a shared memory system. If caches keep private copies of shared data while being unaware of the state of the copies in other caches, undefined behavior results. Therefore, a cache coherence protocol is required to maintain consistency of the data stored in local caches. These protocols consist of a set of cache line state transitions that can be elegantly captured in a diagram. Nevertheless, it is difficult to get a grasp of the actual behavior they describe, let alone prove their correctness. The aim of this project was to implement two different cache coherency protocols in an event driven simulation using SystemC, which is a system design language based on C++. SystemC uses processes to model concurrency in a cooperative multi-tasking fashion, i.e., the concurrency is not pre-emptive. Each process in the simulator executes a small chunk of code and then voluntarily releases control to let another process, chosen at random, execute in the same simulated time span. Such a time slice is referred to as a delta cycle. Events are propagated through the system by means of ports and signals, which can be used to transfer data or cause interrupts. All modules are sensitive to a shared clock, which means that each module is woken up once every clock cycle to do a share of the work. After implementing a single processor simulation, a multiprocessor simulation was implemented, first with a snoopy write-invalidate protocol and then with a slightly more sophisticated protocol known as MOESI, which features additional states that allow sharing of cache lines among multiple caches even if they have already been written to. All three implementations are explained in detail, after which the obtained results are discussed.
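The cooperative delta-cycle model described above can be illustrated with a plain C++ sketch (no SystemC dependency; the function and type names are purely illustrative, not part of SystemC or the report's code):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <random>
#include <vector>

// Cooperative (non-preemptive) scheduling: each process runs one chunk
// of work and voluntarily returns. Within one delta cycle, all runnable
// processes execute exactly once, in an arbitrary (here: random) order.
using Process = std::function<void()>;

void delta_cycle(std::vector<Process>& procs, std::mt19937& rng) {
    std::vector<size_t> order(procs.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::shuffle(order.begin(), order.end(), rng);  // order is up to chance
    for (size_t i : order) procs[i]();              // each runs one chunk
}
```

Note that regardless of the shuffled order, every process runs exactly once per delta cycle, which is what makes the simulated concurrency repeatable in aggregate.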
2 Single processor system

The single core simulation uses a 32 KB 8-way set-associative L1 D-cache with a 32-byte line size. The associativity of the cache means that the address space is mapped onto sets of 8 slots based on only part of the full 32-bit address. The given parameters determine how the address should be interpreted. Assuming that reads and writes are byte-addressed, a 32-byte cache line requires log(32)/log(2) = 5 bits to determine the offset. Having 32 KB of cache in total, it must consist of 32768 / 32 = 1024 slots, subdivided over 1024 / 8 = 128 sets. Hence, to determine the set, another log(128)/log(2) = 7 bits are needed. Finally, the remaining 32 - (5+7) = 20 bits denote the tag. In the implementation these numbers are calculated using defines, i.e., the properties of the cache can be conveniently changed at compile time.

Figure 1: Interpretation of a 32-bit address (tag, set, offset).

The actual cache lines are given shape in a C-struct holding two Booleans to keep track of whether a line is dirty and/or valid, along with a timestamp

that is updated each time a line is read or written, and finally, of course, the tag that is needed to calculate the original address in memory. A least-recently-used, write-back replacement strategy was implemented, which in practice means that upon a read or write, after looking up the set, an iteration starts over the set members to find a matching tag. In case of a cache miss, a new line has to be allocated. If all lines in the designated set are occupied, the least recently used cache line is evicted. If the evicted line is dirty, it has to be written to memory first, as the write-back scheme defers memory access until eviction. Less obvious, but of equal importance, is that upon a write miss it does not suffice to simply evict a cache line and place the newly written value into the cache. Since the processor reads and writes 32-bit words and the cache stores 32-byte lines, the remaining 28 bytes must also be fetched from memory to properly initialize the new cache line.

3 Multi processor system

3.1 Scope and limitations

The basic idea of the simulation is to evaluate the performance of the two cache coherency protocols using a set of cooked-up trace files that allow for a fair and repeatable experiment. However, the simulation might seem to exhibit the desired or expected behavior, but this is by no means a guarantee of the correctness of the actual implementation. And of course, benchmarking a coherency protocol that doesn't preserve coherency is completely useless. At first sight, the weak spot seemed to be the fact that the traces solely govern read and write operations and their associated addresses. Therefore, the simulation does not actually move around data that could potentially be used to verify whether a processor indeed reads the values it is supposed to read.
Then again, had the trace files also included data, correct output for one trace does not ensure correct output for another, i.e., this approach does not provide a general proof of correctness either. Moreover, an element of non-determinism is incorporated in the fashion in which caches concurrently attempt to access the bus. In case two caches should write to the very same address, of course only one of the two succeeds in writing its value back to memory; which one is up to chance. This, however, is not an issue the hardware should or even could deal with; instead, if undesired, it should be resolved programmatically. Cache coherency is a requisite for preserving data consistency, but not vice versa. So the problem of verification boils down to ensuring the cache line states honor the prescribed state transition diagram. This could be deduced from the source code of the simulation itself, but needless to say, this is rather error prone. On the other hand, what the protocol should achieve is that no two copies of a single cache line end up in a mutually exclusive state. This is something that can be checked at each time step of the simulation using an active module that has access to the cache line states of each cache. The designated point for such activity is of course the bus; therefore, an optional coherency check was built into the bus module to verify the correctness of the simulated protocols, given a certain trace file.
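For the VALID-INVALID case, the check described above amounts to: no (set, tag) pair may be valid in more than one cache at once. A minimal sketch of such a verifier follows (illustrative only; the function and the Line fields here are assumptions, not the report's source):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Simplified view of a cache line as seen by the verifier.
struct Line {
    bool     valid;
    uint32_t set;
    uint32_t tag;
};

// Returns true iff no (set, tag) pair is held valid by more than one
// cache, i.e. the VALID-INVALID invariant holds across all caches.
bool vi_coherent(const std::vector<std::vector<Line>>& caches) {
    std::map<std::pair<uint32_t, uint32_t>, int> owners;  // (set, tag) -> count
    for (const auto& cache : caches)
        for (const Line& l : cache)
            if (l.valid && ++owners[{l.set, l.tag}] > 1)
                return false;  // two valid copies of the same line
    return true;
}
```

In the report's setup, such a function would be called by the bus once per delta cycle, right after probes have been delivered to all caches.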

3.2 General design

In contrast with the single processor implementation, which communicated with the memory module directly, all communication now had to be handled through a bus module that had to be designed from scratch. In order to reduce the complexity of the cache modules, which already needed to take care of all cache line state transitions, it seemed reasonable to centralize the logic involved with arbitration and communication inside the bus module. By describing interfaces for the modules connected to the bus and for the bus itself, it was possible to reduce the number of ports and signals, which enhances the structure of the code. Another benefit of the design is that it indeed allows the verification code to work effectively, as proposed in section 3.1. Namely, it ensures that each cache takes notice of bus probes in the exact same delta cycle, and just before every cache line is checked for consistency. In case caches would have to wait for the simulator to gain control and take notice of an event that occurred in a previous delta cycle, the sanity check might wrongly report an unpermitted combination of states, as a read or write probe might not have reached all caches at the time of checking. This example exposes a non-trivial but important artifact of the SystemC concurrency model.

class cache_if : virtual public sc_interface {
public:
    virtual Line* get_lines() = 0;
    virtual bool probe(Request r) = 0;
    virtual void receive(Response r) = 0;
    virtual void wake_up() = 0;
};

class memory_if : virtual public sc_interface {
public:
    virtual void issue(Request r) = 0;
};

class bus_if : virtual public sc_interface {
public:
    virtual Hazard queue_request(Request r) = 0;
    virtual void cancel_request(Request r) = 0;
    virtual void queue_response(Response r) = 0;
};

Listing 1: Interfaces as found in task_3.h.
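The Request and Response types referenced by these interfaces are not shown in the listing. A plausible minimal shape is sketched below; every field here is an assumption for illustration, not the report's actual definition:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical payload types for the interfaces of Listing 1.
// The report does not show their definitions; fields are assumptions.
enum class Op : uint8_t { Read, Write, Sync };

struct Request {
    int      cache_id;   // issuing cache (hypothetical convention)
    Op       op;         // read, write, or synchronization token
    uint32_t addr;       // address; the bus operates per cache line
};

struct Response {
    Request req;         // the request being answered
    bool    from_cache;  // true for a direct cache-to-cache line transfer
};
```

Whatever their exact shape, the key point is that requests carry enough information for snooping caches to match probes against their own lines, and responses identify the request they answer.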
The arbitration of bus access is implemented as follows. The bus module implements a method queue_request() to be called by caches to queue a request, cancel_request() to withdraw an earlier filed request, and queue_response() for the memory module to queue a response for the bus to handle. All methods described in the bus interface are non-blocking; instead of letting the caches wait until they acquire bus access, the bus internally maintains one queue for requests and one for responses. This way, through a return value, it can also immediately notify a cache about a possible hazard with regard to the address a newly filed request is concerned with. Depending on the return value, a cache can decide, e.g., whether it had a successful write hit, or in fact suffered a write miss due to a write-after-write (WAW) dependency, i.e., another cache already had a write request pending in the request queue. In that case it has to cancel its write request and instead file a new read request, after which it

can retry to put a write probe on the bus. The receive() method implemented by the cache module (and thus invoked by the bus) also takes care of signalling the processor module to resume its computation in the event that an awaited response is received, either from memory or from another cache holding a valid copy of the requested line. In each delta cycle the bus checks whether the response queue is empty, and if not, it propagates the first response in line to the caches using their receive() method. If no responses are pending, the first request in line is handled by probing the caches using their probe() method. In the event of a write request, this method allows the caches to enforce the prescribed state transitions for their local copies, and in the event of a read request, the bus determines whether it can issue a direct cache line transfer or needs to forward the request to memory. Finally, the memory module was given shape. As it was assumed to be pipelined and therefore able to handle one request in each cycle, it is easily argued that it could be omitted from the simulation entirely, as caches could simply wait for the assumed latency of 100 cycles and continue operating as if they had received the requested value. But as a side effect of this approach, the contention measured on the bus would basically halve. Moreover, due to the latency of 100 cycles, it can generally not be assumed that a read issued only a few cycles after a write to the same address would yield a correct response. More importantly, in the event of memory answering a read request while, within the span of its 100-cycle latency, a write request was issued for the same address, it would certainly not be acceptable to send the stale value back to the bus. Similar to the cache coherency protocols, two different schemes could be used to resolve this problem: write-invalidate or write-update.
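The non-blocking queue_request() with immediate WAW hazard notification described above can be sketched as follows (a simplified model under assumed types, not the report's implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

enum class Op { Read, Write };
enum class Hazard { None, WAW };

struct Request {
    int      cache_id;  // issuing cache
    Op       op;
    uint32_t addr;      // line-aligned address (assumption)
};

struct Bus {
    std::deque<Request> requests;  // pending requests, served in FIFO order

    // Non-blocking: queues the request and immediately reports whether
    // another cache already has a write pending on the same line, in
    // which case the caller suffered a write-after-write (WAW) hazard.
    Hazard queue_request(const Request& r) {
        Hazard h = Hazard::None;
        for (const Request& q : requests)
            if (r.op == Op::Write && q.op == Op::Write &&
                q.addr == r.addr && q.cache_id != r.cache_id)
                h = Hazard::WAW;
        requests.push_back(r);
        return h;
    }
};
```

On a WAW hazard the cache would then call cancel_request(), file a read request instead, and retry the write probe once the line arrives, as described in the text.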
In practice, write-invalidation would require an administration of active requests and logic to reschedule requests if needed. On the other hand, write-update would require a fast buffer containing not only the addresses associated with the requests, but also the concerned data. In a sense, the write-update scheme forces memory to act much like a cache itself. For the implementation of the memory module, the write-invalidate scheme was chosen, as it seemed most conservative and more in line with the used cache coherency protocols. In addition, it also allowed for small optimizations like the elimination of duplicate read requests.

3.3 VALID-INVALID protocol

The first protocol to be implemented was a VALID-INVALID protocol, the simplest of bus-snooping cache coherency protocols. Bus-snooping means that all caches monitor the activity on the bus, to which each cache can only write exclusively. Upon a write probe hit, i.e., once a cache notices that another processor just wrote to an address of which it owns a copy, the cache line containing the stale copy of the associated memory segment is invalidated. Using the get_lines() method implemented by the cache module, the bus can retrieve the contents of the caches and compare every cache line with every other cache line present in the same set of every other cache. Because of the significant performance drawbacks of running this computation at every time step in the simulation, it can be enabled or disabled using a single define. The logic in this function is not earth-shattering, as the protocol simply forbids two cache lines to be valid in different caches provided they are mapped to the same set

and share the same tag. After serving as a useful problem indicator while debugging, in the end the coherency check showed no errors using any of the traces.

3.4 MOESI protocol

The MOESI protocol is an extension of the MESI protocol, which is the most common protocol that supports a write-back policy. The MESI protocol has four states (modified, exclusive, shared, invalid) and in a sense implements the same invalidation scheme as our simple VALID-INVALID protocol, only it keeps track of whether a cache line is shared or not. It only allows caches to dirty a cache line if it is in the modified or exclusive state. Only if it is shared do the other copies need to be invalidated first. The MOESI protocol adds a fifth, owned, state that is the intersection of the otherwise mutually exclusive modified and shared states. By allowing modified cache lines to be shared, data need not be written to memory before sharing it. Instead, the write-back operation is deferred. Proper insight into the workings of the MOESI protocol allowed the framework used for the VALID-INVALID protocol to be designed such that it was well suited for extension to this slightly more complex protocol. Only minor changes were needed, e.g., letting the cache line C-struct hold an enum of MOESI states instead of two Booleans. And of course, the cache modules were adapted to carry out the additional state transitions. As part of the assignment, a barrier synchronization was also implemented. The processors read synchronization requests from the trace file and forward them to their caches like an ordinary read or write request. The caches in turn forward the synchronization requests to the bus as if they were tokens, after which they enter a sleeping mode. The cache interface was extended with a wake_up() method for the bus to call whenever all processors have reached a checkpoint.
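The barrier bookkeeping the bus needs for this is small; a sketch follows (illustrative, not the report's code; the class and method names are assumptions):

```cpp
#include <cassert>

// Counts synchronization tokens forwarded by the caches. Each cache
// forwards a token and goes to sleep; when the last token arrives, the
// bus should call wake_up() on every cache and the barrier re-arms.
class SyncBarrier {
    int arrived_ = 0;
    int total_;
public:
    explicit SyncBarrier(int total) : total_(total) {}

    // Returns true exactly when the last processor reaches the checkpoint.
    bool arrive() {
        if (++arrived_ < total_) return false;
        arrived_ = 0;  // re-arm for the next checkpoint
        return true;
    }
};
```

The re-arming is what lets a single barrier object serve the repeated checkpoints found in the trace files.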
As an option, again to be enabled or disabled using a single define, synchronization was also implemented for the VALID-INVALID protocol to make for a fair comparison with the MOESI protocol.

Table 1: States permitted for two copies of the same cache line held in different caches (M = Modified, O = Owned, E = Exclusive, S = Shared, I = Invalid).

Similar to the implementation of the VALID-INVALID protocol, if desired, the bus can verify coherency by comparing cache line states in each time step of the simulation. But now, instead of checking for a single disallowed combination of states, several combinations of states exist that are prohibited by the protocol, as shown in Table 1. Again, after providing useful debugging information, the verifier fell silent using any of the available traces.
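The pairwise admissibility that Table 1 captures can be encoded directly. The sketch below uses the standard MOESI compatibility rules (M and E demand exclusivity, at most one owner, shared copies may coexist); it is an illustration, not the report's verifier:

```cpp
#include <cassert>

enum State { M, O, E, S, I };

// True iff two caches may simultaneously hold copies of the same cache
// line in states a and b under MOESI (standard compatibility rules).
bool compatible(State a, State b) {
    if (a == I || b == I) return true;   // an invalid copy clashes with nothing
    if (a == S && b == S) return true;   // multiple shared copies are fine
    if ((a == O && b == S) || (a == S && b == O))
        return true;                     // one owner, the rest shared
    return false;  // M and E are exclusive; two owners are forbidden
}
```

A MOESI verifier then simply applies compatible() to every pair of copies of the same (set, tag) across caches, just as the VALID-INVALID check compares valid bits.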

4 Results

Aside from a set of debugging trace files, two sets of trace files were available for benchmarking: one set that is supposed to mimic the read and write operations of a Fast Fourier Transform (FFT), and another one that should feature reads and writes to random memory locations. Both sets offer files that contain traces for 1, 2, 4, or 8 processors. A number of experiments were conducted, each time using both sets of benchmarking traces, but every time using different configuration options. The first experiment ignored the barrier synchronization events read from the trace files. The second experiment performed a barrier synchronization between all processors each time such an event was read from the trace files. A last experiment was done with synchronization enabled but snooping disabled, as was suggested in the assignment. It must be noted that the latter would force the system into an inconsistent state, which is in practice of course not desirable.

Figure 2: Average hit rate, (a) using the FFT trace, (b) using the random trace.

Figure 3: Average probe read hit rate, (a) using the FFT trace, (b) using the random trace.

The first thing that became evident is that, looking at the hit rates as displayed in Figure 2, there were no major differences between the two protocols, nor could any of the different configurations be distinguished in terms of hit rate performance. That is, given the used trace files. This was rather unexpected, as the MOESI protocol is supposed to be more efficient. The advantage of deferring writes to memory, however, does not necessarily increase the likelihood of finding a needed cache line locally, as a write would otherwise only invalidate cache lines in remote caches. Also, for the neighboring caches, the impact of such

invalidation is reasonably small, as a future request for a write-invalidated line can be handled using a direct cache line transfer. Unless there is much contention on the bus, not many cycles are wasted. More striking was the fact that the synchronized run with the VALID-INVALID protocol, as displayed in Figure 3(b), outperformed MOESI in the number of probe read hits given random reads and writes. Then again, randomness is exactly what caches cannot anticipate, nor could a coherency protocol improve a cache's ability to do so.

Figure 4: Bus contention, (a) using the FFT trace, (b) using the random trace.

Contention was measured in the bus module by counting the number of delta cycles in which there was more than a single request to be handled by the bus, i.e., when the sum of the elements in the request and response queues was greater than one. The ratio of cycles the bus was in contention to the number of cycles that passed in the simulation expresses the contention as displayed in Figure 4. A first observation regards the decrease in contention for the synchronized runs. Obviously, synchronization relieves the bus, as it forces processors to wait until they reach the barrier instead of continuing to race towards the end of their trace. Also, the MOESI protocol seems to have a smaller footprint on the bus, which might be attributed to deferred writes. Unexpectedly, however, MOESI used more memory writes than VALID-INVALID did. Most likely this is due to a bug in the accounting of memory writes or in the simulation itself.

Figure 5: Simulation time, (a) using the FFT trace, (b) using the random trace.

After studying the preceding graphs, it is not very surprising to see that in terms of simulation time, as displayed in Figure 5, both protocols are again

comparable. In fact, the VALID-INVALID protocol is in most cases outperforming MOESI, which is rather disappointing. Also, the overhead induced by the barrier synchronization clearly shows in both graphs. For fewer than four processors, the asynchronous run with the MOESI protocol using the random trace is slightly faster than VALID-INVALID using the same configuration. In order to get a better idea of the numbers behind the conducted experiments, some raw program output is available in Appendix A.

5 Conclusions

All parts of the assignment were completed successfully, and the correctness of the implementations with regard to their preservation of coherency is verified in the simulations themselves. Although this cannot be accepted as a general proof of correctness, at least the runs whose results were discussed in this report never ended up in an inconsistent state. The results of the experiments were a bit disappointing, as the comparison between the two protocols showed no clear winner in terms of performance. It appears that the provided trace files do not exhibit the behavior that lets the advantage of the MOESI protocol shine through. A more general but nevertheless important lesson could be learned from this, which is that proper cache utilization largely depends on the memory access pattern at hand. It signifies the importance of understanding the dynamics of caching, especially in multi-core shared memory systems where the issue of preserving cache coherency comes into play.

A Program output

The following listings cover the output of 8-core experiments with barrier synchronization using the two different benchmark traces.

Output listing 1: task_2.bin trace_files-v3/fft_16_p8.trf
CPU Reads RHit RMiss Writes WHit WMiss Hitrate
Sim Time = ns

Output listing 2: task_3.bin trace_files-v3/fft_16_p8.trf
CPU Reads RHit RMiss Writes WHit WMiss Hitrate
Sim Time = ns

Output listing 3: task_2.bin trace_files-v3/rnd_p8.trf
CPU Reads RHit RMiss Writes WHit WMiss Hitrate
Sim Time = ns

Output listing 4: task_3.bin trace_files-v3/rnd_p8.trf
CPU Reads RHit RMiss Writes WHit WMiss Hitrate
Sim Time = ns


Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this

More information

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and

More information

Overview: Shared Memory Hardware

Overview: Shared Memory Hardware Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing

More information

Computer Systems C S Cynthia Lee Today s materials adapted from Kevin Webb at Swarthmore College

Computer Systems C S Cynthia Lee Today s materials adapted from Kevin Webb at Swarthmore College Computer Systems C S 0 7 Cynthia Lee Today s materials adapted from Kevin Webb at Swarthmore College 2 Today s Topics TODAY S LECTURE: Caching ANNOUNCEMENTS: Assign6 & Assign7 due Friday! 6 & 7 NO late

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 15 LAST TIME: CACHE ORGANIZATION Caches have several important parameters B = 2 b bytes to store the block in each cache line S = 2 s cache sets

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core

More information

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

EEC 170 Computer Architecture Fall Cache Introduction Review. Review: The Memory Hierarchy. The Memory Hierarchy: Why Does it Work?

EEC 170 Computer Architecture Fall Cache Introduction Review. Review: The Memory Hierarchy. The Memory Hierarchy: Why Does it Work? EEC 17 Computer Architecture Fall 25 Introduction Review Review: The Hierarchy Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology

More information

CMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 27, SPRING 2013

CMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 27, SPRING 2013 CMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 27, SPRING 2013 CACHING Why: bridge speed difference between CPU and RAM Modern RAM allows blocks of memory to be read quickly Principle

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Scalable Cache Coherence

Scalable Cache Coherence Scalable Cache Coherence [ 8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as convenient

More information

Computer Architecture

Computer Architecture 18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III Subject Name: Operating System (OS) Subject Code: 630004 Unit-1: Computer System Overview, Operating System Overview, Processes

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Lab 5: A Non-Blocking Instruction Cache

Lab 5: A Non-Blocking Instruction Cache Lab 5: A Non-Blocking Instruction Cache 4541.763 Laboratory 5 Assigned: November 3, 2009 Due: November 17, 2009 1 Introduction In Lab 4, you were given a multi-cycle SMIPSv2 implementation, which you then

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Introduction to Computing and Systems Architecture

Introduction to Computing and Systems Architecture Introduction to Computing and Systems Architecture 1. Computability A task is computable if a sequence of instructions can be described which, when followed, will complete such a task. This says little

More information

Scientific Applications. Chao Sun

Scientific Applications. Chao Sun Large Scale Multiprocessors And Scientific Applications Zhou Li Chao Sun Contents Introduction Interprocessor Communication: The Critical Performance Issue Characteristics of Scientific Applications Synchronization:

More information

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

12 Cache-Organization 1

12 Cache-Organization 1 12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty

More information

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations? Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

An Overview of MIPS Multi-Threading. White Paper

An Overview of MIPS Multi-Threading. White Paper Public Imagination Technologies An Overview of MIPS Multi-Threading White Paper Copyright Imagination Technologies Limited. All Rights Reserved. This document is Public. This publication contains proprietary

More information

Chapter 18 Parallel Processing

Chapter 18 Parallel Processing Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD

More information

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major

More information

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required

More information

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 233 6.2 Types of Memory 233 6.3 The Memory Hierarchy 235 6.3.1 Locality of Reference 237 6.4 Cache Memory 237 6.4.1 Cache Mapping Schemes 239 6.4.2 Replacement Policies 247

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors

More information

Module 2: Virtual Memory and Caches Lecture 3: Virtual Memory and Caches. The Lecture Contains:

Module 2: Virtual Memory and Caches Lecture 3: Virtual Memory and Caches. The Lecture Contains: The Lecture Contains: Program Optimization for Multi-core: Hardware Side of It Contents RECAP: VIRTUAL MEMORY AND CACHE Why Virtual Memory? Virtual Memory Addressing VM VA to PA Translation Page Fault

More information

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook CS356: Discussion #9 Memory Hierarchy and Caches Marco Paolieri (paolieri@usc.edu) Illustrations from CS:APP3e textbook The Memory Hierarchy So far... We modeled the memory system as an abstract array

More information

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches CS 61C: Great Ideas in Computer Architecture The Memory Hierarchy, Fully Associative Caches Instructor: Alan Christopher 7/09/2014 Summer 2014 -- Lecture #10 1 Review of Last Lecture Floating point (single

More information

The Cache Write Problem

The Cache Write Problem Cache Coherency A multiprocessor and a multicomputer each comprise a number of independent processors connected by a communications medium, either a bus or more advanced switching system, such as a crossbar

More information

Memory Hierarchy. Goal: Fast, unlimited storage at a reasonable cost per bit.

Memory Hierarchy. Goal: Fast, unlimited storage at a reasonable cost per bit. Memory Hierarchy Goal: Fast, unlimited storage at a reasonable cost per bit. Recall the von Neumann bottleneck - single, relatively slow path between the CPU and main memory. Fast: When you need something

More information

Cache Coherence and Atomic Operations in Hardware

Cache Coherence and Atomic Operations in Hardware Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some

More information

Written Exam / Tentamen

Written Exam / Tentamen Written Exam / Tentamen Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp KTH Royal Institute of

More information

CS Computer Architecture

CS Computer Architecture CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 Computer Systems Organization The CPU (Central Processing Unit) is the brain of the computer. Fetches instructions from main memory.

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Q1. What is Deadlock? Explain essential conditions for deadlock to occur?

Q1. What is Deadlock? Explain essential conditions for deadlock to occur? II nd Midterm session 2017-18 Subject: Operating System ( V CSE-B ) Q1. What is Deadlock? Explain essential conditions for deadlock to occur? In a multiprogramming environment, several processes may compete

More information

Memory Hierarchy Design (Appendix B and Chapter 2)

Memory Hierarchy Design (Appendix B and Chapter 2) CS359: Computer Architecture Memory Hierarchy Design (Appendix B and Chapter 2) Yanyan Shen Department of Computer Science and Engineering 1 Four Memory Hierarchy Questions Q1 (block placement): where

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring 2019 Caches and the Memory Hierarchy Assigned February 13 Problem Set #2 Due Wed, February 27 http://inst.eecs.berkeley.edu/~cs152/sp19

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Objectives of Chapter To provide a grand tour of the major computer system components:

More information

Basic Memory Management

Basic Memory Management Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester 10/15/14 CSC 2/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it

More information

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp. Cache associativity Cache and performance 12 1 CMPE110 Spring 2005 A. Di Blas 110 Spring 2005 CMPE Cache Direct-mapped cache Reads and writes Textbook Edition: 7.1 to 7.3 Second Third Edition: 7.1 to 7.3

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016 Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss

More information

CS 136: Advanced Architecture. Review of Caches

CS 136: Advanced Architecture. Review of Caches 1 / 30 CS 136: Advanced Architecture Review of Caches 2 / 30 Why Caches? Introduction Basic goal: Size of cheapest memory... At speed of most expensive Locality makes it work Temporal locality: If you

More information

Top-Level View of Computer Organization

Top-Level View of Computer Organization Top-Level View of Computer Organization Bởi: Hoang Lan Nguyen Computer Component Contemporary computer designs are based on concepts developed by John von Neumann at the Institute for Advanced Studies

More information

PROJECT 6: PINTOS FILE SYSTEM. CS124 Operating Systems Winter , Lecture 25

PROJECT 6: PINTOS FILE SYSTEM. CS124 Operating Systems Winter , Lecture 25 PROJECT 6: PINTOS FILE SYSTEM CS124 Operating Systems Winter 2015-2016, Lecture 25 2 Project 6: Pintos File System Last project is to improve the Pintos file system Note: Please ask before using late tokens

More information

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

Caches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first

Caches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first Cache Memory memory hierarchy CPU memory request presented to first-level cache first if data NOT in cache, request sent to next level in hierarchy and so on CS3021/3421 2017 jones@tcd.ie School of Computer

More information

Shared Symmetric Memory Systems

Shared Symmetric Memory Systems Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information