Introducing the SCSD "Shared Cache for Shared Data" Multiprocessor Architecture. Nagi N. Mekhiel


Nagi N. Mekhiel
Department of Electrical and Computer Engineering, Ryerson Polytechnic University, Toronto, Ontario M5B 2K3
Yarc Systems, Newbury Park, CA 91320

Abstract

The SCSD model improves the performance of shared memory multiprocessor systems by separating shared data from private data: private data migrate to the local cache of each processor and shared data migrate to a shared cache. We present the architecture and protocols for the SCSD model. The protocols need not perform any consistency check, which reduces the demand for the shared bus. Results show that the SCSD model reduces the cost of an access and that, with a dual bus, the cost can become independent of the degree of data sharing.

1 Introduction

Shared memory multiprocessor systems provide the programmer with a simple and easy programming environment and use a single bus through which all processors access memory. Single bus shared memory systems suffer from bus saturation. An effective solution to the bus saturation problem is to give each processor a local cache. A cache coherency protocol is then needed to keep copies of the same data item in different caches consistent [8],[6]. Coherency protocols use the shared bus for snooping, which increases the demand for the bus and limits the scalability of the system [4],[6]. The cache coherency problem can be eliminated or reduced by using a single shared cache [1],[3]. The problem with sharing a single cache is that more than one processor may access it at the same time, so it becomes the system bottleneck (access conflicts). Several studies have discussed and evaluated the shared cache architecture [1],[2],[3]. In all of this work, private and shared data existed in the same cache and competed with each other, which makes the shared cache less efficient and causes access conflicts. In this paper, we introduce the SCSD model and present a suitable architecture, protocols and a cost model to evaluate its performance.

2 SCSD Architecture and Concept

Figure 1 shows the architecture of SCSD with a single and with a dual bus. The processors use local caches for private data and share a single cache (SC) for shared data. The local caches and the SC can use write through or write back policies. The cache tags do not need private/shared or valid/invalid bits: shared data exists only in the SC and private data exists only in the local caches, so valid and shared bits are unnecessary. The single bus model uses one bus for the shared memory (SM), the local caches and the shared cache (SC). The dual bus model uses one bus for the shared memory and another bus for the SC; both buses can be used simultaneously. Bus snooping is needed only to identify items that change from not shared to shared and to convert them. All processors are RISC, Harvard type processors running one instruction per clock with pipelining, and they share a single address space and the same memory.

2.1 SCSD Concept

Figure 2 shows the concept of the SCSD model. Private items map to the local caches of their processors. A shared item is first transferred to one of the local caches when it is requested for the first time; if it is later requested by another processor, the system transfers it to the shared cache using a swap operation: the item moves to the SC and becomes shared, and the item it replaces in the SC moves to the location the first item frees in the local cache (the swap operation).
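To make the swap operation concrete, the following Python sketch models it under simplifying assumptions: direct-mapped caches represented as dictionaries, a single shared cache, and illustrative names (swap_to_shared, local_caches, shared_cache) that are not taken from the paper.

# Minimal sketch of the SCSD swap operation (Section 2.1).
# Caches are modeled as dictionaries {set_index: (tag, data)}; all names
# here are illustrative, not defined in the paper.

NUM_SETS = 4  # tiny caches, for illustration only

def set_index(addr):
    return addr % NUM_SETS

def swap_to_shared(owner_p, addr, local_caches, shared_cache):
    """The item at `addr` is private in owner_p's local cache but has now
    been requested by another processor: move it to the shared cache (it
    becomes shared) and move the item it displaces there into the slot it
    frees in owner_p's local cache (the swap operation)."""
    idx = set_index(addr)
    displaced = shared_cache.get(idx)                    # item being replaced in the SC
    shared_cache[idx] = local_caches[owner_p].pop(idx)   # requested item becomes shared
    if displaced is not None:
        local_caches[owner_p][idx] = displaced           # displaced item takes the freed slot

# Usage: processor 0 holds address 6 privately; processor 1 then requests it.
local_caches = {0: {set_index(6): ("tag6", "data@6")}, 1: {}}
shared_cache = {set_index(6): ("tag2", "data@2")}
swap_to_shared(owner_p=0, addr=6, local_caches=local_caches, shared_cache=shared_cache)
print(shared_cache, local_caches)

Because the displaced SC item reuses the freed local cache slot, no line is left invalid, which is consistent with the observation above that the cache tags need no valid bits.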

[Figure 1: SCSD Architecture - single bus SCSD model and dual bus SCSD model]

3 The SCSD No-Coherency Protocols

The protocols for the SCSD model need not perform any consistency check because shared data and private data exist in separate caches. Only one copy of private data and one copy of shared data exist in the caches. Snooping is needed only when a processor requests a shared data item that exists in another processor's local cache as a private item. The main purpose of the protocol is to separate shared from private data and to forbid multiple copies of the same data item from existing in the caches.

3.1 SCSD No-Coherency Write Through Protocol

[Figure 2: SCSD Concept - private items in the local caches, a shared item swapped into the shared cache]

[Figure 3: SCSD No-Coherency Write Through Protocol - states NS (not shared) and S (shared); Pr,Pw = processor read and write; br,bw = bus read and write]

Figure 3 shows the SCSD write through no-coherency protocol. An item enters the NS (not shared) state when a processor requests the data from main memory for the first time. The item enters the S (shared) state when a processor requests an NS item that is in the local cache of another processor. A read or write by the local processor to an item in state NS does not change the state of the item. A read or write by another processor (over the bus) to an item in state NS causes the item to become shared (it goes to state S). A read or write by the local processor or by any other processor to an item in state S does not change the state of the item. When an item in state S is replaced by another item in the SC, it becomes not shared (goes to state NS). The protocol does not use invalidate or update policies (no coherency check); it only snoops the bus when the requested shared item is not found in the SC.
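The two-state write through protocol of Figure 3 can be summarized as a small transition function. The Python sketch below follows the transitions described above; the event names and the encoding of states are illustrative, not defined in the paper.

# Sketch of the two-state write-through protocol of Figure 3.
# Events: 'Pr'/'Pw' = read/write by the local processor,
#         'br'/'bw' = read/write by another processor over the bus,
#         'replace' = the item is evicted from the shared cache SC.

NS, S = "NS", "S"   # not shared / shared

def wt_next_state(state, event):
    if state == NS:
        if event in ("Pr", "Pw"):
            return NS        # local accesses keep the item private
        if event in ("br", "bw"):
            return S         # another processor touches it: it becomes shared
    if state == S:
        if event in ("Pr", "Pw", "br", "bw"):
            return S         # reads and writes leave it shared
        if event == "replace":
            return NS        # evicted from the SC: not shared again
    return state

assert wt_next_state(NS, "br") == S
assert wt_next_state(S, "replace") == NS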

3.2 SCSD No-Coherency Write Back Protocol

[Figure 4: SCSD No-Coherency Write Back Protocol - states NS (not shared), NS&D (not shared and dirty), S (shared) and S&D (shared and dirty); Pr,Pw = processor read and write; br,bw = bus read and write; D = dirty]

Figure 4 shows the SCSD write back no-coherency protocol. A data item enters the NS (not shared) state when a processor requests the data from main memory for the first time. The item enters the S (shared) state when a processor requests an NS item that is in the local cache of another processor. An item in state S or NS becomes dirty after a write operation. A read by the local processor of an item in state NS does not change the state of the item; a write changes the state to NS&D (not shared and dirty). A read by another processor (over the bus) of an item in state NS changes the state to S (shared), and a write to a not shared item (state NS) changes it to shared and dirty (state S&D). An item that is not shared and dirty (state NS&D) does not change its state when the local processor reads or writes it, and goes to shared and dirty (S&D) when another processor reads or writes it. A read of a shared item (state S) in the shared cache, by the local processor or by any other processor, does not change the state of the item; a write changes the state to shared and dirty (S&D). An item that is shared and dirty (state S&D) does not change state when the local processor or another processor reads or writes it. When a shared item (state S) is replaced by another item in the SC, it becomes not shared (goes to state NS); when a shared and dirty item (state S&D) is replaced, it becomes not shared and dirty (goes to state NS&D). The protocol does not use invalidate or update policies (no coherency check); it only snoops the bus when the requested shared item is not found in the SC. (A state-transition sketch of this protocol is given after the cost model tables below.)

4 The Cost Model

To evaluate the SCSD model, we construct approximate cost models. The total cost of a model is obtained by multiplying the cost of each operation by its probability and then adding the resulting latencies. We define the following parameters: T1 = access time of a local cache or the SC; Tm = access time of main memory (does not include bus time); Tb = mean bus waiting time for a shared memory model; Tb1 = mean bus waiting time for the shared cache in the dual bus architecture; Pr = probability that a memory access is a read; (1-Pr) = probability that a memory access is a write; Ps = probability that a memory access is to shared data; (1-Ps) = probability that a memory access is to not shared data; Pv = probability that a memory access is valid in the shared memory (SM) model; (1-Pv) = probability that a memory access is not valid in the SM model; Pd = probability that a memory access is dirty; (1-Pd) = probability that a memory access is clean; m1 = miss rate of the local caches; ms = miss rate of the shared cache. We find the cost model for the SCSD model and compare it with the cost model of the known single bus shared memory "SM" architecture. The table of Figure 5 shows the cost models for the SM (as in [8]) and the SCSD architecture using the write through protocol. The table of Figure 6 shows the cost models for the SM (as in [8]) and the SCSD architecture using the write back protocol.

Figure 5: Cost models for SM and SCSD Write Through

Operation  | SM cost                         | SCSD cost
Read hit   | Pv.T1 + (1-Pv).(2T1 + Tm + Tb)  | (1-Ps).T1 + Ps.(T1 + Tb1)
Read miss  | 2T1 + Tb + Tm                   | (1-Ps).(T1 + Tb + Tm) + Ps.(2T1 + Tb1)
Write hit  | Tb + Tm + 2T1                   | (1-Ps).(T1 + Tb + Tm) + Ps.(T1 + Tb + Tm)
Write miss | Tb + Tm + 2T1                   | (1-Ps).(T1 + Tb + Tm) + Ps.(2T1 + Tb1)

Figure 6: Cost models for SM and SCSD Write Back

Operation  | SM cost                                           | SCSD cost
Read hit   | (1-Ps).T1 + Ps.Pv.T1 + Ps.(1-Pv).(Tb + Tm + 2T1)  | (1-Ps).T1 + Ps.(T1 + Tb1)
Read miss  | Pd.(3T1 + 2Tm + Tb) + (1-Pd).(3T1 + Tm + Tb)      | (1-Ps).Pd.(T1 + 2Tm + Tb) + (1-Ps).(1-Pd).(T1 + Tm + Tb) + Ps.(2T1 + Tb1)
Write hit  | (1-Ps).T1 + Ps.(2T1 + Tb)                         | (1-Ps).T1 + Ps.(T1 + Tb1)
Write miss | T1 + Tm + Tb                                      | (1-Ps).(T1 + Tm + Tb) + Ps.(2T1 + Tb1)
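The four-state write back protocol of Section 3.2 can be summarized the same way. The Python sketch below follows the transitions described above; names and encoding are again illustrative.

# Sketch of the four-state write-back protocol of Figure 4.
# Events: 'Pr'/'Pw' = read/write by the local processor,
#         'br'/'bw' = read/write by another processor over the bus,
#         'replace' = the item is evicted from the shared cache SC.

NS, NS_D, S, S_D = "NS", "NS&D", "S", "S&D"

def wb_next_state(state, event):
    if state == NS:
        return {"Pr": NS, "Pw": NS_D, "br": S, "bw": S_D}.get(event, state)
    if state == NS_D:
        return {"Pr": NS_D, "Pw": NS_D, "br": S_D, "bw": S_D}.get(event, state)
    if state == S:
        if event in ("Pr", "br"):
            return S         # reads keep it shared and clean
        if event in ("Pw", "bw"):
            return S_D       # any write makes it shared and dirty
        if event == "replace":
            return NS        # evicted clean: back to not shared
    if state == S_D:
        if event == "replace":
            return NS_D      # evicted dirty: not shared and dirty
        return S_D           # reads and writes leave it shared and dirty
    return state

assert wb_next_state(NS, "bw") == S_D
assert wb_next_state(S, "Pw") == S_D
assert wb_next_state(S_D, "replace") == NS_D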
5 The Results

We use the following values for the model parameters: T1 = 1 cycle, Tm = 20 cycles, Tb = 100 cycles, Tb1 = 5 cycles for the dual bus architecture, Tb1 = 100 cycles for the single bus architecture, Pr = 0.7, Pv = 0.4, Pd = 0.4, Ps = 0.05 to 0.5, and m1 = ms = 0.05.
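As a rough illustration of how the write through cost models of Figure 5 combine with these parameter values, the Python sketch below computes an approximate average access cost for the SM and SCSD models over a range of sharing ratios. The way the four per-operation costs are combined (read/write weights Pr and 1-Pr, hit/miss weights from a single miss rate m, with ms taken equal to m1) and the T1 cost assumed for a not shared read hit in the SCSD are our assumptions for illustration, not formulas from the paper.

# Approximate write-through cost sketch for the SM and SCSD models (Figure 5),
# using the parameter values of Section 5. The weighting of the four operations
# below is an assumption made for illustration, not a formula from the paper.

T1, Tm, Tb = 1, 20, 100     # cache, memory and memory-bus times (cycles)
Tb1 = 5                     # SC bus waiting time, dual bus architecture
Pr, Pv = 0.7, 0.4           # read probability, valid probability (SM model)
m = 0.05                    # miss rate (m1 = ms assumed equal)

def wt_sm_cost():
    """Write-through cost for the single bus shared memory (SM) model."""
    read_hit   = Pv * T1 + (1 - Pv) * (2 * T1 + Tm + Tb)
    read_miss  = 2 * T1 + Tb + Tm
    write_hit  = Tb + Tm + 2 * T1
    write_miss = Tb + Tm + 2 * T1
    return (Pr * ((1 - m) * read_hit + m * read_miss)
            + (1 - Pr) * ((1 - m) * write_hit + m * write_miss))

def wt_scsd_cost(Ps):
    """Write-through cost for the SCSD model at sharing ratio Ps."""
    read_hit   = (1 - Ps) * T1 + Ps * (T1 + Tb1)          # NS hit assumed to cost T1
    read_miss  = (1 - Ps) * (T1 + Tb + Tm) + Ps * (2 * T1 + Tb1)
    write_hit  = (1 - Ps) * (T1 + Tb + Tm) + Ps * (T1 + Tb + Tm)
    write_miss = (1 - Ps) * (T1 + Tb + Tm) + Ps * (2 * T1 + Tb1)
    return (Pr * ((1 - m) * read_hit + m * read_miss)
            + (1 - Pr) * ((1 - m) * write_hit + m * write_miss))

for Ps in (0.05, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"Ps={Ps:.2f}  WT(SM)={wt_sm_cost():6.1f}  WT(SCSD)={wt_scsd_cost(Ps):6.1f}")

The same functions can be re-evaluated with Tb1 = 100 cycles for the single bus case, and the write back formulas of Figure 6 can be plugged in the same way.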

The values of the above parameters are selected to match those of similar shared memory multiprocessor systems, as in [5].

Figure 7 shows the total cost for the shared memory model compared to the total cost of the SCSD model for write through and write back. In this case we assume that both SM and SCSD use the single bus architecture, so Tb1 = 100 cycles (the same as for the memory bus).

[Figure 7: Results of single bus SM and SCSD models - cost in cycles versus sharing ratio for WT(SM), WT(SCSD), WB(SM) and WB(SCSD), where WT = write through and WB = write back]

The results show that the SCSD model reduces the cost of an access under the write through policy by 50% for a low sharing ratio and by 25% for a high sharing ratio. The cost of the SCSD model is similar to the cost of the SM model under the write back policy. In these results we did not account for the effect of invalidation on bus delay (which should be much smaller in the SCSD); furthermore, the miss rates of the shared cache and of the local caches in the SCSD are assumed to be the same as in the SM model (the separation of shared data from private data should reduce the miss rates of the SCSD).

Figure 8 shows the total cost for the shared memory model compared to the SCSD no-coherency model for write through and write back, assuming that the SCSD uses the dual bus architecture with Tb1 = 5 cycles.

[Figure 8: Results of dual bus SM and SCSD models - cost in cycles versus sharing ratio for WT(SM), WT(SCSD), WB(SM) and WB(SCSD)]

The results show that the SCSD model reduces the cost of an access under the write through policy by more than 50%. The cost of the SCSD model is much smaller than the cost of the SM model under the write back policy. The cost of an access in the SCSD model, for either write through or write back, does not depend on the sharing ratio, which indicates that this model could scale to a large number of processors. In these results we did not account for the effect of invalidation or for miss rate differences between the SM and SCSD models.

6 Conclusions and Future Work

We have introduced the new SCSD model, which uses separate local caches for private data and a single shared cache for shared data, and we have presented two different architectures to implement it: a cost effective single bus system and a high performance dual bus system. Two no-coherency protocols, write through and write back, are given; they implement the SCSD concept without any coherency check. The results of approximate cost models show that the SCSD architecture outperforms the shared memory architecture for a write through protocol, and that with the dual bus architecture the performance of the SCSD system is greatly improved and can become independent of the ratio of shared data. Our future plans include studying other architectures for the SCSD model, such as a multi-bank cache with a fast network for the SC, and evaluating the model more accurately using trace-driven simulation.

References

[1] Basem A. Nayfeh and Kunle Olukotun, "Exploring the Design Space for a Shared-Cache Multiprocessor", Proc. 21st Intl. Symp. on Computer Architecture, 1994.
[2] Erik Hagersten, Anders Landin, and Seif Haridi, "DDM - A Cache-Only Memory Architecture", IEEE Computer, vol. 25, no. 9, September 1992.
[3] Phil C. C. Yeh, Janak H. Patel, and Edward S. Davidson, "Shared Cache for Multiple-Stream Computer Systems", IEEE Transactions on Computers, vol. C-32, no. 1, January 1983.
[4] K. Uchiyama and H. Aoki, "Design of a Second-Level Cache Chip for Shared-Bus Multimicroprocessor Systems", IEEE Journal of Solid-State Circuits, vol. 26, no. 4, April 1991.
[5] M. Vernon and E. D. Lazowska, "An Accurate and Efficient Performance Analysis Technique for Multiprocessor Snooping Cache-Consistency Protocols", Proc. 15th Annu. Symp. on Computer Architecture, Honolulu, HI, June 1988.
[6] M. C. Chiang and G. S. Sohi, "Evaluating Design Choices for Shared Bus Multiprocessors in a Throughput-Oriented Environment", IEEE Transactions on Computers, vol. 41, no. 3, March 1992.
[7] John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, California, 1990.
[8] Faye A. Briggs, "Synchronization, Coherence, and Event Ordering in Multiprocessors", IEEE Computer, pp. 9-21, February 1988.
