Introducing the SCSD "Shared Cache for Shared Data" Multiprocessor Architecture. Nagi N. Mekhiel


Nagi N. Mekhiel
Department of Electrical and Computer Engineering, Ryerson Polytechnic University, Toronto, Ontario M5B 2K3
Yarc Systems, Newbury Park, CA 91320

Abstract

The SCSD model improves the performance of shared memory multiprocessor systems by separating shared data from private data: private data migrate to the local cache of each processor and shared data migrate to a shared cache. We present the architecture and protocols for the SCSD model. The protocols need not perform any consistency check, which reduces the demand for the shared bus. Results show that the SCSD model reduces the cost of an access and that, with a dual bus, the cost can become independent of the degree of data sharing.

1 Introduction

Shared memory multiprocessor systems provide the programmer with a simple and easy programming environment and use a single bus through which all processors access memory. Single bus shared memory systems suffer from bus saturation. An effective solution to the bus saturation problem is to give each processor a local cache. A cache coherency protocol is then needed to keep copies of the same data item in different caches consistent [8],[6]. Coherency protocols use the shared bus for snooping, which increases the demand for the bus and limits the scalability of the system [4],[6]. The cache coherency problem can be eliminated or reduced by using a single shared cache [1],[3]. The problem with sharing a single cache is that more than one processor may access it at the same time, so it becomes the system bottleneck (access conflicts). Several studies have discussed and evaluated the shared cache architecture [1],[2],[3]. In all of this work, private and shared data existed in the same cache and competed with each other, which makes the shared cache less efficient and causes access conflicts. In this paper, we introduce the SCSD model and present a suitable architecture, protocols and a cost model to evaluate its performance.

2 SCSD Architecture and Concept

Figure 1 shows the architecture of SCSD with a single and with a dual bus. The processors use local caches for private data and share a single cache (SC) for shared data. The local caches and the SC can use write through or write back policies. The cache tags do not need private/shared or valid/invalid bits: shared data exists only in the SC and private data exists only in the local caches, so valid and shared bits are unnecessary. The single bus model uses one bus for the shared memory (SM), the local caches and the shared cache (SC). The dual bus model uses one bus for the shared memory and another bus for the SC; both buses can be used simultaneously. Bus snooping is needed only to identify items that change from not shared to shared and to convert them. All processors are RISC, Harvard type processors running one instruction per clock with pipelining, and they share a single address space and the same memory.

2.1 SCSD Concept

Figure 2 shows the concept of the SCSD model. Private items map to the local caches of their processors. A shared item is first transferred to one of the local caches when it is requested for the first time; if it is later requested by another processor, the system transfers it to the shared cache using a swap operation: the item moves to the SC and becomes shared, and the item it replaces in the SC moves to the location the first item frees in the local cache (the swap operation).
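To make the swap operation concrete, the following Python sketch models it under simplifying assumptions: direct-mapped caches represented as dictionaries, a single shared cache, and illustrative names (swap_to_shared, local_caches, shared_cache) that are not taken from the paper.

# Minimal sketch of the SCSD swap operation (Section 2.1).
# Caches are modeled as dictionaries {set_index: (tag, data)}; all names
# here are illustrative, not defined in the paper.

NUM_SETS = 4  # tiny caches, for illustration only

def set_index(addr):
    return addr % NUM_SETS

def swap_to_shared(owner_p, addr, local_caches, shared_cache):
    """The item at `addr` is private in owner_p's local cache but has now
    been requested by another processor: move it to the shared cache (it
    becomes shared) and move the item it displaces there into the slot it
    frees in owner_p's local cache (the swap operation)."""
    idx = set_index(addr)
    displaced = shared_cache.get(idx)                    # item being replaced in the SC
    shared_cache[idx] = local_caches[owner_p].pop(idx)   # requested item becomes shared
    if displaced is not None:
        local_caches[owner_p][idx] = displaced           # displaced item takes the freed slot

# Usage: processor 0 holds address 6 privately; processor 1 then requests it.
local_caches = {0: {set_index(6): ("tag6", "data@6")}, 1: {}}
shared_cache = {set_index(6): ("tag2", "data@2")}
swap_to_shared(owner_p=0, addr=6, local_caches=local_caches, shared_cache=shared_cache)
print(shared_cache, local_caches)

Because the displaced SC item reuses the freed local cache slot, no line is left invalid, which is consistent with the observation above that the cache tags need no valid bits.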

[Figure 1: SCSD Architecture - single bus SCSD model and dual bus SCSD model]

3 The SCSD No-Coherency Protocols

The protocols for the SCSD model need not perform any consistency check because shared data and private data exist in separate caches. Only one copy of private data and one copy of shared data exist in the caches. Snooping is needed only when a processor requests a shared data item that exists in another processor's local cache as a private item. The main purpose of the protocol is to separate shared from private data and to forbid multiple copies of the same data item from existing in the caches.

3.1 SCSD No-Coherency Write Through Protocol

[Figure 2: SCSD Concept - private items in the local caches, a shared item swapped into the shared cache]

[Figure 3: SCSD No-Coherency Write Through Protocol - states NS (not shared) and S (shared); Pr,Pw = processor read and write; br,bw = bus read and write]

Figure 3 shows the SCSD write through no-coherency protocol. An item enters the NS (not shared) state when a processor requests the data from main memory for the first time. The item enters the S (shared) state when a processor requests an NS item that is in the local cache of another processor. A read or write by the local processor to an item in state NS does not change the state of the item. A read or write by another processor (over the bus) to an item in state NS causes the item to become shared (it goes to state S). A read or write by the local processor or by any other processor to an item in state S does not change the state of the item. When an item in state S is replaced by another item in the SC, it becomes not shared (goes to state NS). The protocol does not use invalidate or update policies (no coherency check); it only snoops the bus when the requested shared item is not found in the SC.
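The two-state write through protocol of Figure 3 can be summarized as a small transition function. The Python sketch below follows the transitions described above; the event names and the encoding of states are illustrative, not defined in the paper.

# Sketch of the two-state write-through protocol of Figure 3.
# Events: 'Pr'/'Pw' = read/write by the local processor,
#         'br'/'bw' = read/write by another processor over the bus,
#         'replace' = the item is evicted from the shared cache SC.

NS, S = "NS", "S"   # not shared / shared

def wt_next_state(state, event):
    if state == NS:
        if event in ("Pr", "Pw"):
            return NS        # local accesses keep the item private
        if event in ("br", "bw"):
            return S         # another processor touches it: it becomes shared
    if state == S:
        if event in ("Pr", "Pw", "br", "bw"):
            return S         # reads and writes leave it shared
        if event == "replace":
            return NS        # evicted from the SC: not shared again
    return state

assert wt_next_state(NS, "br") == S
assert wt_next_state(S, "replace") == NS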

3.2 SCSD No-Coherency Write Back Protocol

[Figure 4: SCSD No-Coherency Write Back Protocol - states NS (not shared), NS&D (not shared and dirty), S (shared) and S&D (shared and dirty); Pr,Pw = processor read and write; br,bw = bus read and write; D = dirty]

Figure 4 shows the SCSD write back no-coherency protocol. A data item enters the NS (not shared) state when a processor requests the data from main memory for the first time. The item enters the S (shared) state when a processor requests an NS item that is in the local cache of another processor. An item in state S or NS becomes dirty after a write operation. A read by the local processor of an item in state NS does not change the state of the item; a write changes the state to NS&D (not shared and dirty). A read by another processor (over the bus) of an item in state NS changes the state to S (shared), and a write to a not shared item (state NS) changes it to shared and dirty (state S&D). An item that is not shared and dirty (state NS&D) does not change its state when the local processor reads or writes it, and goes to shared and dirty (S&D) when another processor reads or writes it. A read of a shared item (state S) in the shared cache, by the local processor or by any other processor, does not change the state of the item; a write changes the state to shared and dirty (S&D). An item that is shared and dirty (state S&D) does not change state when the local processor or another processor reads or writes it. When a shared item (state S) is replaced by another item in the SC, it becomes not shared (goes to state NS); when a shared and dirty item (state S&D) is replaced, it becomes not shared and dirty (goes to state NS&D). The protocol does not use invalidate or update policies (no coherency check); it only snoops the bus when the requested shared item is not found in the SC. (A state-transition sketch of this protocol is given after the cost model tables below.)

4 The Cost Model

To evaluate the SCSD model, we construct approximate cost models. The total cost of a model is obtained by multiplying the cost of each operation by its probability and then adding the resulting latencies. We define the following parameters: T1 = access time of a local cache or the SC; Tm = access time of main memory (does not include bus time); Tb = mean bus waiting time for a shared memory model; Tb1 = mean bus waiting time for the shared cache in the dual bus architecture; Pr = probability that a memory access is a read; (1-Pr) = probability that a memory access is a write; Ps = probability that a memory access is to shared data; (1-Ps) = probability that a memory access is to not shared data; Pv = probability that a memory access is valid in the shared memory (SM) model; (1-Pv) = probability that a memory access is not valid in the SM model; Pd = probability that a memory access is dirty; (1-Pd) = probability that a memory access is clean; m1 = miss rate of the local caches; ms = miss rate of the shared cache. We find the cost model for the SCSD model and compare it with the cost model of the known single bus shared memory "SM" architecture. The table of Figure 5 shows the cost models for the SM (as in [8]) and the SCSD architecture using the write through protocol. The table of Figure 6 shows the cost models for the SM (as in [8]) and the SCSD architecture using the write back protocol.

Figure 5: Cost models for SM and SCSD Write Through

Operation  | SM cost                         | SCSD cost
Read hit   | Pv.T1 + (1-Pv).(2T1 + Tm + Tb)  | (1-Ps).T1 + Ps.(T1 + Tb1)
Read miss  | 2T1 + Tb + Tm                   | (1-Ps).(T1 + Tb + Tm) + Ps.(2T1 + Tb1)
Write hit  | Tb + Tm + 2T1                   | (1-Ps).(T1 + Tb + Tm) + Ps.(T1 + Tb + Tm)
Write miss | Tb + Tm + 2T1                   | (1-Ps).(T1 + Tb + Tm) + Ps.(2T1 + Tb1)

Figure 6: Cost models for SM and SCSD Write Back

Operation  | SM cost                                           | SCSD cost
Read hit   | (1-Ps).T1 + Ps.Pv.T1 + Ps.(1-Pv).(Tb + Tm + 2T1)  | (1-Ps).T1 + Ps.(T1 + Tb1)
Read miss  | Pd.(3T1 + 2Tm + Tb) + (1-Pd).(3T1 + Tm + Tb)      | (1-Ps).Pd.(T1 + 2Tm + Tb) + (1-Ps).(1-Pd).(T1 + Tm + Tb) + Ps.(2T1 + Tb1)
Write hit  | (1-Ps).T1 + Ps.(2T1 + Tb)                         | (1-Ps).T1 + Ps.(T1 + Tb1)
Write miss | T1 + Tm + Tb                                      | (1-Ps).(T1 + Tm + Tb) + Ps.(2T1 + Tb1)
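The four-state write back protocol of Section 3.2 can be summarized the same way. The Python sketch below follows the transitions described above; names and encoding are again illustrative.

# Sketch of the four-state write-back protocol of Figure 4.
# Events: 'Pr'/'Pw' = read/write by the local processor,
#         'br'/'bw' = read/write by another processor over the bus,
#         'replace' = the item is evicted from the shared cache SC.

NS, NS_D, S, S_D = "NS", "NS&D", "S", "S&D"

def wb_next_state(state, event):
    if state == NS:
        return {"Pr": NS, "Pw": NS_D, "br": S, "bw": S_D}.get(event, state)
    if state == NS_D:
        return {"Pr": NS_D, "Pw": NS_D, "br": S_D, "bw": S_D}.get(event, state)
    if state == S:
        if event in ("Pr", "br"):
            return S         # reads keep it shared and clean
        if event in ("Pw", "bw"):
            return S_D       # any write makes it shared and dirty
        if event == "replace":
            return NS        # evicted clean: back to not shared
    if state == S_D:
        if event == "replace":
            return NS_D      # evicted dirty: not shared and dirty
        return S_D           # reads and writes leave it shared and dirty
    return state

assert wb_next_state(NS, "bw") == S_D
assert wb_next_state(S, "Pw") == S_D
assert wb_next_state(S_D, "replace") == NS_D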
5 The Results

We use the following values for the model parameters: T1 = 1 cycle, Tm = 20 cycles, Tb = 100 cycles, Tb1 = 5 cycles for the dual bus architecture, Tb1 = 100 cycles for the single bus architecture, Pr = 0.7, Pv = 0.4, Pd = 0.4, Ps = 0.05 to 0.5, and m1 = ms = 0.05.
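As a rough illustration of how the write through cost models of Figure 5 combine with these parameter values, the Python sketch below computes an approximate average access cost for the SM and SCSD models over a range of sharing ratios. The way the four per-operation costs are combined (read/write weights Pr and 1-Pr, hit/miss weights from a single miss rate m, with ms taken equal to m1) and the T1 cost assumed for a not shared read hit in the SCSD are our assumptions for illustration, not formulas from the paper.

# Approximate write-through cost sketch for the SM and SCSD models (Figure 5),
# using the parameter values of Section 5. The weighting of the four operations
# below is an assumption made for illustration, not a formula from the paper.

T1, Tm, Tb = 1, 20, 100     # cache, memory and memory-bus times (cycles)
Tb1 = 5                     # SC bus waiting time, dual bus architecture
Pr, Pv = 0.7, 0.4           # read probability, valid probability (SM model)
m = 0.05                    # miss rate (m1 = ms assumed equal)

def wt_sm_cost():
    """Write-through cost for the single bus shared memory (SM) model."""
    read_hit   = Pv * T1 + (1 - Pv) * (2 * T1 + Tm + Tb)
    read_miss  = 2 * T1 + Tb + Tm
    write_hit  = Tb + Tm + 2 * T1
    write_miss = Tb + Tm + 2 * T1
    return (Pr * ((1 - m) * read_hit + m * read_miss)
            + (1 - Pr) * ((1 - m) * write_hit + m * write_miss))

def wt_scsd_cost(Ps):
    """Write-through cost for the SCSD model at sharing ratio Ps."""
    read_hit   = (1 - Ps) * T1 + Ps * (T1 + Tb1)          # NS hit assumed to cost T1
    read_miss  = (1 - Ps) * (T1 + Tb + Tm) + Ps * (2 * T1 + Tb1)
    write_hit  = (1 - Ps) * (T1 + Tb + Tm) + Ps * (T1 + Tb + Tm)
    write_miss = (1 - Ps) * (T1 + Tb + Tm) + Ps * (2 * T1 + Tb1)
    return (Pr * ((1 - m) * read_hit + m * read_miss)
            + (1 - Pr) * ((1 - m) * write_hit + m * write_miss))

for Ps in (0.05, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"Ps={Ps:.2f}  WT(SM)={wt_sm_cost():6.1f}  WT(SCSD)={wt_scsd_cost(Ps):6.1f}")

The same functions can be re-evaluated with Tb1 = 100 cycles for the single bus case, and the write back formulas of Figure 6 can be plugged in the same way.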

The values of the above parameters are selected to match those of similar shared memory multiprocessor systems, as in [5].

Figure 7 shows the total cost for the shared memory model compared to the total cost of the SCSD model for write through and write back. In this case we assume that both SM and SCSD use the single bus architecture, so Tb1 = 100 cycles (the same as for the memory bus).

[Figure 7: Results of single bus SM and SCSD models - cost in cycles versus sharing ratio for WT(SM), WT(SCSD), WB(SM) and WB(SCSD), where WT = write through and WB = write back]

The results show that the SCSD model reduces the cost of an access under the write through policy by 50% for a low sharing ratio and by 25% for a high sharing ratio. The cost of the SCSD model is similar to the cost of the SM model under the write back policy. In these results we did not account for the effect of invalidation on bus delay (which should be much smaller in the SCSD); furthermore, the miss rates of the shared cache and of the local caches in the SCSD are assumed to be the same as in the SM model (the separation of shared data from private data should reduce the miss rates of the SCSD).

Figure 8 shows the total cost for the shared memory model compared to the SCSD no-coherency model for write through and write back, assuming that the SCSD uses the dual bus architecture with Tb1 = 5 cycles.

[Figure 8: Results of dual bus SM and SCSD models - cost in cycles versus sharing ratio for WT(SM), WT(SCSD), WB(SM) and WB(SCSD)]

The results show that the SCSD model reduces the cost of an access under the write through policy by more than 50%. The cost of the SCSD model is much smaller than the cost of the SM model under the write back policy. The cost of an access in the SCSD model, for either write through or write back, does not depend on the sharing ratio, which indicates that this model could scale to a large number of processors. In these results we did not account for the effect of invalidation or for miss rate differences between the SM and SCSD models.

6 Conclusions and Future Work

We have introduced the new SCSD model, which uses separate local caches for private data and a single shared cache for shared data, and we have presented two different architectures to implement it: a cost effective single bus system and a high performance dual bus system. Two no-coherency protocols, write through and write back, are given; they implement the SCSD concept without any coherency check. The results of approximate cost models show that the SCSD architecture outperforms the shared memory architecture for a write through protocol, and that with the dual bus architecture the performance of the SCSD system is greatly improved and can become independent of the ratio of shared data. Our future plans include studying other architectures for the SCSD model, such as a multi-bank cache with a fast network for the SC, and evaluating the model more accurately using trace-driven simulation.

References

[1] Basem A. Nayfeh and Kunle Olukotun, "Exploring the Design Space for a Shared-Cache Multiprocessor", Proc. 21st Intl. Symp. on Computer Architecture, 1994.
[2] Erik Hagersten, Anders Landin, and Seif Haridi, "DDM - A Cache-Only Memory Architecture", IEEE Computer, vol. 25, no. 9, September 1992.
[3] Phil C. C. Yeh, Janak H. Patel, and Edward S. Davidson, "Shared Cache for Multiple-Stream Computer Systems", IEEE Transactions on Computers, vol. C-32, no. 1, January 1983.
[4] K. Uchiyama and H. Aoki, "Design of a Second-Level Cache Chip for Shared-Bus Multimicroprocessor Systems", IEEE Journal of Solid-State Circuits, vol. 26, no. 4, April 1991.
[5] M. Vernon and E. D. Lazowska, "An Accurate and Efficient Performance Analysis Technique for Multiprocessor Snooping Cache-Consistency Protocols", Proc. 15th Annu. Symp. on Computer Architecture, Honolulu, HI, June 1988.
[6] M. C. Chiang and G. S. Sohi, "Evaluating Design Choices for Shared Bus Multiprocessors in a Throughput-Oriented Environment", IEEE Transactions on Computers, vol. 41, no. 3, March 1992.
[7] John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, California, 1990.
[8] Faye A. Briggs, "Synchronization, Coherence, and Event Ordering in Multiprocessors", IEEE Computer, pp. 9-21, February 1988.
