Complexity Analysis of Cache Mechanisms for Speculative Multithreading Chip Multiprocessors


Yoshimitsu Yanagawa, Luong Dinh Hung, Chitaka Iwama, Niko Demus Barli, Shuichi Sakai and Hidehiko Tanaka

1 Introduction

1.1 Background

Chip Multiprocessors

With rapidly improving semiconductor technology, we can integrate a large number of transistors on a single silicon chip. To obtain higher performance, superscalar architectures, which exploit the Instruction Level Parallelism (ILP) in a program, are widely used today. Thanks to the enormous number of transistors available on a chip, superscalar processors with wider issue widths and larger caches have been built for higher performance. However, because the ILP in a program is limited, it is difficult to increase performance simply by parallelizing more instructions. In addition, the more complex a superscalar processor becomes, the more expensive its development. To address this problem, another architecture, the Chip Multiprocessor (CMP), has been proposed. In a CMP, a superscalar processor core is duplicated across the chip as multiple Processing Units (PUs). By executing multiple threads simultaneously on these PUs, a CMP exploits Thread Level Parallelism (TLP) and therefore has the potential to achieve good performance on multithreaded programs. A CMP has another advantage besides this performance increase: each processor core in a CMP need not be as complex as a state-of-the-art superscalar processor to deliver the same performance, because it can exploit TLP in addition to ILP at execution time. Also, because the core is duplicated, a CMP is easier to design than a complex single-core superscalar processor. This simplicity shortens the design period and cuts cost.

Speculative Multithreading

Except for numerical programs, most general-purpose codes are integer programs. In general, integer programs have little potential to be parallelized, and cutting such a program into threads may yield dependences among the threads. Because of these dependences, the interdependent threads cannot be executed simultaneously, which makes it difficult to run an integer program efficiently on a CMP. Speculative multithreading is a technique that executes multiple threads at the same time even if dependences exist among them. It extracts more parallelism, even from integer programs, by ignoring the dependences, so that many kinds of programs can be executed on a CMP with reasonable efficiency using multiple cores simultaneously. One important point in speculative multithreading is violation detection. Threads are executed all at once without regard for inter-thread dependences, so violations may occur; if a dependence violation occurs, the execution result may be incorrect. It is therefore indispensable to detect violations and handle them properly afterwards.

Memory Speculation

Memory speculation is one of the key components of speculative multithreading. An example is shown in Fig. 1(a). The original sequential code is divided into two threads (Th0 and Th1), which are allocated to PU2 and PU3, respectively. According to the original code they are cut from, Th0 should be executed before Th1. When Th0 is to be executed earlier than Th1 in the original sequence, we say that the speculation level of Th1 is higher than that of Th0.

There is a store instruction in Th0 whose destination address is p, and a load instruction in Th1 whose address is q. These destination addresses cannot be determined at compile time, so the dependence between the two threads is ambiguous. In speculative multithreading, as depicted in Fig. 1(a), the two threads are executed at the same time. If address p equals q, the store and the load are executed in reverse order on the time line; the two instructions may be dependent, but the load is executed speculatively. This is memory speculation.

(Figure 1: Speculative multithreading. (a) Memory speculation: Th0 on PU2 stores to p while Th1 on PU3 loads from q; if p = q, a violation occurs. (b) Thread squash and re-execution: Th1 is squashed and re-executed.)

If the destination addresses are the same (p = q), Th1 may load incorrect data and return an irrelevant result (Fig. 1(b)). This is called a RAW violation. Violations cannot be avoided completely in speculative multithreading, so when a violation happens, some treatment, such as squashing and re-executing the thread, must be performed for the code to execute properly.

1.2 Research Objectives

Recently, many speculative multithreading architectures for CMPs have been proposed, such as Hydra [1], STAMPede [3] and Multiscalar [6], and many of them support memory speculation. In these architecture models, a controller supporting memory speculation sits between a processing unit and the cache memory. Since all data accesses from each processing unit pass through this controller, it may limit the performance gain brought by speculative multithreading. For example, if the controller is located between a processing unit and the processor's L2 cache, the controller latency is added to the L2 cache latency; since L2 latency has a large impact on the execution time of a program on the CMP, minimizing the controller latency is indispensable. This paper proposes a cache coherency protocol for speculative memory accesses on a CMP and presents a hardware design for a controller that supports this protocol. We then evaluate the hardware cost, namely the delay and area of the controller, to investigate its influence on CMP performance. The rest of the paper is organized as follows. Chapter 2 introduces the basic CMP architecture model assumed in this paper and defines terms related to speculative multithreading. Chapter 3 describes related work on hardware/software support for memory speculation. Chapter 4 presents the cache coherency protocol we propose, and Chapter 5 evaluates the complexity of the controller hardware supporting it. Finally, Chapter 6 concludes the paper.

2 Baseline Model

2.1 Execution Model

In the model assumed in this paper, four superscalar processor cores are placed on one chip as PUs (Figure 2). Each PU has a private L1 cache, and all data the core accesses is provided via this local cache. All PUs are connected to a bus and share an L2 cache; any data transfer between the PUs and the L2 cache goes through this bus. To support speculative multithreading, the model contains a thread control unit that manages speculative thread execution: it dynamically allocates threads to PUs and handles thread commit and squash.

(Figure 2: Overall structure.)

The model supports data speculation. Although data speculation has two forms, register speculation and memory speculation, only the latter is supported in this model; instead of register speculation, register values are synchronized among the PUs by a Register Synchronization Unit. To execute a program by speculative multithreading on this architecture, the original sequential code is cut into threads of appropriate size by a compiler.

Some ambiguous memory dependences may exist at this point, but all dependences among threads are ignored and the code is divided into threads anyway. There is a unique order among these threads, based on the original execution sequence. Focusing on a particular thread while multiple threads execute simultaneously, we call a thread that should be executed earlier in the original program sequence a predecessor thread, and a thread that should be executed later a successor thread. If a thread has no predecessor at a given point, it is non-speculative, and all of its successors are speculative. We also define a speculation level, corresponding to the number of predecessor threads: the more predecessors a thread has, the higher its speculation level. For example, the non-speculative thread's speculation level is 0, and a thread with two predecessors has level 2. At execution time, the thread control unit predicts, using branch prediction, which threads should be executed and allocates them as follows. The non-speculative thread, whose level is 0, is assigned to one PU, and the remaining PUs are given speculative threads with levels 1, 2 and 3, respectively. When the non-speculative thread finishes, the thread that should actually execute next becomes known. If it does not correspond to the thread that was executed, that is, if the thread speculation failed, the thread is discarded and all modifications made to the L1 cache contents during its execution are cancelled; we say the thread is squashed. If the prediction was correct, the thread with level 1 becomes non-speculative (its level is decremented to 0) and the levels of the other successor threads are likewise decremented. A new speculative thread with level 3 is then assigned to the PU that finished the non-speculative thread. If a speculative thread finishes before all of its predecessors, it stalls until all predecessors have executed and committed. This is the basic execution cycle, repeated until the program finishes.

2.2 Memory Access Model

The memory access model described in this paper is based on the MSI protocol, with special mechanisms added to support memory speculation. Memory accesses follow the principles below; a minimal sketch of the resulting ordering rules is given after the list.

Violation Detection. To execute a program correctly without violating dependences, violations are detected by additional hardware.

Thread Squash and Re-execution. When a data dependence violation is detected, the thread that violated the dependence and all of its successor threads are squashed and re-executed.

Memory Access Constraints. In this model, no speculative PU may modify the L2 cache; only the non-speculative one can.

Forwarding. To provide a speculative PU with the newest data, data is forwarded from other PUs whenever possible. If no predecessor PU has the relevant data, it is read from the L2 cache.
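For illustration, the following sketch shows how a thread control unit might track speculation levels and enforce the constraint that only the non-speculative thread writes the L2 cache. This is a minimal software model, not the paper's hardware interface; all names and the encoding are our assumptions.

```cpp
#include <array>

// One slot per processing unit in the four-PU baseline model.
struct ThreadSlot {
    int  spec_level;  // 0 = non-speculative, 1..3 = speculative
    bool running;
};

struct ThreadControlUnit {
    std::array<ThreadSlot, 4> pu;

    // Memory access constraint: only the non-speculative thread
    // may write back to the shared L2 cache.
    bool may_write_l2(int pu_id) const {
        return pu[pu_id].running && pu[pu_id].spec_level == 0;
    }

    // On a successful commit, every surviving thread moves one step
    // closer to non-speculative, and the freed PU receives a fresh
    // thread at the highest speculation level (3 in this model).
    void commit_nonspeculative(int committed_pu) {
        for (ThreadSlot& slot : pu)
            if (slot.running) slot.spec_level -= 1;
        pu[committed_pu] = ThreadSlot{3, true};
    }
};
```

Because exactly one PU holds level 0 at any time, `may_write_l2` serializes all L2 updates in original program order, which is the property the memory access constraint relies on.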

3 Related Work

Many models and hardware/software supports for thread-level memory speculation have been proposed [2, 4, 5, 8-10]. Among them, [2, 4, 8] provide details on how to implement the mechanism in CMP caches. Hydra [2] proposed an extension of the cache directory to handle the speculative state of cache lines. The speculative state is managed on a per-line basis; a write-through policy is employed, and a speculation buffer is attached to each cache to hold speculatively stored values. STAMPede [4] extends the traditional MESI protocol with additional states to support memory speculation. Similar to Hydra, speculative state is managed per line, but STAMPede differs in that it requires no special buffer to hold speculative memory values. In [8], memory speculation is performed using a centralized table called the Memory Disambiguation Table (MDT). The MDT is located between the private L1 caches and the shared L2 cache; it records the loads and stores executed on the L1 caches and manages memory state on a per-word basis. Since the number of MDT entries is limited, memory operations of speculative threads must stall when the table is full. This work takes a middle course between the latter two approaches. Rather than using a centralized table, we extend the MSI protocol to support memory speculation, as STAMPede does; however, we choose to manage memory state on a per-word basis as in [8], because previous work has shown that maintaining state on a per-line basis results in poor performance [8].

4 Cache Coherency Protocol

4.1 Basic Concepts

In supporting speculative memory accesses, the most important issue is how to handle dependence violations among threads. As long as violations are handled properly, the only remaining task is thread control, such as branch prediction and thread allocation. The critical points in handling violations fall into the three categories discussed below.

Avoiding Unnecessary Violations. It is natural to reduce the number of violations before considering how to treat them. Permitting multiple versions of the same data to exist at the same time works against WAW and WAR violations. To avoid them, it is necessary to keep the order of memory writes the same as in the original program sequence, which is feasible by holding multiple versions of data separately and writing them back to memory in the correct order. Our model uses the private L1 caches, whose contents are invisible to other PUs except when data forwarding is needed, and the shared L2 cache, whose contents are always visible to every PU. All speculative versions of data stored by the PUs are held in their private L1 caches, and only the non-speculative version may be written back to the L2 cache; that is, a PU can write back its data to L2 once the thread it executes becomes non-speculative. Because the PUs have, as noted above, distinct speculation levels and only one PU can be non-speculative at a time, this restriction suffices to order the write-backs. No speculative version ever exists in the L2 cache, so the order of memory writes is guaranteed to match the original code, and WAW violations cannot occur. In addition, we require that any data a PU needs be provided only by the L1 cache of a predecessor PU or by the shared L2 cache, never by a successor.
This ensures that no WAR violation occurs, because a PU can never obtain data from a thread that should execute later in the original sequence. With these schemes, WAW and WAR violations are eliminated completely. It is difficult, however, to eliminate RAW violations absolutely. To reduce their number, we forward data from the nearest predecessor PU that has modified the requested data, providing data that is as new as possible to the requesting PU; a sketch of this selection rule follows. Even so, RAW violations remain possible, because forwarding does not guarantee that no further modification will occur in the future.
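The forwarding-source rule just described can be made concrete with a small sketch. This is an illustrative software model under our own naming; the hardware realizes the same comparison with the Owner PU lines and speculation-level comparators described in Section 4.2.

```cpp
#include <array>
#include <optional>

// Per-PU view of one word: whether a copy exists and whether it was
// modified (dirty) during the PU's current thread.
struct WordCopy {
    bool present;
    bool dirty;
};

// Pick the forwarding source for a requested word: among the
// predecessors (lower speculation level than the requester) holding a
// dirty copy, the one closest in program order to the requester wins.
// std::nullopt means no predecessor can forward, so read from L2.
std::optional<int> pick_forwarder(const std::array<WordCopy, 4>& copies,
                                  const std::array<int, 4>& spec_level,
                                  int requester_pu) {
    std::optional<int> best;
    for (int pu = 0; pu < 4; ++pu) {
        if (pu == requester_pu) continue;
        if (spec_level[pu] >= spec_level[requester_pu]) continue;  // successor
        if (!copies[pu].present || !copies[pu].dirty) continue;    // nothing to give
        if (!best || spec_level[pu] > spec_level[*best]) best = pu;
    }
    return best;
}
```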

Violation Detection. Since RAW violations cannot be eliminated, they must be detected to avoid incorrect execution results. In this model, RAW violation detection is achieved by holding the history of speculative loads issued by each PU. Every PU has its own table containing the speculative load history and other information about the data in its private L1 cache; this information is associated with every datum in the L1 cache. A load may be executed at any time by any PU, and if data is loaded speculatively during a thread's execution, that fact is saved in the history table. The information is kept in the cache directory until the thread commits or is squashed, so when some PU issues a store, every successor PU can detect a RAW violation dynamically by consulting this information to see whether it has previously loaded data from the same address within the same thread. If the history shows that the data was loaded speculatively, the RAW dependence has been violated. Since these history entries have a one-to-one relationship with the data in the L1 cache, they would be lost when the corresponding data is removed on a cache replacement; because the history must be kept until the thread finishes, such data is excluded from the candidates for replacement. One important point to bear in mind is that a store by a predecessor PU to an address with a matching load-history entry does not always mean a RAW violation: when the speculatively loaded data was forwarded from one predecessor PU, and the store is then executed by another predecessor whose speculation level is lower, this sequence of operations does not violate the RAW dependence (Figure 3).

(Figure 3: An example of speculation success. Th0 stores to p, Th1 stores to q, and Th2 loads r, which is forwarded from Th1. Even if p = q = r, the value stored by Th1 has already been forwarded to Th2, so the store to p does not cause a RAW violation.)

Execution Recovery. To execute program code correctly under speculative multithreading, proper action must be taken when a RAW violation is detected. A simple and safe treatment is to squash the thread that violated the RAW dependence, roll all data in its private L1 cache back to the state at the beginning of execution, and then re-execute the thread. In this model, all successors of the violating thread are squashed as well, because they may have consumed data that was modified by the violating PU and forwarded from it. Of course, some successors may not have used such data and could in principle be spared, but squashing threads selectively requires additional complex hardware; we therefore decided to squash the violating thread and all of its successors and re-execute them. To recover the state of all data in the squashed PU, all state bits are reset and the data modified during the thread's execution is invalidated. These operations are performed at squash time.

4.2 Organization

We integrated a hardware mechanism into each processing unit's cache controller to support thread-level memory speculation. The organization of the controller is shown in Figure 4. The controller mainly comprises four units: a state controller that manages the cache states, a cache directory that holds the cache states and speculation history, a violation detector that detects memory violations, and a data forwarder that controls data forwarding between PUs. The controller manages the state of memory by snooping memory events broadcast on a shared memory bus. This bus includes a data bus and an address bus, together with a number of control lines necessary to maintain the consistency of memory (Figure 5).
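To make the violation detector's decision concrete, the following sketch combines the two invalidation flavours and the RAW check that the protocol operations in Section 4.4 describe in detail. The function and parameter names are our assumptions; a lower speculation level means earlier in program order.

```cpp
#include <array>

enum class Action { None, DelayedInvalidate, Invalidate, SquashRAW };

// Decide what to do when another PU's store to a word we hold is snooped
// on the bus. store_bits records which PUs have stored to this word.
Action on_snoop_store(int my_level, int storer_level, bool have_copy,
                      bool load_bit,
                      const std::array<bool, 4>& store_bits,
                      const std::array<int, 4>& spec_level) {
    if (!have_copy) return Action::None;
    if (storer_level > my_level)          // store by a successor: our copy
        return Action::DelayedInvalidate; // goes stale, dies at our commit
    // Store by a predecessor. If a closer predecessor (level between the
    // storer's and ours) already stored, our forwarded value came from
    // that closer version and is still correct (the Figure 3 case).
    for (int pu = 0; pu < 4; ++pu)
        if (store_bits[pu] && spec_level[pu] > storer_level &&
            spec_level[pu] < my_level)
            return Action::DelayedInvalidate;
    // Otherwise our copy is outdated: invalidate it, and if our first
    // access to the word in this thread was a load, a RAW dependence
    // was violated and the thread (plus successors) must be squashed.
    return load_bit ? Action::SquashRAW : Action::Invalidate;
}
```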

(Figure 4: Cache controller organization, showing the superscalar core, the L1 cache, the cache directory, the state controller, the speculation level register and comparators, the violation detector, and the forwarder, attached to the shared bus and the thread control unit. Figure 5: Bus model, with data and address lines, Rd/Wr/Invalidate control lines, and per-PU Requester PU and Owner PU lines.)

4.3 Finite State Machine

To support the scheme above, we define seven states: Modified (M), Shared (S), Modified-Loaded (ML), Shared-Loaded (SL), Modified-Forwarded-Loaded (MFL), Shared-Forwarded-Loaded (SFL) and Invalid (I). The classification of these states is shown in Figure 6. The states are defined per word in the private L1 cache of each PU, and they are categorized according to the four criteria below.

Valid or invalid.

First access to the data. It matters whether the first access to a given datum during a thread's execution is a load (rather than a store, or no access at all), because if the first access is a load, a later access by a predecessor thread to the same address may cause a RAW violation. As described in the previous section, this load history must be kept in the cache directory until the thread commits or is squashed, to detect violations afterwards. If the first access is not a load, no later access to the same address can violate the dependence. Whether the first access is a load is therefore one element in deciding the state.

Forwarded from another PU or from L2. As described in the baseline model (Chapter 2), the contents of the L1 cache must be rewound at squash time to the state before the thread started, and any speculative data must be discarded before the thread is re-executed. We therefore invalidate, at squash time, all data that was forwarded from predecessor threads in response to speculative loads issued during the thread. In this model, the forwarding history is recorded on a per-word basis, and all words whose forwarded history is logged are invalidated when the thread is squashed. There is no need to record a forwarded history when the requested data is fetched from the L2 cache, because all data in L2 is guaranteed to be correct at any point by the restriction that only the non-speculative thread may modify the L2 cache. This difference is therefore also taken into account in deciding the state of data.

Dirty or not dirty. As our model is based on MSI, we distinguish dirty states from non-dirty states. Moreover, in this model every PU whose speculation level is lower than that of the requesting PU is a candidate to forward data from its local L1 cache, as long as the requested data was modified by that PU during its thread and is still kept in its L1 cache. So, if the data exists in the local L1 cache but is not dirty, it cannot be forwarded.

(Figure 6: Cache states. Figure 7: State bits.)

When no PU can forward the data, the request is handed over to the shared L2 cache and the data is fetched from there. Focusing on these four criteria, we define eight bits per word, shown in Figure 7. First, to identify the state along the four criteria, four bits I, L, F and M (Invalid, Load, Forwarded, Modified) are defined. They represent, respectively, valid or invalid, whether the first access to the address in a thread was a load, whether the data was forwarded from another PU rather than from L2, and dirty or not dirty. The state encoding is given in Table 1.

Table 1 : State encoding

  State   Forwarded   Load   Modified   Invalid
  M           0         0       1          0
  S           0         0       0          0
  ML          0         1       1          0
  SL          0         1       0          0
  MFL         1         1       1          0
  SFL         1         1       0          0
  I           X         X       X          1

In addition, we provide four more bits, shown in Figure 7 as the Stale, S0, S1 and S2 bits; the S bits record, per PU, the store history used in violation detection, as described below. The Stale bit prevents an old version of data from being propagated to a successor thread. In this model, any valid data in the private L1 cache can be used by the thread that the thread control unit allocates next, so the next thread can inherit data from the thread formerly executed on the PU. The problem is that, once the non-speculative thread commits, a newly allocated thread could use an old version of data remaining in the L1 cache. Figure 8 shows an example: Th4 should use the newer version held in PU#2, but it would use the older version remaining in its own L1 cache. To avoid this situation, we define Delayed Invalidation (Dyinv), which invalidates the older versions in the predecessors' L1 caches when the predecessors commit. When a PU stores data, a delayed invalidation message and the stored address are broadcast on the bus to all predecessors, and every predecessor holding the corresponding data sets the Stale bit for it. Then, when the predecessor thread commits, all data in its L1 cache whose Stale bit is set is invalidated. This mechanism keeps threads from obtaining old versions of data. In the example, the data at address A is invalidated at the end of Th0, so Th4 cannot find it in its L1 cache, requests it on the bus, and receives it forwarded from the PU holding the newer version.

(Figure 8: Delayed Invalidation.)
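As an illustration, the eight per-word bits can be packed as follows. The grouping into a C++ bitfield is our assumption; the paper fixes the set of bits, not their layout.

```cpp
#include <cstdint>

// Per-word cache directory state (one possible packing of the 8 bits).
struct WordState {
    uint8_t invalid   : 1;  // I: word holds no valid data
    uint8_t load      : 1;  // L: first access in this thread was a load
    uint8_t forwarded : 1;  // F: value supplied by a predecessor, not L2
    uint8_t modified  : 1;  // M: dirty in the local L1, hence forwardable
    uint8_t stale     : 1;  // set by a delayed-invalidation message
    uint8_t store     : 3;  // S0-S2: which of the other PUs have stored
};

// Reading Table 1: e.g. MFL means the word was loaded first, arrived
// from a predecessor PU, and has since been modified locally.
inline bool is_mfl(WordState w) {
    return !w.invalid && w.load && w.forwarded && w.modified;
}
```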

4.4 Protocol Operations

The actual protocol operations are performed as follows.

Load. When a thread issues a load, the private L1 cache is searched. If the requested data is found in the local L1 cache, it is fetched from there and the operation ends. If the data does not exist, the controller first searches its cache directory for an empty or replaceable entry to hold the fetched data; if there is none, the PU stalls until an entry becomes available. The controller then puts the destination address on the bus and asserts the Rd line and the Requester PU line corresponding to its PU number, broadcasting the request. Since all other PUs snoop the bus, every PU perceives the request and checks whether it holds the requested data. A few cycles after the request is put on the bus, the line containing the requested word becomes ready on the data lines (see the Forward operation below), so the PU takes the whole line and stores it in its private L1 cache. At the same time, the appropriate store bits must be set to capture the store history of the word across the processors, because the store bits are used in violation detection: when a PU detects a violation, it searches the store history in its local cache directory, not in the directories of other PUs, so when a new entry is created in the local cache on a load, the history must be recorded in the local directory as well. In this model, all PUs holding the requested data with the dirty bit set submit their numbers on the Owner PU lines, so the appropriate store bits can be set by examining those lines. The flow chart of this operation is shown in Figure 9.

(Figure 9: Load operation.)
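The following is a minimal software sketch of this load path; the bus request, the owner probing and the L2 fallback are collapsed into a single fetch callback, and all names are ours rather than the paper's signal names.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

struct L1Word {
    uint64_t value = 0;
    bool dirty = false;
    bool load_bit = false;  // speculative-load history for RAW detection
};

class L1Controller {
public:
    explicit L1Controller(std::function<uint64_t(uint64_t)> bus_fetch)
        : bus_fetch_(std::move(bus_fetch)) {}

    uint64_t load(uint64_t addr) {
        auto it = l1_.find(addr);
        if (it == l1_.end()) {
            // Miss: in hardware, assert Rd plus our Requester PU line and
            // wait for the merged line on the data bus (predecessors or L2).
            it = l1_.emplace(addr, L1Word{bus_fetch_(addr)}).first;
        }
        // L is recorded only if the word was not stored first in this
        // thread, i.e. only when the first access was a load.
        if (!it->second.dirty) it->second.load_bit = true;
        return it->second.value;
    }

private:
    std::unordered_map<uint64_t, L1Word> l1_;
    std::function<uint64_t(uint64_t)> bus_fetch_;
};
```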

Store. In this system, every store operation is announced to all PUs; each PU snoops the announcement through its cache controller and decides how to act according to the information. When a PU wants to store data, it first checks whether a corresponding entry or a vacant entry exists in its local L1 cache. If such an entry is found, the data is written and the store information is broadcast on the bus by asserting the Invalidate line and the Requester PU line corresponding to the PU number, and by putting the address on the address bus. If there is no entry to write to, the controller removes a replaceable entry and creates a new one to hold the data; in this model, any data except words whose Dirty or Load bit is set can be removed, and if no entry is removable, the PU stalls until one becomes available. Finally, the controller sets the Dirty bit of the data. The flow chart of this operation is shown in Figure 10. The actions of the other PUs in response to this store are explained in the subsections below.

(Figure 10: Store operation.)
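Continuing the earlier L1Controller sketch, the local half of the store operation might look as follows; the broadcast hook stands in for driving the Invalidate and Requester PU bus lines, and the stall-on-no-entry case is only noted in a comment.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

struct Word { uint64_t value = 0; bool dirty = false; };

class StorePath {
public:
    StorePath(int pu_id, std::function<void(uint64_t, int)> broadcast)
        : pu_id_(pu_id), broadcast_store_(std::move(broadcast)) {}

    void store(uint64_t addr, uint64_t value) {
        Word& w = l1_[addr];  // reuse the entry or allocate a new one;
                              // a real controller may stall here when no
                              // entry with clear Dirty/Load bits is left
        w.value = value;
        w.dirty = true;       // the word stays local until commit
        broadcast_store_(addr, pu_id_);  // other PUs invalidate or mark stale
    }

private:
    int pu_id_;
    std::function<void(uint64_t, int)> broadcast_store_;
    std::unordered_map<uint64_t, Word> l1_;
};
```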

Forward. When a PU detects another PU's load request on the bus, it examines whether the requested data is present in its local L1 cache. For performance reasons, we decided to handle data on a per-word basis in the L1 cache: when a PU accesses a word that is not found in its L1 cache, all words of the line the accessed data belongs to are forwarded from other PUs or from the L2 cache. Because the forwarding mechanism is per word, each datum in the line is searched separately; the newest version among the predecessors is provided for each word, and the words are merged in the requester's L1 cache. To support this scheme, 32 Owner PU lines are arranged (one line per PU for each of the 8 words in a line). Since the check is done per word, the controller examines the cache directory of the line containing the requested word. If the line is present, the Dirty and Invalid bits of every word in the line are checked per word to find the data that can be forwarded. For each word found to be dirty and valid, the PU asserts the Owner PU lines corresponding to the word number (see Figure 11). The PU then waits a few cycles to ensure that the other PUs have also checked whether their private data can be forwarded. After that, it examines the Owner PU lines for each word to determine whether it may forward the word, that is, whether it holds the newest version among the predecessors, by comparing speculation levels with all other PUs that submitted their numbers on the Owner PU lines. If it finds that it may forward some words of the line, it puts them on the bus. The flow chart of this operation is shown in Figure 12.

(Figure 11: Example of owner probing on an 8-word line. With per-word Modified/Invalid bits of 1/1, 0/1, 1/0, 1/0, 1/1, 0/1, 1/0 and 0/0 for words 0 through 7, only words 2, 3 and 6 are both modified and valid, so the controller asserts its Owner PU lines for words 2, 3 and 6. Figure 12: Forwarding operation. Figure 13: Violation detection.)

Violation Detection. To detect violations, the information broadcast on the bus when another PU issues a store is used. If a PU observes the other PU's store by snooping the Invalidate line and the Requester PU lines, its cache controller sets the store bit corresponding to the requester PU in the cache directory entry of the word, if a relevant entry exists. The PU's subsequent action divides into two cases according to speculation level. First, if its speculation level is higher than that of the PU issuing the store, it should invalidate the corresponding data if present in its L1 cache (normal invalidation). Second, if its speculation level is lower than that of the storing PU, it should set the Stale bit of the corresponding data (the delayed invalidation described earlier). The choice between normal and delayed invalidation is therefore made by comparing speculation levels with the requester PU and by searching the store bits of the predecessors whose speculation level is higher than that of the requester. If delayed invalidation is to be done, the Stale bit is set and the operation ends. Otherwise, if normal invalidation is to be done, the dependence may have been violated during the thread, so the Load bit is checked: if it is set, the thread has violated a RAW dependence, and the thread and all of its successors are squashed. At the end, the Invalid bit is set to invalidate the word. The flow chart of this operation is shown in Figure 13.

Other Operations. When a thread commit is detected (this information comes from the thread control unit), the controller clears all store bits of the committed thread. On the other hand, when the thread control unit reports that another thread has been squashed, the controller checks whether that thread is a predecessor. If it is, the current thread must be squashed as well, because it may have consumed a value forwarded from the squashed thread. Of course, it may not have, but the hardware would become more complex if it had to distinguish these cases and identify exactly which threads to squash, so we decided to squash all successors of a squashed thread. In either case, the controller then clears all store bits of the squashed thread in its local directory. The flow charts of these operations are shown in Figures 14 and 15.

(Figure 14: Operation when another thread commits. Figure 15: Operation when another thread squashes.)

Finally, the operations at the commit and squash of the thread itself are performed as follows. When a thread commits, the controller searches for the words whose Stale bit is set and invalidates them; afterwards, the Stale bit and the Load bit are reset. On the other hand, when a thread is squashed, the words whose Forwarded bit is set are invalidated in the same way, and then the Forwarded bit and the Load bit are cleared.

5 Complexity Analysis

The complexity of our cache model is analyzed in terms of the hardware overhead incurred by the additional memory state bits and control logic. To quantify the overhead of the state bits, we estimate cache access time using the CACTI tool [13]. Since the state bits are kept in the cache directory, we compare the access times of the cache directory and of the cache itself, and discuss the possible impact on cache access latency. The delay of the additional control logic is estimated using the method of logical effort [12]. In this method, the delay incurred by a logic gate is the sum of a parasitic delay p and an effort delay f. The effort delay is further expressed as the product of the logical effort g, which describes how much bigger than an inverter a gate must be to drive loads as well as the inverter can, and the electrical effort h, which is the ratio of output to input capacitance of the gate:

    d = f + p = gh + p                                                    (1)

The delay D along an N-stage logic path is the sum of the delays through each stage:

    D = \sum_{i=1}^{N} f_i + \sum_{i=1}^{N} p_i                           (2)

It is known that D is minimized when the effort delay through each stage equals an optimal effort delay \hat{f}:

    \hat{D} = N \hat{f} + \sum_{i=1}^{N} p_i                              (3)

where \hat{f} is given by

    \hat{f} = F^{1/N} = \left( \prod_{i=1}^{N} g_i \prod_{i=1}^{N} b_i \prod_{i=1}^{N} h_i \right)^{1/N}   (4)

Here, b_i is the branching effort of stage i, which accounts for the fanout of the logic gate in that stage. To estimate the delay overhead of the control logic, we model the critical path of the logic and calculate \hat{D} along the path. As the measure of delay, we use the delay of a fanout-of-four (FO4) inverter; it is known that delay normalized by the FO4 metric holds roughly constant over a wide range of process technologies. To provide a concrete example, absolute delays at 90 nm technology are also shown, assuming 1 [FO4] = 36 ps.
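As a numerical check of equations (1) to (4), the short program below evaluates the two-stage owner-probing path of Figure 17 (path effort F = 56k/3, total parasitic delay P = 4, with k the load factor on the bus lines). For k = 1 it reproduces the 2.5 FO4 figure quoted in Section 5.2; packaging the computation as a program is our own addition.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double k = 1.0;
    const int    N = 2;                        // stages on the path
    const double F = 56.0 * k / 3.0;           // path effort G * B * H
    const double P = 4.0;                      // sum of parasitic delays
    const double f_opt = std::pow(F, 1.0 / N); // eq. (4): optimal stage effort
    const double D_min = N * f_opt + P;        // eq. (3): minimum path delay
    const double fo4   = D_min / 5.0;  // an FO4 inverter has d = gh+p = 4+1
    std::printf("f = %.2f, D = %.2f = %.1f FO4\n", f_opt, D_min, fo4);
    // prints: f = 4.32, D = 12.64 = 2.5 FO4
}
```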

5.1 Cache Directory Access Time

In this section, we discuss the overhead incurred by the cache state bits. The L1 cache configuration we assume is given in Table 2.

Table 2 : L1 cache parameters

  Size           32 kB
  Line size      64 bytes
  Associativity  2
  Tag            18 bits

We assume that the cache data, tag, and state bits are each kept in a separate memory structure, as illustrated in Figure 16; the three memory arrays can be accessed in parallel. The array that holds the state bits is denoted the cache directory. Our coherence protocol requires eight state bits per word. Assuming a 32 kB cache with 64-byte lines, this results in 4 kB of cache directory in total. Because additional state bits are required to maintain speculative state, and the states are kept on a per-word basis, the directory is much larger than one for a conventional coherence protocol: the simplest MSI protocol needs only three state bits per line, so its directory would be only 192 bytes in size. The arithmetic behind both figures is spelled out at the end of this section.

(Figure 16: Cache model. The address decoder feeds the tag array, the data array and the cache directory, each with its own drivers, sense amplifiers, multiplexers and output drivers; the tag comparator and the directory's control logic operate in parallel.)

(Table 3: Cache access time, estimated using CACTI. The access time of the cache directory is 11.4 [FO4], about 410 ps at 90 nm, which is less than the tag compare time.)

Since the directory access can be performed in parallel with the tag comparison, the directory is not expected to affect cache access latency. This estimation assumes a 32-bit address space; in a larger address space, the tag array would be larger and require more access time, so the cache directory would be even less likely to fall on the critical path.
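The directory sizes quoted above follow from simple arithmetic, checked by the snippet below (the program is ours; the parameters are those of Table 2).

```cpp
#include <cstdio>

int main() {
    const int cache_bytes = 32 * 1024;   // 32 kB L1 cache
    const int line_bytes  = 64;
    const int word_bytes  = 8;           // 8 words per 64-byte line

    const int words = cache_bytes / word_bytes;   // 4096 words
    const int lines = cache_bytes / line_bytes;   // 512 lines

    // Speculative protocol: 8 state bits per word -> 4 kB of directory.
    std::printf("speculative directory: %d bytes\n", words * 8 / 8);
    // Plain MSI: 3 state bits per line -> 192 bytes of directory.
    std::printf("MSI directory:         %d bytes\n", lines * 3 / 8);
}
```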

5.2 Control Logic Delay

We estimate the complexity of the control logic required for four operations of the cache controller: owner probing and data transfer for data forwarding between PUs, violation detection, and state transition. The critical paths of the logic blocks are illustrated in Figures 17 to 20, along with their estimated delays. The functions of these logic blocks are briefly described below.

Forward - Owner Probing. On receiving a forward request, each PU checks whether it has the requested line in its cache. It then checks the Modified bit of each word in the line and, if the bit is set, asserts the Owner PU line to claim ownership.

Forward - Transfer. Each PU examines which PUs have claimed ownership of the requested words. It compares the speculation levels of all the owner PUs and checks whether it is qualified to forward each word. If it is, it puts the word on the bus; if no PU is, the request is sent to the L2 cache.

Violation Detection. On receiving a broadcast that a store has been executed, each PU first identifies which PUs have stored to the corresponding word by checking its store bits. It then compares the speculation levels of those PUs with its own and with that of the latest store. Finally, it checks the Modified bit and Load bit of the word and detects a violation.

State Transition. The Forwarded, Load, Modified and Invalid bits are managed according to processor actions and messages from the bus.

(Figures 17 to 20: Critical paths of the owner-probing, transfer, violation-detection and state-transition logic, annotated with logical-effort parameters. With a unit bus load, k = 1, the logic delays evaluate to 2.5, 7.3, 7.16 and 9.8 FO4, respectively; for owner probing, violation detection and state transition, adding the 11.4 FO4 directory access gives total delays of 13.9, 18.6 and 21.2 FO4.)

Table 4 summarizes the estimated time needed for each operation. The operation times include the access time to the cache directory, previously estimated to be 11.4 [FO4] (Table 3).

Table 4 : Operation delay including access time to cache directory

  Operation             Delay [FO4]    Delay [ps]
  Owner Probing         13.9           500
  Transfer              7.3 (logic)    263
  Violation Detection   18.6           670
  State Transition      21.2           763

It can be seen that the directory access occupies a large part of the path delay. The results also indicate that the control logic may slightly extend the cache access latency shown in Table 3. Overall, however, it is estimated that the cache controller can operate at a reasonable cycle time of less than 20 [FO4].

6 Conclusion

A number of memory speculation mechanisms for speculative multithreading chip multiprocessors have been proposed, and many performance studies have been reported. However, it is still doubtful whether these mechanisms are complexity-effective to implement. In this paper, we performed a complexity analysis of a cache controller designed by extending an MSI controller to support thread-level memory speculation. We modeled and estimated the delay of the logic on the critical paths, and the area overhead of holding the additional control bits in the cache directory. We found that the increase in cache directory area over the original MSI cache directory is significant; this is a consequence of maintaining the speculative state of memory on a per-word rather than a per-line basis. For many protocol operations, the directory access occupies more than half of the total delay. This directory overhead is, however, smaller than the delay for accessing and comparing the cache tags. Since the cache directory access and the protocol logic operation can be performed in parallel with the cache tag access, a significant increase in critical path delay can be avoided. Finally, the results also indicate that the cache controller can be implemented at reasonable speed (a cycle time of less than 20 [FO4]), although in that case some operations must be divided into several cycles.

References

[1] Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael Chen and Kunle Olukotun, "The Stanford Hydra CMP", IEEE Micro, March-April 2000. Also presented at Hot Chips 11, August 1999.

[2] Lance Hammond, Mark Willey and Kunle Olukotun, "Data Speculation Support for a Chip Multiprocessor", Proceedings of the 8th International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1998.

[3] J. Gregory Steffan, Christopher B. Colohan and Todd C. Mowry, "Extending Cache Coherence to Support Thread-Level Speculation on a Single Chip and Beyond", Technical Report CMU-CS, School of Computer Science, Carnegie Mellon University, December 1998.

[4] J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai and Todd C. Mowry, "A Scalable Approach to Thread-Level Speculation", Proceedings of the 27th International Symposium on Computer Architecture (ISCA), 2000.

[5] M. Franklin and G. S. Sohi, "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References", IEEE Transactions on Computers, 45(5), 1996.

[6] Gurindar S. Sohi, Scott E. Breach and T. N. Vijaykumar, "Multiscalar Processors", Proceedings of the 22nd International Symposium on Computer Architecture (ISCA), 1995.

[7] Venkata Krishnan and Josep Torrellas, "Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor", Proceedings of the International Conference on Supercomputing (ICS), July 1998.

[8] Venkata Krishnan and Josep Torrellas, "A Chip-Multiprocessor Architecture with Speculative Multithreading", IEEE Transactions on Computers, 48(9), 1999.

[9] Jenn-Yuan Tsai, Jian Huang, Christoffer Amlo, David J. Lilja and Pen-Chung Yew, "The Superthreaded Processor Architecture", IEEE Transactions on Computers, 48(9), 1999.

[10] Sridhar Gopal, T. N. Vijaykumar, James E. Smith and Gurindar S. Sohi, "Speculative Versioning Cache", Proceedings of the 4th International Symposium on High-Performance Computer Architecture (HPCA), 1998.

[11] Ron Ho, Kenneth W. Mai and Mark A. Horowitz, "The Future of Wires", Proceedings of the IEEE, 89(4), April 2001.

[12] I. Sutherland, B. Sproull and D. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann, San Francisco, 1999.

[13] S. Wilton and N. Jouppi, "An Enhanced Access and Cycle Time Model for On-Chip Caches", Technical Report 93/5, Digital Western Research Laboratory, July 1994.


More information

Dynamic Performance Tuning for Speculative Threads

Dynamic Performance Tuning for Speculative Threads Dynamic Performance Tuning for Speculative Threads Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre, Ankit Tarkas, Wei-Chung Hsu, and Antonia Zhai Dept. of Computer Science and Engineering Dept. of

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007, Chapter 3 (CONT II) Instructor: Josep Torrellas CS433 Copyright J. Torrellas 1999,2001,2002,2007, 2013 1 Hardware-Based Speculation (Section 3.6) In multiple issue processors, stalls due to branches would

More information

Handout 3 Multiprocessor and thread level parallelism

Handout 3 Multiprocessor and thread level parallelism Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed

More information

Tradeoffs in Buffering Speculative Memory State for Thread-Level Speculation in Multiprocessors

Tradeoffs in Buffering Speculative Memory State for Thread-Level Speculation in Multiprocessors Tradeoffs in Buffering Speculative Memory State for Thread-Level Speculation in Multiprocessors MARíA JESÚS GARZARÁN University of Illinois at Urbana-Champaign MILOS PRVULOVIC Georgia Institute of Technology

More information

Portland State University ECE 587/687. Memory Ordering

Portland State University ECE 587/687. Memory Ordering Portland State University ECE 587/687 Memory Ordering Copyright by Alaa Alameldeen and Haitham Akkary 2012 Handling Memory Operations Review pipeline for out of order, superscalar processors To maximize

More information

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017 CS433 Homework 6 Assigned on 11/28/2017 Due in class on 12/12/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.

More information

NOW that the microprocessor industry has shifted its

NOW that the microprocessor industry has shifted its IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007 1 CMP Support for Large and Dependent Speculative Threads Christopher B. Colohan, Anastasia Ailamaki, Member, IEEE Computer

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

THE STANFORD HYDRA CMP

THE STANFORD HYDRA CMP THE STANFORD HYDRA CMP CHIP MULTIPROCESSORS OFFER AN ECONOMICAL, SCALABLE ARCHITECTURE FOR FUTURE MICROPROCESSORS. THREAD-LEVEL SPECULATION SUPPORT ALLOWS THEM TO SPEED UP PAST SOFTWARE. Lance Hammond

More information

Reconfigurable and Self-optimizing Multicore Architectures. Presented by: Naveen Sundarraj

Reconfigurable and Self-optimizing Multicore Architectures. Presented by: Naveen Sundarraj Reconfigurable and Self-optimizing Multicore Architectures Presented by: Naveen Sundarraj 1 11/9/2012 OUTLINE Introduction Motivation Reconfiguration Performance evaluation Reconfiguration Self-optimization

More information

Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization li for Multiprocessors

Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization li for Multiprocessors Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization li for Multiprocessors Marcelo Cintra and Josep Torrellas University of Edinburgh http://www.dcs.ed.ac.uk/home/mc

More information

Speculative Locks for Concurrent Execution of Critical Sections in Shared-Memory Multiprocessors

Speculative Locks for Concurrent Execution of Critical Sections in Shared-Memory Multiprocessors Wshp. on Memory Performance Issues, Intl. Symp. on Computer Architecture, June 2001. Speculative Locks for Concurrent Execution of Critical Sections in Shared-Memory Multiprocessors José F. Martínez and

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

Computer Architecture

Computer Architecture 18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

CS 351 Final Exam Solutions

CS 351 Final Exam Solutions CS 351 Final Exam Solutions Notes: You must explain your answers to receive partial credit. You will lose points for incorrect extraneous information, even if the answer is otherwise correct. Question

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017 CS433 Homework 6 Assigned on 11/28/2017 Due in class on 12/12/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.

More information

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Concurrent/Parallel Processing

Concurrent/Parallel Processing Concurrent/Parallel Processing David May: April 9, 2014 Introduction The idea of using a collection of interconnected processing devices is not new. Before the emergence of the modern stored program computer,

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

Tolerating Dependences Between Large Speculative Threads Via Sub-Threads

Tolerating Dependences Between Large Speculative Threads Via Sub-Threads Tolerating Dependences Between Large Speculative Threads Via Sub-Threads Christopher B. Colohan, Anastassia Ailamaki, J. Gregory Steffan, and Todd C. Mowry School of Computer Science Carnegie Mellon University

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Homework 6. BTW, This is your last homework. Assigned today, Tuesday, April 10 Due time: 11:59PM on Monday, April 23. CSCI 402: Computer Architectures

Homework 6. BTW, This is your last homework. Assigned today, Tuesday, April 10 Due time: 11:59PM on Monday, April 23. CSCI 402: Computer Architectures Homework 6 BTW, This is your last homework 5.1.1-5.1.3 5.2.1-5.2.2 5.3.1-5.3.5 5.4.1-5.4.2 5.6.1-5.6.5 5.12.1 Assigned today, Tuesday, April 10 Due time: 11:59PM on Monday, April 23 1 CSCI 402: Computer

More information

Portland State University ECE 588/688. Cache Coherence Protocols

Portland State University ECE 588/688. Cache Coherence Protocols Portland State University ECE 588/688 Cache Coherence Protocols Copyright by Alaa Alameldeen 2018 Conditions for Cache Coherence Program Order. A read by processor P to location A that follows a write

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 24 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 2: More Multiprocessors Computation Taxonomy SISD SIMD MISD MIMD ILP Vectors, MM-ISAs Shared Memory

More information

Chapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1>

Chapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1> Chapter 7 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 7 Chapter 7 :: Topics Introduction (done) Performance Analysis (done) Single-Cycle Processor

More information

12 Cache-Organization 1

12 Cache-Organization 1 12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty

More information

15-740/ Computer Architecture

15-740/ Computer Architecture 15-740/18-740 Computer Architecture Lecture 16: Runahead and OoO Wrap-Up Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/17/2011 Review Set 9 Due this Wednesday (October 19) Wilkes, Slave Memories

More information

CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers

CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers This was a 180-minute open-book test. You were to answer five of the six questions. Each question was worth 20 points.

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Chapter 06: Instruction Pipelining and Parallel Processing

Chapter 06: Instruction Pipelining and Parallel Processing Chapter 06: Instruction Pipelining and Parallel Processing Lesson 09: Superscalar Processors and Parallel Computer Systems Objective To understand parallel pipelines and multiple execution units Instruction

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Final Review Shuai Wang Department of Computer Science and Technology Nanjing University Computer Architecture Computer architecture, like other architecture, is the art

More information

114 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 2, FEBRUARY 2004

114 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 2, FEBRUARY 2004 114 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 2, FEBRUARY 2004 Thread Partitioning and Value Prediction for Exploiting Speculative Thread-Level Parallelism Pedro Marcuello, Antonio González, Member,

More information

Compiler Techniques for Energy Saving in Instruction Caches. of Speculative Parallel Microarchitectures.

Compiler Techniques for Energy Saving in Instruction Caches. of Speculative Parallel Microarchitectures. Compiler Techniques for Energy Saving in Instruction Caches of Speculative Parallel Microarchitectures Seon Wook Kim Rudolf Eigenmann School of Electrical and Computer Engineering Purdue University, West

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor

Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor Ying Chen, Resit Sendag, and David J Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

15-740/ Computer Architecture Lecture 5: Precise Exceptions. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 5: Precise Exceptions. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 5: Precise Exceptions Prof. Onur Mutlu Carnegie Mellon University Last Time Performance Metrics Amdahl s Law Single-cycle, multi-cycle machines Pipelining Stalls

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information