Complexity Analysis of Cache Mechanisms for Speculative Multithreading Chip Multiprocessors


Yoshimitsu Yanagawa, Luong Dinh Hung, Chitaka Iwama, Niko Demus Barli, Shuichi Sakai and Hidehiko Tanaka

1 Introduction

1.1 Background

Chip Multiprocessors

With rapidly improving semiconductor technology, we can integrate a large number of transistors on a single silicon chip. To obtain higher performance, superscalar architectures, which exploit the Instruction Level Parallelism (ILP) in a program, are widely used today. Thanks to the enormous number of transistors available on a chip, superscalar processors with wider issue widths and larger caches have been built for higher performance. However, because the ILP in a program is limited, it is difficult to increase performance simply by parallelizing more instructions. In addition, the more complex a superscalar processor becomes, the more expensive its development. To address this problem, another architecture, the Chip Multiprocessor (CMP), has been proposed. In a CMP, a superscalar processor core is duplicated across the chip as multiple Processing Units (PUs). By executing multiple threads simultaneously on these PUs, a CMP exploits Thread Level Parallelism (TLP) and therefore has the potential to achieve good performance on multithreaded programs. A CMP has another advantage besides this performance increase: each processor core in a CMP need not be as complex as a state-of-the-art superscalar processor to deliver the same performance, because it can exploit TLP in addition to ILP at execution time. Also, because the core is duplicated, a CMP is easier to design than a complex single-core superscalar processor. This simplicity shortens the design period and cuts cost.

Speculative Multithreading

Except for numerical programs, most general-purpose codes are integer programs. In general, integer programs have little potential to be parallelized, and cutting such a program into threads may yield dependences among the threads. Because of these dependences, the interdependent threads cannot be executed simultaneously, which makes it difficult to run an integer program efficiently on a CMP. Speculative multithreading is a technique that executes multiple threads at the same time even if dependences exist among them. It extracts more parallelism, even from integer programs, by ignoring the dependences, so that many kinds of programs can be executed on a CMP with reasonable efficiency using multiple cores simultaneously. One important point in speculative multithreading is violation detection. Threads are executed all at once without regard for inter-thread dependences, so violations may occur; if a dependence violation occurs, the execution result may be incorrect. It is therefore indispensable to detect violations and handle them properly afterwards.

Memory Speculation

Memory speculation is one of the key components of speculative multithreading. An example is shown in Fig. 1(a). The original sequential code is divided into two threads (Th0 and Th1), which are allocated to PU2 and PU3, respectively. According to the original code they are cut from, Th0 should be executed before Th1. When Th0 is to be executed earlier than Th1 in the original sequence, we say that the speculation level of Th1 is higher than that of Th0.

There is a store instruction in Th0 whose destination address is p, and a load instruction in Th1 whose address is q. These destination addresses cannot be determined at compile time, so the dependence between the two threads is ambiguous. In speculative multithreading, as depicted in Fig. 1(a), the two threads are executed at the same time. If address p equals q, the store and the load are executed in reverse order on the time line; the two instructions may be dependent, but the load is executed speculatively. This is memory speculation.

(Figure 1: Speculative multithreading. (a) Memory speculation: Th0 on PU2 stores to p while Th1 on PU3 loads from q; if p = q, a violation occurs. (b) Thread squash and re-execution: Th1 is squashed and re-executed.)

If the destination addresses are the same (p = q), Th1 may load incorrect data and return an irrelevant result (Fig. 1(b)). This is called a RAW violation. Violations cannot be avoided completely in speculative multithreading, so when a violation happens, some treatment, such as squashing and re-executing the thread, must be performed for the code to execute properly.

1.2 Research Objectives

Recently, many speculative multithreading architectures for CMPs have been proposed, such as Hydra [1], STAMPede [3] and Multiscalar [6], and many of them support memory speculation. In these architecture models, a controller supporting memory speculation sits between a processing unit and the cache memory. Since all data accesses from each processing unit pass through this controller, it may limit the performance gain brought by speculative multithreading. For example, if the controller is located between a processing unit and the processor's L2 cache, the controller latency is added to the L2 cache latency; since L2 latency has a large impact on the execution time of a program on the CMP, minimizing the controller latency is indispensable. This paper proposes a cache coherency protocol for speculative memory accesses on a CMP and presents a hardware design for a controller that supports this protocol. We then evaluate the hardware cost, namely the delay and area of the controller, to investigate its influence on CMP performance. The rest of the paper is organized as follows. Chapter 2 introduces the basic CMP architecture model assumed in this paper and defines terms related to speculative multithreading. Chapter 3 describes related work on hardware/software support for memory speculation. Chapter 4 presents the cache coherency protocol we propose, and Chapter 5 evaluates the complexity of the controller hardware supporting it. Finally, Chapter 6 concludes the paper.

2 Baseline Model

2.1 Execution Model

In the model assumed in this paper, four superscalar processor cores are placed on one chip as PUs (Figure 2). Each PU has a private L1 cache, and all data the core accesses is provided via this local cache. All PUs are connected to a bus and share an L2 cache; any data transfer between the PUs and the L2 cache goes through this bus. To support speculative multithreading, the model contains a thread control unit that manages speculative thread execution: it dynamically allocates threads to PUs and handles thread commit and squash.

(Figure 2: Overall structure.)

The model supports data speculation. Although data speculation has two forms, register speculation and memory speculation, only the latter is supported in this model; instead of register speculation, register values are synchronized among the PUs by a Register Synchronization Unit. To execute a program by speculative multithreading on this architecture, the original sequential code is cut into threads of appropriate size by a compiler.

Some ambiguous memory dependences may exist at this point, but all dependences among threads are ignored and the code is divided into threads anyway. There is a unique order among these threads, based on the original execution sequence. Focusing on a particular thread while multiple threads execute simultaneously, we call a thread that should be executed earlier in the original program sequence a predecessor thread, and a thread that should be executed later a successor thread. If a thread has no predecessor at a given point, it is non-speculative, and all of its successors are speculative. We also define a speculation level, corresponding to the number of predecessor threads: the more predecessors a thread has, the higher its speculation level. For example, the non-speculative thread's speculation level is 0, and a thread with two predecessors has level 2. At execution time, the thread control unit predicts, using branch prediction, which threads should be executed and allocates them as follows. The non-speculative thread, whose level is 0, is assigned to one PU, and the remaining PUs are given speculative threads with levels 1, 2 and 3, respectively. When the non-speculative thread finishes, the thread that should actually execute next becomes known. If it does not correspond to the thread that was executed, that is, if the thread speculation failed, the thread is discarded and all modifications made to the L1 cache contents during its execution are cancelled; we say the thread is squashed. If the prediction was correct, the thread with level 1 becomes non-speculative (its level is decremented to 0) and the levels of the other successor threads are likewise decremented. A new speculative thread with level 3 is then assigned to the PU that finished the non-speculative thread. If a speculative thread finishes before all of its predecessors, it stalls until all predecessors have executed and committed. This is the basic execution cycle, repeated until the program finishes.

2.2 Memory Access Model

The memory access model described in this paper is based on the MSI protocol, with special mechanisms added to support memory speculation. Memory accesses follow the principles below; a minimal sketch of the resulting ordering rules is given after the list.

Violation Detection. To execute a program correctly without violating dependences, violations are detected by additional hardware.

Thread Squash and Re-execution. When a data dependence violation is detected, the thread that violated the dependence and all of its successor threads are squashed and re-executed.

Memory Access Constraints. In this model, no speculative PU may modify the L2 cache; only the non-speculative one can.

Forwarding. To provide a speculative PU with the newest data, data is forwarded from other PUs whenever possible. If no predecessor PU has the relevant data, it is read from the L2 cache.
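For illustration, the following sketch shows how a thread control unit might track speculation levels and enforce the constraint that only the non-speculative thread writes the L2 cache. This is a minimal software model, not the paper's hardware interface; all names and the encoding are our assumptions.

```cpp
#include <array>

// One slot per processing unit in the four-PU baseline model.
struct ThreadSlot {
    int  spec_level;  // 0 = non-speculative, 1..3 = speculative
    bool running;
};

struct ThreadControlUnit {
    std::array<ThreadSlot, 4> pu;

    // Memory access constraint: only the non-speculative thread
    // may write back to the shared L2 cache.
    bool may_write_l2(int pu_id) const {
        return pu[pu_id].running && pu[pu_id].spec_level == 0;
    }

    // On a successful commit, every surviving thread moves one step
    // closer to non-speculative, and the freed PU receives a fresh
    // thread at the highest speculation level (3 in this model).
    void commit_nonspeculative(int committed_pu) {
        for (ThreadSlot& slot : pu)
            if (slot.running) slot.spec_level -= 1;
        pu[committed_pu] = ThreadSlot{3, true};
    }
};
```

Because exactly one PU holds level 0 at any time, `may_write_l2` serializes all L2 updates in original program order, which is the property the memory access constraint relies on.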

3 Related Work

Many models and hardware/software supports for thread-level memory speculation have been proposed [2, 4, 5, 8-10]. Among them, [2, 4, 8] provide details on how to implement the mechanism in CMP caches. Hydra [2] proposed an extension of the cache directory to handle the speculative state of cache lines. The speculative state is managed on a per-line basis; a write-through policy is employed, and a speculation buffer is attached to each cache to hold speculatively stored values. STAMPede [4] extends the traditional MESI protocol with additional states to support memory speculation. Similar to Hydra, speculative state is managed per line, but STAMPede differs in that it requires no special buffer to hold speculative memory values. In [8], memory speculation is performed using a centralized table called the Memory Disambiguation Table (MDT). The MDT is located between the private L1 caches and the shared L2 cache; it records the loads and stores executed on the L1 caches and manages memory state on a per-word basis. Since the number of MDT entries is limited, memory operations of speculative threads must stall when the table is full. This work takes a middle course between the latter two approaches. Rather than using a centralized table, we extend the MSI protocol to support memory speculation, as STAMPede does; however, we choose to manage memory state on a per-word basis as in [8], because previous work has shown that maintaining state on a per-line basis results in poor performance [8].

4 Cache Coherency Protocol

4.1 Basic Concepts

In supporting speculative memory accesses, the most important issue is how to handle dependence violations among threads. As long as violations are handled properly, the only remaining task is thread control, such as branch prediction and thread allocation. The critical points in handling violations fall into the three categories discussed below.

Avoiding Unnecessary Violations. It is natural to reduce the number of violations before considering how to treat them. Permitting multiple versions of the same data to exist at the same time works against WAW and WAR violations. To avoid them, it is necessary to keep the order of memory writes the same as in the original program sequence, which is feasible by holding multiple versions of data separately and writing them back to memory in the correct order. Our model uses the private L1 caches, whose contents are invisible to other PUs except when data forwarding is needed, and the shared L2 cache, whose contents are always visible to every PU. All speculative versions of data stored by the PUs are held in their private L1 caches, and only the non-speculative version may be written back to the L2 cache; that is, a PU can write back its data to L2 once the thread it executes becomes non-speculative. Because the PUs have, as noted above, distinct speculation levels and only one PU can be non-speculative at a time, this restriction suffices to order the write-backs. No speculative version ever exists in the L2 cache, so the order of memory writes is guaranteed to match the original code, and WAW violations cannot occur. In addition, we require that any data a PU needs be provided only by the L1 cache of a predecessor PU or by the shared L2 cache, never by a successor.
This ensures that no WAR violation occurs, because a PU can never obtain data from a thread that should execute later in the original sequence. With these schemes, WAW and WAR violations are eliminated completely. It is difficult, however, to eliminate RAW violations absolutely. To reduce their number, we forward data from the nearest predecessor PU that has modified the requested data, providing data that is as new as possible to the requesting PU; a sketch of this selection rule follows. Even so, RAW violations remain possible, because forwarding does not guarantee that no further modification will occur in the future.
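The forwarding-source rule just described can be made concrete with a small sketch. This is an illustrative software model under our own naming; the hardware realizes the same comparison with the Owner PU lines and speculation-level comparators described in Section 4.2.

```cpp
#include <array>
#include <optional>

// Per-PU view of one word: whether a copy exists and whether it was
// modified (dirty) during the PU's current thread.
struct WordCopy {
    bool present;
    bool dirty;
};

// Pick the forwarding source for a requested word: among the
// predecessors (lower speculation level than the requester) holding a
// dirty copy, the one closest in program order to the requester wins.
// std::nullopt means no predecessor can forward, so read from L2.
std::optional<int> pick_forwarder(const std::array<WordCopy, 4>& copies,
                                  const std::array<int, 4>& spec_level,
                                  int requester_pu) {
    std::optional<int> best;
    for (int pu = 0; pu < 4; ++pu) {
        if (pu == requester_pu) continue;
        if (spec_level[pu] >= spec_level[requester_pu]) continue;  // successor
        if (!copies[pu].present || !copies[pu].dirty) continue;    // nothing to give
        if (!best || spec_level[pu] > spec_level[*best]) best = pu;
    }
    return best;
}
```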

Violation Detection. Since RAW violations cannot be eliminated, they must be detected to avoid incorrect execution results. In this model, RAW violation detection is achieved by holding the history of speculative loads issued by each PU. Every PU has its own table containing the speculative load history and other information about the data in its private L1 cache; this information is associated with every datum in the L1 cache. A load may be executed at any time by any PU, and if data is loaded speculatively during a thread's execution, that fact is saved in the history table. The information is kept in the cache directory until the thread commits or is squashed, so when some PU issues a store, every successor PU can detect a RAW violation dynamically by consulting this information to see whether it has previously loaded data from the same address within the same thread. If the history shows that the data was loaded speculatively, the RAW dependence has been violated. Since these history entries have a one-to-one relationship with the data in the L1 cache, they would be lost when the corresponding data is removed on a cache replacement; because the history must be kept until the thread finishes, such data is excluded from the candidates for replacement. One important point to bear in mind is that a store by a predecessor PU to an address with a matching load-history entry does not always mean a RAW violation: when the speculatively loaded data was forwarded from one predecessor PU, and the store is then executed by another predecessor whose speculation level is lower, this sequence of operations does not violate the RAW dependence (Figure 3).

(Figure 3: An example of speculation success. Th0 stores to p, Th1 stores to q, and Th2 loads r, which is forwarded from Th1. Even if p = q = r, the value stored by Th1 has already been forwarded to Th2, so the store to p does not cause a RAW violation.)

Execution Recovery. To execute program code correctly under speculative multithreading, proper action must be taken when a RAW violation is detected. A simple and safe treatment is to squash the thread that violated the RAW dependence, roll all data in its private L1 cache back to the state at the beginning of execution, and then re-execute the thread. In this model, all successors of the violating thread are squashed as well, because they may have consumed data that was modified by the violating PU and forwarded from it. Of course, some successors may not have used such data and could in principle be spared, but squashing threads selectively requires additional complex hardware; we therefore decided to squash the violating thread and all of its successors and re-execute them. To recover the state of all data in the squashed PU, all state bits are reset and the data modified during the thread's execution is invalidated. These operations are performed at squash time.

4.2 Organization

We integrated a hardware mechanism into each processing unit's cache controller to support thread-level memory speculation. The organization of the controller is shown in Figure 4. The controller mainly comprises four units: a state controller that manages the cache states, a cache directory that holds the cache states and speculation history, a violation detector that detects memory violations, and a data forwarder that controls data forwarding between PUs. The controller manages the state of memory by snooping memory events broadcast on a shared memory bus. This bus includes a data bus and an address bus, together with a number of control lines necessary to maintain the consistency of memory (Figure 5).
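To make the violation detector's decision concrete, the following sketch combines the two invalidation flavours and the RAW check that the protocol operations in Section 4.4 describe in detail. The function and parameter names are our assumptions; a lower speculation level means earlier in program order.

```cpp
#include <array>

enum class Action { None, DelayedInvalidate, Invalidate, SquashRAW };

// Decide what to do when another PU's store to a word we hold is snooped
// on the bus. store_bits records which PUs have stored to this word.
Action on_snoop_store(int my_level, int storer_level, bool have_copy,
                      bool load_bit,
                      const std::array<bool, 4>& store_bits,
                      const std::array<int, 4>& spec_level) {
    if (!have_copy) return Action::None;
    if (storer_level > my_level)          // store by a successor: our copy
        return Action::DelayedInvalidate; // goes stale, dies at our commit
    // Store by a predecessor. If a closer predecessor (level between the
    // storer's and ours) already stored, our forwarded value came from
    // that closer version and is still correct (the Figure 3 case).
    for (int pu = 0; pu < 4; ++pu)
        if (store_bits[pu] && spec_level[pu] > storer_level &&
            spec_level[pu] < my_level)
            return Action::DelayedInvalidate;
    // Otherwise our copy is outdated: invalidate it, and if our first
    // access to the word in this thread was a load, a RAW dependence
    // was violated and the thread (plus successors) must be squashed.
    return load_bit ? Action::SquashRAW : Action::Invalidate;
}
```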

(Figure 4: Cache controller organization, showing the superscalar core, the L1 cache, the cache directory, the state controller, the speculation level register and comparators, the violation detector, and the forwarder, attached to the shared bus and the thread control unit. Figure 5: Bus model, with data and address lines, Rd/Wr/Invalidate control lines, and per-PU Requester PU and Owner PU lines.)

4.3 Finite State Machine

To support the scheme above, we define seven states: Modified (M), Shared (S), Modified-Loaded (ML), Shared-Loaded (SL), Modified-Forwarded-Loaded (MFL), Shared-Forwarded-Loaded (SFL) and Invalid (I). The classification of these states is shown in Figure 6. The states are defined per word in the private L1 cache of each PU, and they are categorized according to the four criteria below.

Valid or invalid.

First access to the data. It matters whether the first access to a given datum during a thread's execution is a load (rather than a store, or no access at all), because if the first access is a load, a later access by a predecessor thread to the same address may cause a RAW violation. As described in the previous section, this load history must be kept in the cache directory until the thread commits or is squashed, to detect violations afterwards. If the first access is not a load, no later access to the same address can violate the dependence. Whether the first access is a load is therefore one element in deciding the state.

Forwarded from another PU or from L2. As described in the baseline model (Chapter 2), the contents of the L1 cache must be rewound at squash time to the state before the thread started, and any speculative data must be discarded before the thread is re-executed. We therefore invalidate, at squash time, all data that was forwarded from predecessor threads in response to speculative loads issued during the thread. In this model, the forwarding history is recorded on a per-word basis, and all words whose forwarded history is logged are invalidated when the thread is squashed. There is no need to record a forwarded history when the requested data is fetched from the L2 cache, because all data in L2 is guaranteed to be correct at any point by the restriction that only the non-speculative thread may modify the L2 cache. This difference is therefore also taken into account in deciding the state of data.

Dirty or not dirty. As our model is based on MSI, we distinguish dirty states from non-dirty states. Moreover, in this model every PU whose speculation level is lower than that of the requesting PU is a candidate to forward data from its local L1 cache, as long as the requested data was modified by that PU during its thread and is still kept in its L1 cache. So, if the data exists in the local L1 cache but is not dirty, it cannot be forwarded.

(Figure 6: Cache states. Figure 7: State bits.)

When no PU can forward the data, the request is handed over to the shared L2 cache and the data is fetched from there. Focusing on these four criteria, we define eight bits per word, shown in Figure 7. First, to identify the state along the four criteria, four bits I, L, F and M (Invalid, Load, Forwarded, Modified) are defined. They represent, respectively, valid or invalid, whether the first access to the address in a thread was a load, whether the data was forwarded from another PU rather than from L2, and dirty or not dirty. The state encoding is given in Table 1.

Table 1 : State encoding

  State   Forwarded   Load   Modified   Invalid
  M           0         0       1          0
  S           0         0       0          0
  ML          0         1       1          0
  SL          0         1       0          0
  MFL         1         1       1          0
  SFL         1         1       0          0
  I           X         X       X          1

In addition, we provide four more bits, shown in Figure 7 as the Stale, S0, S1 and S2 bits; the S bits record, per PU, the store history used in violation detection, as described below. The Stale bit prevents an old version of data from being propagated to a successor thread. In this model, any valid data in the private L1 cache can be used by the thread that the thread control unit allocates next, so the next thread can inherit data from the thread formerly executed on the PU. The problem is that, once the non-speculative thread commits, a newly allocated thread could use an old version of data remaining in the L1 cache. Figure 8 shows an example: Th4 should use the newer version held in PU#2, but it would use the older version remaining in its own L1 cache. To avoid this situation, we define Delayed Invalidation (Dyinv), which invalidates the older versions in the predecessors' L1 caches when the predecessors commit. When a PU stores data, a delayed invalidation message and the stored address are broadcast on the bus to all predecessors, and every predecessor holding the corresponding data sets the Stale bit for it. Then, when the predecessor thread commits, all data in its L1 cache whose Stale bit is set is invalidated. This mechanism keeps threads from obtaining old versions of data. In the example, the data at address A is invalidated at the end of Th0, so Th4 cannot find it in its L1 cache, requests it on the bus, and receives it forwarded from the PU holding the newer version.

(Figure 8: Delayed Invalidation.)
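As an illustration, the eight per-word bits can be packed as follows. The grouping into a C++ bitfield is our assumption; the paper fixes the set of bits, not their layout.

```cpp
#include <cstdint>

// Per-word cache directory state (one possible packing of the 8 bits).
struct WordState {
    uint8_t invalid   : 1;  // I: word holds no valid data
    uint8_t load      : 1;  // L: first access in this thread was a load
    uint8_t forwarded : 1;  // F: value supplied by a predecessor, not L2
    uint8_t modified  : 1;  // M: dirty in the local L1, hence forwardable
    uint8_t stale     : 1;  // set by a delayed-invalidation message
    uint8_t store     : 3;  // S0-S2: which of the other PUs have stored
};

// Reading Table 1: e.g. MFL means the word was loaded first, arrived
// from a predecessor PU, and has since been modified locally.
inline bool is_mfl(WordState w) {
    return !w.invalid && w.load && w.forwarded && w.modified;
}
```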

4.4 Protocol Operations

The actual protocol operations are performed as follows.

Load. When a thread issues a load, the private L1 cache is searched. If the requested data is found in the local L1 cache, it is fetched from there and the operation ends. If the data does not exist, the controller first searches its cache directory for an empty or replaceable entry to hold the fetched data; if there is none, the PU stalls until an entry becomes available. The controller then puts the destination address on the bus and asserts the Rd line and the Requester PU line corresponding to its PU number, broadcasting the request. Since all other PUs snoop the bus, every PU perceives the request and checks whether it holds the requested data. A few cycles after the request is put on the bus, the line containing the requested word becomes ready on the data lines (see the Forward operation below), so the PU takes the whole line and stores it in its private L1 cache. At the same time, the appropriate store bits must be set to capture the store history of the word across the processors, because the store bits are used in violation detection: when a PU detects a violation, it searches the store history in its local cache directory, not in the directories of other PUs, so when a new entry is created in the local cache on a load, the history must be recorded in the local directory as well. In this model, all PUs holding the requested data with the dirty bit set submit their numbers on the Owner PU lines, so the appropriate store bits can be set by examining those lines. The flow chart of this operation is shown in Figure 9.

(Figure 9: Load operation.)
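The following is a minimal software sketch of this load path; the bus request, the owner probing and the L2 fallback are collapsed into a single fetch callback, and all names are ours rather than the paper's signal names.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

struct L1Word {
    uint64_t value = 0;
    bool dirty = false;
    bool load_bit = false;  // speculative-load history for RAW detection
};

class L1Controller {
public:
    explicit L1Controller(std::function<uint64_t(uint64_t)> bus_fetch)
        : bus_fetch_(std::move(bus_fetch)) {}

    uint64_t load(uint64_t addr) {
        auto it = l1_.find(addr);
        if (it == l1_.end()) {
            // Miss: in hardware, assert Rd plus our Requester PU line and
            // wait for the merged line on the data bus (predecessors or L2).
            it = l1_.emplace(addr, L1Word{bus_fetch_(addr)}).first;
        }
        // L is recorded only if the word was not stored first in this
        // thread, i.e. only when the first access was a load.
        if (!it->second.dirty) it->second.load_bit = true;
        return it->second.value;
    }

private:
    std::unordered_map<uint64_t, L1Word> l1_;
    std::function<uint64_t(uint64_t)> bus_fetch_;
};
```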

Store. In this system, every store operation is announced to all PUs; each PU snoops the announcement through its cache controller and decides how to act according to the information. When a PU wants to store data, it first checks whether a corresponding entry or a vacant entry exists in its local L1 cache. If such an entry is found, the data is written and the store information is broadcast on the bus by asserting the Invalidate line and the Requester PU line corresponding to the PU number, and by putting the address on the address bus. If there is no entry to write to, the controller removes a replaceable entry and creates a new one to hold the data; in this model, any data except words whose Dirty or Load bit is set can be removed, and if no entry is removable, the PU stalls until one becomes available. Finally, the controller sets the Dirty bit of the data. The flow chart of this operation is shown in Figure 10. The actions of the other PUs in response to this store are explained in the subsections below.

(Figure 10: Store operation.)
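Continuing the earlier L1Controller sketch, the local half of the store operation might look as follows; the broadcast hook stands in for driving the Invalidate and Requester PU bus lines, and the stall-on-no-entry case is only noted in a comment.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

struct Word { uint64_t value = 0; bool dirty = false; };

class StorePath {
public:
    StorePath(int pu_id, std::function<void(uint64_t, int)> broadcast)
        : pu_id_(pu_id), broadcast_store_(std::move(broadcast)) {}

    void store(uint64_t addr, uint64_t value) {
        Word& w = l1_[addr];  // reuse the entry or allocate a new one;
                              // a real controller may stall here when no
                              // entry with clear Dirty/Load bits is left
        w.value = value;
        w.dirty = true;       // the word stays local until commit
        broadcast_store_(addr, pu_id_);  // other PUs invalidate or mark stale
    }

private:
    int pu_id_;
    std::function<void(uint64_t, int)> broadcast_store_;
    std::unordered_map<uint64_t, Word> l1_;
};
```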

Forward. When a PU detects another PU's load request on the bus, it examines whether the requested data is present in its local L1 cache. For performance reasons, we decided to handle data on a per-word basis in the L1 cache: when a PU accesses a word that is not found in its L1 cache, all words of the line the accessed data belongs to are forwarded from other PUs or from the L2 cache. Because the forwarding mechanism is per word, each datum in the line is searched separately; the newest version among the predecessors is provided for each word, and the words are merged in the requester's L1 cache. To support this scheme, 32 Owner PU lines are arranged (one line per PU for each of the 8 words in a line). Since the check is done per word, the controller examines the cache directory of the line containing the requested word. If the line is present, the Dirty and Invalid bits of every word in the line are checked per word to find the data that can be forwarded. For each word found to be dirty and valid, the PU asserts the Owner PU lines corresponding to the word number (see Figure 11). The PU then waits a few cycles to ensure that the other PUs have also checked whether their private data can be forwarded. After that, it examines the Owner PU lines for each word to determine whether it may forward the word, that is, whether it holds the newest version among the predecessors, by comparing speculation levels with all other PUs that submitted their numbers on the Owner PU lines. If it finds that it may forward some words of the line, it puts them on the bus. The flow chart of this operation is shown in Figure 12.

(Figure 11: Example of owner probing on an 8-word line. With per-word Modified/Invalid bits of 1/1, 0/1, 1/0, 1/0, 1/1, 0/1, 1/0 and 0/0 for words 0 through 7, only words 2, 3 and 6 are both modified and valid, so the controller asserts its Owner PU lines for words 2, 3 and 6. Figure 12: Forwarding operation. Figure 13: Violation detection.)

Violation Detection. To detect violations, the information broadcast on the bus when another PU issues a store is used. If a PU observes the other PU's store by snooping the Invalidate line and the Requester PU lines, its cache controller sets the store bit corresponding to the requester PU in the cache directory entry of the word, if a relevant entry exists. The PU's subsequent action divides into two cases according to speculation level. First, if its speculation level is higher than that of the PU issuing the store, it should invalidate the corresponding data if present in its L1 cache (normal invalidation). Second, if its speculation level is lower than that of the storing PU, it should set the Stale bit of the corresponding data (the delayed invalidation described earlier). The choice between normal and delayed invalidation is therefore made by comparing speculation levels with the requester PU and by searching the store bits of the predecessors whose speculation level is higher than that of the requester. If delayed invalidation is to be done, the Stale bit is set and the operation ends. Otherwise, if normal invalidation is to be done, the dependence may have been violated during the thread, so the Load bit is checked: if it is set, the thread has violated a RAW dependence, and the thread and all of its successors are squashed. At the end, the Invalid bit is set to invalidate the word. The flow chart of this operation is shown in Figure 13.

Other Operations. When a thread commit is detected (this information comes from the thread control unit), the controller clears all store bits of the committed thread. On the other hand, when the thread control unit reports that another thread has been squashed, the controller checks whether that thread is a predecessor. If it is, the current thread must be squashed as well, because it may have consumed a value forwarded from the squashed thread. Of course, it may not have, but the hardware would become more complex if it had to distinguish these cases and identify exactly which threads to squash, so we decided to squash all successors of a squashed thread. In either case, the controller then clears all store bits of the squashed thread in its local directory. The flow charts of these operations are shown in Figures 14 and 15.

(Figure 14: Operation when another thread commits. Figure 15: Operation when another thread squashes.)

Finally, the operations at the commit and squash of the thread itself are performed as follows. When a thread commits, the controller searches for the words whose Stale bit is set and invalidates them; afterwards, the Stale bit and the Load bit are reset. On the other hand, when a thread is squashed, the words whose Forwarded bit is set are invalidated in the same way, and then the Forwarded bit and the Load bit are cleared.

5 Complexity Analysis

The complexity of our cache model is analyzed in terms of the hardware overhead incurred by the additional memory state bits and control logic. To quantify the overhead of the state bits, we estimate cache access time using the CACTI tool [13]. Since the state bits are kept in the cache directory, we compare the access times of the cache directory and of the cache itself, and discuss the possible impact on cache access latency. The delay of the additional control logic is estimated using the method of logical effort [12]. In this method, the delay incurred by a logic gate is the sum of a parasitic delay p and an effort delay f. The effort delay is further expressed as the product of the logical effort g, which describes how much bigger than an inverter a gate must be to drive loads as well as the inverter can, and the electrical effort h, which is the ratio of output to input capacitance of the gate:

    d = f + p = gh + p                                                    (1)

The delay D along an N-stage logic path is the sum of the delays through each stage:

    D = \sum_{i=1}^{N} f_i + \sum_{i=1}^{N} p_i                           (2)

It is known that D is minimized when the effort delay through each stage equals an optimal effort delay \hat{f}:

    \hat{D} = N \hat{f} + \sum_{i=1}^{N} p_i                              (3)

where \hat{f} is given by

    \hat{f} = F^{1/N} = \left( \prod_{i=1}^{N} g_i \prod_{i=1}^{N} b_i \prod_{i=1}^{N} h_i \right)^{1/N}   (4)

Here, b_i is the branching effort of stage i, which accounts for the fanout of the logic gate in that stage. To estimate the delay overhead of the control logic, we model the critical path of the logic and calculate \hat{D} along the path. As the measure of delay, we use the delay of a fanout-of-four (FO4) inverter; it is known that delay normalized by the FO4 metric holds roughly constant over a wide range of process technologies. To provide a concrete example, absolute delays at 90 nm technology are also shown, assuming 1 [FO4] = 36 ps.
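As a numerical check of equations (1) to (4), the short program below evaluates the two-stage owner-probing path of Figure 17 (path effort F = 56k/3, total parasitic delay P = 4, with k the load factor on the bus lines). For k = 1 it reproduces the 2.5 FO4 figure quoted in Section 5.2; packaging the computation as a program is our own addition.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double k = 1.0;
    const int    N = 2;                        // stages on the path
    const double F = 56.0 * k / 3.0;           // path effort G * B * H
    const double P = 4.0;                      // sum of parasitic delays
    const double f_opt = std::pow(F, 1.0 / N); // eq. (4): optimal stage effort
    const double D_min = N * f_opt + P;        // eq. (3): minimum path delay
    const double fo4   = D_min / 5.0;  // an FO4 inverter has d = gh+p = 4+1
    std::printf("f = %.2f, D = %.2f = %.1f FO4\n", f_opt, D_min, fo4);
    // prints: f = 4.32, D = 12.64 = 2.5 FO4
}
```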

5.1 Cache Directory Access Time

In this section, we discuss the overhead incurred by the cache state bits. The L1 cache configuration we assume is given in Table 2.

Table 2 : L1 cache parameters

  Size           32 kB
  Line size      64 bytes
  Associativity  2
  Tag            18 bits

We assume that the cache data, tag, and state bits are each kept in a separate memory structure, as illustrated in Figure 16; the three memory arrays can be accessed in parallel. The array that holds the state bits is denoted the cache directory. Our coherence protocol requires eight state bits per word. Assuming a 32 kB cache with 64-byte lines, this results in 4 kB of cache directory in total. Because additional state bits are required to maintain speculative state, and the states are kept on a per-word basis, the directory is much larger than one for a conventional coherence protocol: the simplest MSI protocol needs only three state bits per line, so its directory would be only 192 bytes in size. The arithmetic behind both figures is spelled out at the end of this section.

(Figure 16: Cache model. The address decoder feeds the tag array, the data array and the cache directory, each with its own drivers, sense amplifiers, multiplexers and output drivers; the tag comparator and the directory's control logic operate in parallel.)

(Table 3: Cache access time, estimated using CACTI. The access time of the cache directory is 11.4 [FO4], about 410 ps at 90 nm, which is less than the tag compare time.)

Since the directory access can be performed in parallel with the tag comparison, the directory is not expected to affect cache access latency. This estimation assumes a 32-bit address space; in a larger address space, the tag array would be larger and require more access time, so the cache directory would be even less likely to fall on the critical path.
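The directory sizes quoted above follow from simple arithmetic, checked by the snippet below (the program is ours; the parameters are those of Table 2).

```cpp
#include <cstdio>

int main() {
    const int cache_bytes = 32 * 1024;   // 32 kB L1 cache
    const int line_bytes  = 64;
    const int word_bytes  = 8;           // 8 words per 64-byte line

    const int words = cache_bytes / word_bytes;   // 4096 words
    const int lines = cache_bytes / line_bytes;   // 512 lines

    // Speculative protocol: 8 state bits per word -> 4 kB of directory.
    std::printf("speculative directory: %d bytes\n", words * 8 / 8);
    // Plain MSI: 3 state bits per line -> 192 bytes of directory.
    std::printf("MSI directory:         %d bytes\n", lines * 3 / 8);
}
```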

5.2 Control Logic Delay

We estimate the complexity of the control logic required for four operations of the cache controller: owner probing and data transfer for data forwarding between PUs, violation detection, and state transition. The critical paths of the logic blocks are illustrated in Figures 17 to 20, along with their estimated delays. The functions of these logic blocks are briefly described below.

Forward - Owner Probing. On receiving a forward request, each PU checks whether it has the requested line in its cache. It then checks the Modified bit of each word in the line and, if the bit is set, asserts the Owner PU line to claim ownership.

Forward - Transfer. Each PU examines which PUs have claimed ownership of the requested words. It compares the speculation levels of all the owner PUs and checks whether it is qualified to forward each word. If it is, it puts the word on the bus; if no PU is, the request is sent to the L2 cache.

Violation Detection. On receiving a broadcast that a store has been executed, each PU first identifies which PUs have stored to the corresponding word by checking its store bits. It then compares the speculation levels of those PUs with its own and with that of the latest store. Finally, it checks the Modified bit and Load bit of the word and detects a violation.

State Transition. The Forwarded, Load, Modified and Invalid bits are managed according to processor actions and messages from the bus.

(Figures 17 to 20: Critical paths of the owner-probing, transfer, violation-detection and state-transition logic, annotated with logical-effort parameters. With a unit bus load, k = 1, the logic delays evaluate to 2.5, 7.3, 7.16 and 9.8 FO4, respectively; for owner probing, violation detection and state transition, adding the 11.4 FO4 directory access gives total delays of 13.9, 18.6 and 21.2 FO4.)

Table 4 summarizes the estimated time needed for each operation. The operation times include the access time to the cache directory, previously estimated to be 11.4 [FO4] (Table 3).

Table 4 : Operation delay including access time to cache directory

  Operation             Delay [FO4]    Delay [ps]
  Owner Probing         13.9           500
  Transfer              7.3 (logic)    263
  Violation Detection   18.6           670
  State Transition      21.2           763

It can be seen that the directory access occupies a large part of the path delay. The results also indicate that the control logic may slightly extend the cache access latency shown in Table 3. Overall, however, it is estimated that the cache controller can operate at a reasonable cycle time of less than 20 [FO4].

6 Conclusion

A number of memory speculation mechanisms for speculative multithreading chip multiprocessors have been proposed, and many performance studies have been reported. However, it is still doubtful whether these mechanisms are complexity-effective to implement. In this paper, we performed a complexity analysis of a cache controller designed by extending an MSI controller to support thread-level memory speculation. We modeled and estimated the delay of the logic on the critical paths, and the area overhead of holding the additional control bits in the cache directory. We found that the increase in cache directory area over the original MSI cache directory is significant; this is a consequence of maintaining the speculative state of memory on a per-word rather than a per-line basis. For many protocol operations, the directory access occupies more than half of the total delay. This directory overhead is, however, smaller than the delay for accessing and comparing the cache tags. Since the cache directory access and the protocol logic operation can be performed in parallel with the cache tag access, a significant increase in critical path delay can be avoided. Finally, the results also indicate that the cache controller can be implemented at reasonable speed (a cycle time of less than 20 [FO4]), although in that case some operations must be divided into several cycles.

References

[1] Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael Chen and Kunle Olukotun, "The Stanford Hydra CMP", IEEE Micro, March-April 2000. Also presented at Hot Chips 11, August 1999.

[2] Lance Hammond, Mark Willey and Kunle Olukotun, "Data Speculation Support for a Chip Multiprocessor", Proceedings of the 8th International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1998.

[3] J. Gregory Steffan, Christopher B. Colohan and Todd C. Mowry, "Extending Cache Coherence to Support Thread-Level Speculation on a Single Chip and Beyond", Technical Report CMU-CS, School of Computer Science, Carnegie Mellon University, December 1998.

[4] J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai and Todd C. Mowry, "A Scalable Approach to Thread-Level Speculation", Proceedings of the 27th International Symposium on Computer Architecture (ISCA), 2000.

[5] M. Franklin and G. S. Sohi, "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References", IEEE Transactions on Computers, 45(5), 1996.

[6] Gurindar S. Sohi, Scott E. Breach and T. N. Vijaykumar, "Multiscalar Processors", Proceedings of the 22nd International Symposium on Computer Architecture (ISCA), 1995.

[7] Venkata Krishnan and Josep Torrellas, "Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor", Proceedings of the International Conference on Supercomputing (ICS), July 1998.

[8] Venkata Krishnan and Josep Torrellas, "A Chip-Multiprocessor Architecture with Speculative Multithreading", IEEE Transactions on Computers, 48(9), 1999.

[9] Jenn-Yuan Tsai, Jian Huang, Christoffer Amlo, David J. Lilja and Pen-Chung Yew, "The Superthreaded Processor Architecture", IEEE Transactions on Computers, 48(9), 1999.

[10] Sridhar Gopal, T. N. Vijaykumar, James E. Smith and Gurindar S. Sohi, "Speculative Versioning Cache", Proceedings of the 4th International Symposium on High-Performance Computer Architecture (HPCA), 1998.

[11] Ron Ho, Kenneth W. Mai and Mark A. Horowitz, "The Future of Wires", Proceedings of the IEEE, 89(4), April 2001.

[12] I. Sutherland, B. Sproull and D. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann, San Francisco, 1999.

[13] S. Wilton and N. Jouppi, "An Enhanced Access and Cycle Time Model for On-Chip Caches", Technical Report 93/5, Digital Western Research Laboratory, July 1994.


More information

Dynamic Performance Tuning for Speculative Threads

Dynamic Performance Tuning for Speculative Threads Dynamic Performance Tuning for Speculative Threads Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre, Ankit Tarkas, Wei-Chung Hsu, and Antonia Zhai Dept. of Computer Science and Engineering Dept. of

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007, Chapter 3 (CONT II) Instructor: Josep Torrellas CS433 Copyright J. Torrellas 1999,2001,2002,2007, 2013 1 Hardware-Based Speculation (Section 3.6) In multiple issue processors, stalls due to branches would

More information

Handout 3 Multiprocessor and thread level parallelism

Handout 3 Multiprocessor and thread level parallelism Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed

More information

Tradeoffs in Buffering Speculative Memory State for Thread-Level Speculation in Multiprocessors

Tradeoffs in Buffering Speculative Memory State for Thread-Level Speculation in Multiprocessors Tradeoffs in Buffering Speculative Memory State for Thread-Level Speculation in Multiprocessors MARíA JESÚS GARZARÁN University of Illinois at Urbana-Champaign MILOS PRVULOVIC Georgia Institute of Technology

More information

Portland State University ECE 587/687. Memory Ordering

Portland State University ECE 587/687. Memory Ordering Portland State University ECE 587/687 Memory Ordering Copyright by Alaa Alameldeen and Haitham Akkary 2012 Handling Memory Operations Review pipeline for out of order, superscalar processors To maximize

More information

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017 CS433 Homework 6 Assigned on 11/28/2017 Due in class on 12/12/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.

More information

NOW that the microprocessor industry has shifted its

NOW that the microprocessor industry has shifted its IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007 1 CMP Support for Large and Dependent Speculative Threads Christopher B. Colohan, Anastasia Ailamaki, Member, IEEE Computer

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

THE STANFORD HYDRA CMP

THE STANFORD HYDRA CMP THE STANFORD HYDRA CMP CHIP MULTIPROCESSORS OFFER AN ECONOMICAL, SCALABLE ARCHITECTURE FOR FUTURE MICROPROCESSORS. THREAD-LEVEL SPECULATION SUPPORT ALLOWS THEM TO SPEED UP PAST SOFTWARE. Lance Hammond

More information

Reconfigurable and Self-optimizing Multicore Architectures. Presented by: Naveen Sundarraj

Reconfigurable and Self-optimizing Multicore Architectures. Presented by: Naveen Sundarraj Reconfigurable and Self-optimizing Multicore Architectures Presented by: Naveen Sundarraj 1 11/9/2012 OUTLINE Introduction Motivation Reconfiguration Performance evaluation Reconfiguration Self-optimization

More information

Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization li for Multiprocessors

Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization li for Multiprocessors Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization li for Multiprocessors Marcelo Cintra and Josep Torrellas University of Edinburgh http://www.dcs.ed.ac.uk/home/mc

More information

Speculative Locks for Concurrent Execution of Critical Sections in Shared-Memory Multiprocessors

Speculative Locks for Concurrent Execution of Critical Sections in Shared-Memory Multiprocessors Wshp. on Memory Performance Issues, Intl. Symp. on Computer Architecture, June 2001. Speculative Locks for Concurrent Execution of Critical Sections in Shared-Memory Multiprocessors José F. Martínez and

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

Computer Architecture

Computer Architecture 18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

CS 351 Final Exam Solutions

CS 351 Final Exam Solutions CS 351 Final Exam Solutions Notes: You must explain your answers to receive partial credit. You will lose points for incorrect extraneous information, even if the answer is otherwise correct. Question

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017 CS433 Homework 6 Assigned on 11/28/2017 Due in class on 12/12/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.

More information

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Concurrent/Parallel Processing

Concurrent/Parallel Processing Concurrent/Parallel Processing David May: April 9, 2014 Introduction The idea of using a collection of interconnected processing devices is not new. Before the emergence of the modern stored program computer,

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

Tolerating Dependences Between Large Speculative Threads Via Sub-Threads

Tolerating Dependences Between Large Speculative Threads Via Sub-Threads Tolerating Dependences Between Large Speculative Threads Via Sub-Threads Christopher B. Colohan, Anastassia Ailamaki, J. Gregory Steffan, and Todd C. Mowry School of Computer Science Carnegie Mellon University

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Homework 6. BTW, This is your last homework. Assigned today, Tuesday, April 10 Due time: 11:59PM on Monday, April 23. CSCI 402: Computer Architectures

Homework 6. BTW, This is your last homework. Assigned today, Tuesday, April 10 Due time: 11:59PM on Monday, April 23. CSCI 402: Computer Architectures Homework 6 BTW, This is your last homework 5.1.1-5.1.3 5.2.1-5.2.2 5.3.1-5.3.5 5.4.1-5.4.2 5.6.1-5.6.5 5.12.1 Assigned today, Tuesday, April 10 Due time: 11:59PM on Monday, April 23 1 CSCI 402: Computer

More information

Portland State University ECE 588/688. Cache Coherence Protocols

Portland State University ECE 588/688. Cache Coherence Protocols Portland State University ECE 588/688 Cache Coherence Protocols Copyright by Alaa Alameldeen 2018 Conditions for Cache Coherence Program Order. A read by processor P to location A that follows a write

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 24 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 2: More Multiprocessors Computation Taxonomy SISD SIMD MISD MIMD ILP Vectors, MM-ISAs Shared Memory

More information

Chapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1>

Chapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1> Chapter 7 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 7 Chapter 7 :: Topics Introduction (done) Performance Analysis (done) Single-Cycle Processor

More information

12 Cache-Organization 1

12 Cache-Organization 1 12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty

More information

15-740/ Computer Architecture

15-740/ Computer Architecture 15-740/18-740 Computer Architecture Lecture 16: Runahead and OoO Wrap-Up Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/17/2011 Review Set 9 Due this Wednesday (October 19) Wilkes, Slave Memories

More information

CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers

CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers This was a 180-minute open-book test. You were to answer five of the six questions. Each question was worth 20 points.

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Chapter 06: Instruction Pipelining and Parallel Processing

Chapter 06: Instruction Pipelining and Parallel Processing Chapter 06: Instruction Pipelining and Parallel Processing Lesson 09: Superscalar Processors and Parallel Computer Systems Objective To understand parallel pipelines and multiple execution units Instruction

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Final Review Shuai Wang Department of Computer Science and Technology Nanjing University Computer Architecture Computer architecture, like other architecture, is the art

More information

114 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 2, FEBRUARY 2004

114 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 2, FEBRUARY 2004 114 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 2, FEBRUARY 2004 Thread Partitioning and Value Prediction for Exploiting Speculative Thread-Level Parallelism Pedro Marcuello, Antonio González, Member,

More information

Compiler Techniques for Energy Saving in Instruction Caches. of Speculative Parallel Microarchitectures.

Compiler Techniques for Energy Saving in Instruction Caches. of Speculative Parallel Microarchitectures. Compiler Techniques for Energy Saving in Instruction Caches of Speculative Parallel Microarchitectures Seon Wook Kim Rudolf Eigenmann School of Electrical and Computer Engineering Purdue University, West

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor

Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor Ying Chen, Resit Sendag, and David J Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

15-740/ Computer Architecture Lecture 5: Precise Exceptions. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 5: Precise Exceptions. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 5: Precise Exceptions Prof. Onur Mutlu Carnegie Mellon University Last Time Performance Metrics Amdahl s Law Single-cycle, multi-cycle machines Pipelining Stalls

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information