Hardware-based Synchronization Support for Shared Accesses in Multi-core Architectures

Bo Hong
Drexel University, Philadelphia, PA 19104
bohong@coe.drexel.edu

Abstract

A new hardware-based design is presented to support shared accesses in multi-core processors. In the proposed design, instructions updating shared variables are issued by the processor cores but executed by the proposed hardware unit, which snoops on the bus. After issuing such an instruction, a processor core can proceed immediately with subsequent instructions. The proposed hardware unit assumes the responsibility of buffering and sequentializing the multiple access requests from the processor cores. Design details of the hardware are discussed in this paper, including instruction sequentialization, read operations, and replacement policies. The proposed hardware is shown to be sequentially consistent for accesses to shared variables. Comparison against lock-based synchronization shows that the proposed hardware design can significantly reduce synchronization overheads.

1 Introduction

Multi-core architectures have recently become the focus of intensive research due to the growing interest in thread-level parallelism. This is a natural response to the diminishing performance improvement obtained from exploiting instruction-level parallelism. Given the current trend towards multi-core architectures, the extent to which an application can be multi-threaded to keep the multiple processor cores busy is likely to be one of the greatest constraints on the performance of next-generation computing platforms. However, except for embarrassingly parallel workloads, where no particular effort is needed to segment the problem into a very large number of independent tasks, it is often very challenging for multi-threading to achieve efficiency due to the intrinsic data and control dependencies in applications.

The level and functionality of hardware support has a major impact on how multi-threading needs to be implemented. The existing methodology is to permit maximum concurrent accesses to shared memory locations (or shared cache) while maintaining data consistency through cache coherency protocols. The hardware provides atomic operations [1] that can be used to implement synchronization primitives such as locks and barriers [2]. Multi-threading under such a model requires programmers to manage thread advancement using barriers and to protect critical regions using locks [3]. Such a synchronization mechanism is absolutely necessary for thread control, but at the same time it causes costly overheads because the desired synchronization behavior is achieved by stalling (a subset of) the threads waiting on locks. Even worse, the stalled threads continuously check the status of the lock, thus consuming precious bus bandwidth and further lengthening lock acquisition time as the number of processors increases [2].

This paper presents a new hardware design that aims at reducing such synchronization overheads. The new design is based on the observation that shared accesses are often incremental updates, which may take various forms such as consecutive additions or multiplications, or consecutive insertion and extraction operations on a queue. For such incremental updates, the presented hardware design is able to release the processor cores from acquiring exclusive ownership before updating shared data.
The update requests are issued by the processors but executed by the proposed hardware, which buffers and sequentializes the update requests, thus providing both atomicity and sequential consistency for shared accesses. Immediately after posting their update requests, the processor cores are able to continue with the execution of subsequent instruction streams. In summary, the presented hardware offloads the synchronization overheads found in software-based locking mechanisms. Comparison against lock-based methods shows that the proposed hardware leads to improved concurrency in multi-threading because the synchronization overhead is greatly reduced.

The rest of the paper is organized as follows.

Section 2 briefly reviews related work. Section 3 describes the proposed hardware. Section 4 discusses the consistency property of the proposed hardware. Performance analysis is presented in Section 5. Discussions and future research directions are presented in Section 6.

2 Related Work

Multi-core Architectures. Multi-core processors are steadily gaining significance and popularity in computing [4]. A multi-core microprocessor combines two or more independent CPU cores into a single package. Each core is independently implemented with optimizations such as pipelining, superscalar execution, out-of-order execution, and multi-threading. Cores in a multi-core processor may share a single coherent cache at the highest on-chip level (e.g., L2 for the Intel Core 2) or may have separate caches (e.g., current AMD dual-core processors). The cores share the same interconnect to the rest of the system, primarily to the shared memory and I/O. The close proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip, as in traditional SMP systems [5].

Cache coherence is the key to parallel processing in multi-core architectures [6]. The coherency problem emerges when multiple processors access the shared memory through their private caches: one processor may see a new value in its cache while others still see the old value. Cache coherency is enforced through cache coherency protocols that either invalidate or update data in the caches [2]. The two most common types are snooping-based protocols and directory-based protocols.

While cache coherency allows multiple threads to access shared data, it is the synchronization mechanism that ensures the correct execution of concurrent programs. The most frequently used mechanism is a lock [3], which is typically implemented in hardware in the form of one or more atomic instructions such as test-and-set and load-locked/store-conditional. These instructions allow a single process/thread to test whether the lock is free and, if so, acquire it in a single atomic operation. Other synchronization mechanisms, such as barriers, can be implemented in software on top of locks [2].

Locks have overheads. A processor needs to allocate memory space for locks and spend CPU cycles to initialize, acquire, release, and destroy locks [2]. When one processor or thread attempts to acquire a lock held by another processor or thread, either busy-waiting or context-switching is needed [7]. In cases where switching to another computing task is either impossible or too expensive, busy-waiting has to be used, but it leads to wasted cycles. Furthermore, the waiting threads continuously check the status of the lock, thus consuming precious bus bandwidth and further lengthening lock acquisition time as the number of processors increases [2]. The proposed design constitutes an overhaul of the traditional locking mechanism: the new hardware automatically sequentializes complex operations on shared memory locations, thus eliminating thread stalls caused by synchronization.
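To make the busy-waiting overhead described above concrete, the following is a minimal test-and-set spin lock sketched with C11 atomics (illustrative only; it is not part of the proposed design, and the type and function names are assumptions):

    #include <stdatomic.h>

    typedef struct { atomic_flag held; } spinlock_t;   /* hypothetical lock type */

    static void spin_lock(spinlock_t *l) {
        /* Each failed test-and-set writes the cache line holding the lock,
         * producing the invalidation traffic discussed in the text. */
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            ;  /* busy-wait: burns cycles and bus bandwidth */
    }

    static void spin_unlock(spinlock_t *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }

A lock variable would be initialized with ATOMIC_FLAG_INIT; every thread that fails the test-and-set keeps retrying, which is exactly the contention pattern the proposed hardware aims to avoid.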
Transactional Memory. Transactional memory is a concurrency control mechanism for controlling accesses to shared memory in concurrent computing. It functions as an alternative to lock-based synchronization and is typically implemented in a lock-free way. A transaction in this context is a piece of code that executes a series of reads and writes to shared memory locations. These reads and writes logically occur at a single instant in time; intermediate states are not visible to other (successful) transactions. The idea of providing hardware support for transactions originated in [8]. Software-only transactional memory has recently been the focus of intense research, and support for practical implementations is growing [9]. In transactional memory, every thread completes its modifications to shared memory without regard for what other threads might be doing, recording every read and write that it makes in a log. After completing an entire transaction, a thread verifies that other threads have not concurrently made changes to memory that it accessed in the past. This operation, in which the changes of a transaction are validated and, if validation is successful, made permanent, is called a commit. A transaction may also abort at any time, causing all of its prior changes to be rolled back or undone. If a transaction cannot be committed due to conflicting changes, it is typically aborted and re-executed from the beginning until it succeeds [10].

The proposed design differs significantly from transactional memory in that it is based on hardware sequentialization rather than (software) transactions, which would require considerable resources for logging and possible roll-backs.

3 Hardware Design of the Accumulative Execution Unit

The proposed design focuses on the set of shared variable accesses where the same associative mathematical operation is consecutively applied to a shared variable. Such operations are denoted as accumulative computations in this paper. Both the summation and product notations, $\sum_{i=1}^{n} a_i$ and $\prod_{j=1}^{n} b_j$, are examples of accumulative computations.

Logical AND and OR operations are also accumulative computations. Queue insertions and extractions can also be considered generalized accumulative operations. Such accumulative computations represent a large portion of shared variable usage. Example applications include shared counters, shared task queues, and process lists in multi-processor operating systems. It is expected that by supporting accumulative computations, the proposed hardware can improve the concurrency of a wide range of multi-threaded applications.

The proposed hardware is named the Accumulative Execution Unit (AEU). There are four categories of accumulative instructions, as illustrated in Figure 1: read only, write only, compute only, and compute and read.

[Figure 1: The four categories of accumulative instructions. When executing these instructions, memory is accessed only if necessary (when the AEU does not hold the requested variable or when replacement is needed), as illustrated by the dotted lines marked a, b, c, and d.]

A read only instruction is similar to a normal memory read except that the target address is a shared variable. Similarly, a write only instruction writes to a shared variable. To differentiate accumulative reads from normal read instructions, the instruction set architecture needs to be augmented. For notational purposes, an accumulative read only instruction is denoted loadx and a normal read instruction is denoted load. For write operations, the notations storex and store are used, respectively.

A compute only instruction is used when the issuing processor does not need the computation result. For example, an instruction may specify "add 1 to shared variable x", denoted addx x, 1. When executing this instruction, the AEU updates x but does not return the updated x to the processor. This instruction is different from the normal add instruction add x, 1 as in 80x86 assembly: executing add x, 1 involves loading memory location x into the processor core, executing the addition inside the processor core, and finally writing the result back to x. The execution of addx x, 1, however, takes place entirely outside the processor core and does not involve a write-back operation. The value of x is maintained by the AEU and is written back to memory only when replacement occurs. (Details of replacement are discussed later in this section.)

A compute and read instruction performs a computation and returns the result to the processor. For example, an instruction may specify "add 1 to shared variable x and return the result in local register R1", denoted addr R1, x, 1. To ensure atomic execution, the AEU completes the entire instruction (both the computation and the read parts) before executing another instruction. Unlike the compute only instruction addx x, 1, the compute and read instruction addr R1, x, 1 must return the result to the processor core after the addition.
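Expressed as software stand-ins rather than ISA mnemonics, the four categories could be sketched as follows. The aeu_* names are hypothetical and the bodies merely emulate the intended semantics with GCC atomic builtins; on the proposed hardware the operations would be executed by the AEU rather than by the issuing core.

    /* Hypothetical software stand-ins for the four accumulative instruction
     * categories; the GCC builtins emulate semantics only. */
    static long aeu_loadx(long *x) {                  /* read only (loadx) */
        return __atomic_load_n(x, __ATOMIC_SEQ_CST);
    }
    static void aeu_storex(long *x, long v) {         /* write only (storex) */
        __atomic_store_n(x, v, __ATOMIC_SEQ_CST);
    }
    static void aeu_addx(long *x, long v) {           /* compute only (addx x,v) */
        (void)__atomic_fetch_add(x, v, __ATOMIC_SEQ_CST);
    }
    static long aeu_addr(long *x, long v) {           /* compute and read (addr R1,x,v) */
        return __atomic_add_fetch(x, v, __ATOMIC_SEQ_CST);
    }

The essential difference is that the builtins above still require the issuing core to obtain exclusive ownership of the cache line, whereas the corresponding accumulative instructions are executed by the AEU and the core never stalls on the update.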
Figure 2 illustrates the proposed AEU design. When a processor core encounters an accumulative instruction, the instruction is issued onto the bus for the AEU to capture. By utilizing its FIFO instruction queue, the AEU retains the order in which the instructions appeared on the bus. Encoded in each instruction are the requested computation and the address of the shared variable. The memory address is used to identify the storage register holding the current value of the shared variable, or to allocate a storage register if this is a new shared variable. This storage register and, if present, the immediate operand encoded in the instruction are taken as inputs by one of the functional units for the execution of the instruction. The write unit is dedicated to the execution of write only instructions; it does not need input from any storage register but instead writes to a target storage register.

To allow concurrency when executing accumulative instructions, reservation stations are used to buffer instructions for the functional units, similar to the Tomasulo design [1]. With reservation stations, if an instruction in the queue needs a result that is still being computed by a functional unit, the instruction can still be issued; it waits for the result at a reservation station instead of blocking the entire issuing process. Outputs from the functional units are written to the storage registers and, if necessary, forwarded to the reservation stations. Due to the use of reservation stations, instructions may be executed out of order. For shared accesses, however, the order of execution is extremely important to the correctness of the result.

[Figure 2: The proposed Accumulative Execution Unit (AEU). The illustrative hypothetical architecture is a quad-core processor with shared memory; the AEU comprises an instruction queue, a completion buffer (with per-instruction status and a data-for-read-op field), an output buffer, storage registers (with memory address, status, reference statistics, and data fields), reservation stations, and functional units including a write unit, integer adders, and floating-point adders.]

To maintain in-order instruction completion in the face of out-of-order execution, a FIFO completion buffer is introduced to track the execution status of all issued instructions. Whenever an instruction is issued to a reservation station, it is also dispatched to the completion buffer with an initial status of not-completed. The instruction status is updated whenever a functional unit completes an instruction. Once the instruction at the bottom of the buffer (assuming insertion occurs at the top) changes to completed, it is cleared out of the buffer and the completion buffer shifts down by one entry. Internally, the functional units execute instructions out of order, but if a processor core attempts to read the value of a shared variable, whether through a read only or a compute and read instruction, the read operation is not performed until the instruction reaches the bottom of the completion buffer, i.e., after all previous instructions have completed. Thus, from the processors' point of view, instructions are completed in program order, although the actual execution is out of order (for improved concurrency). Note that the data for a read operation may become available before the read reaches the bottom of the completion buffer, in which case the data is saved in the data-for-read-op field of the completion buffer. When replacement is considered, instruction issue, execution, and completion still follow the same procedure, but a few more complicated details are involved; these are presented later in this section.

Sequentialization. As illustrated in Figure 2, the AEU has a single interface snooping the bus for accumulative instructions. Because the bus is a shared medium that can accommodate only one request at a time, accumulative instructions are automatically sequentialized in the order in which they appear on the bus. This order is then preserved by the AEU as it tracks instruction completion status and allows the processor cores to read shared variables only in the order in which the instructions were issued. More importantly, the access order is obtained without using locks to control thread advancement at any of the processor cores.
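The in-order retirement rule enforced by the completion buffer can be sketched in C as a small ring buffer; the size, field names, and the three operations below are illustrative assumptions, not the hardware specification.

    #include <stdbool.h>
    #include <stdint.h>

    #define CB_SIZE 16   /* assumed buffer depth */

    struct cb_entry {
        uint32_t inst_id;     /* instruction identifier              */
        bool     completed;   /* set by a functional unit            */
        bool     is_read;     /* read only or compute and read       */
        long     read_data;   /* the "data for read op." field       */
    };

    struct completion_buffer {
        struct cb_entry e[CB_SIZE];
        int head, tail, count;        /* head = bottom (oldest entry) */
    };

    /* Dispatch: called when an instruction is issued to a reservation station. */
    static bool cb_dispatch(struct completion_buffer *cb, uint32_t id, bool is_read) {
        if (cb->count == CB_SIZE) return false;    /* buffer full: stall issue */
        cb->e[cb->tail] = (struct cb_entry){ .inst_id = id, .is_read = is_read };
        cb->tail = (cb->tail + 1) % CB_SIZE;
        cb->count++;
        return true;
    }

    /* A functional unit finished instruction `id` (possibly out of order). */
    static void cb_complete(struct completion_buffer *cb, uint32_t id, long data) {
        for (int i = 0, idx = cb->head; i < cb->count; i++, idx = (idx + 1) % CB_SIZE)
            if (cb->e[idx].inst_id == id) {
                cb->e[idx].completed = true;
                cb->e[idx].read_data = data;
                return;
            }
    }

    /* Retire only the bottom entry, and only once it has completed, so data
     * is released to the cores in the order the instructions hit the bus. */
    static bool cb_retire(struct completion_buffer *cb, long *read_data_out) {
        if (cb->count == 0 || !cb->e[cb->head].completed) return false;
        if (cb->e[cb->head].is_read) *read_data_out = cb->e[cb->head].read_data;
        cb->head = (cb->head + 1) % CB_SIZE;
        cb->count--;
        return true;
    }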

Support for Read Operations. A read operation can be either a read only instruction or the read component of a compute and read instruction. In either case, the read operation is executed (i.e., the data is sent to the processor) when it reaches the bottom of the completion buffer. The following two scenarios may occur for read operations.

1. The shared variable is not present in a storage register. This may occur for two reasons: (a) this is the first ever access to the shared variable, so it has never been retrieved by the AEU before, or (b) the shared variable was in the AEU but was replaced, written back to memory, and hence is now absent from the AEU. For read only instructions, upon detecting the absence of the requested variable, the AEU forwards the read request to the memory via the bus. The forwarded request has the same format as a normal memory read request issued by a processor. The response from the memory is captured by both the requesting processor and the AEU. For the processor, this completes the read. For the AEU, capturing the response triggers the allocation of a storage register for the requested variable so that future accesses to this variable can be handled within the AEU. For compute and read instructions, the AEU first sends a read request to the memory to load the shared variable into the AEU, replacing an existing storage register if necessary. The AEU then performs the requested computation and sends the result to the requesting processor core via the bus.

2. The shared variable is present in the AEU. In this scenario, the read operation is performed when it reaches the bottom of the completion buffer, which occurs after all preceding instructions have been completed and cleared out of the buffer. If it is a read only instruction, the value in the data-for-read-op field is sent onto the bus; if it is a compute and read instruction, the compute component may be executed early, but the result is sent to the processor only after the instruction reaches the bottom of the completion buffer.

Replacement. Replacement is initiated during the issue stage when an instruction needs to access a new shared variable but all the storage registers have been allocated to other shared variables. Since the shared variable is not in any of the storage registers, the AEU sends a memory request to retrieve the shared variable from memory. At the same time, the issue process is stalled until a storage register becomes available. First, the replacement policy is used to select a storage register, which is of course currently allocated to another shared variable. Depending on whether there are pending instructions at the reservation stations that need to update the selected storage register, the contents of the storage register are sent to the output buffer either immediately or after those instructions complete. As soon as the selected storage register is written to the output buffer, the issue process resumes, sending the stalled instruction to the completion buffer as well as to a proper reservation station. In the meantime, the memory address field of the selected storage register is changed to the new shared variable. Instruction execution starts once the memory responds with the requested data and, of course, once a functional unit is available.

To select a storage register for replacement, there is a range of policies to choose from, as in the case of cache replacement. Possible policies include Least Recently Used (LRU), Random, First-In First-Out (FIFO), and Pseudo LRU. In addition to these choices, the Least Outstanding Instructions (LOI) policy is designed around the buffering feature of the AEU. LOI selects the storage register with the fewest pending instructions; this minimizes the time required to clear the storage register so that new instructions can be dispatched as early as possible.
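A sketch of the LOI selection step, assuming each storage register tracks a count of pending instructions (the structure, size, and field names are illustrative, not part of the specification):

    #include <stdint.h>

    #define NUM_STORAGE_REGS 32   /* assumed number of storage registers */

    struct storage_reg {
        uint64_t mem_addr;   /* shared variable currently held            */
        int      pending;    /* instructions still waiting to update it   */
        long     value;
    };

    /* LOI: choose the register with the fewest outstanding instructions,
     * so it can be written back and reallocated as soon as possible. */
    static int loi_select_victim(const struct storage_reg regs[NUM_STORAGE_REGS]) {
        int victim = 0;
        for (int i = 1; i < NUM_STORAGE_REGS; i++)
            if (regs[i].pending < regs[victim].pending)
                victim = i;
        return victim;
    }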
At the time of this paper's submission, a simulator is still being developed to evaluate design parameters. A quantitative study of the replacement policy will be reported in a subsequent paper, with considerations of throughput, response time, and the avoidance of oscillation (where a particular shared variable is replaced again and again).

4 Consistency Property of the AEU

The AEU preserves sequential consistency for accesses to shared variables. Formally, sequential consistency is the property of a multi-processor system such that, for the execution of any program, an equivalent result can be obtained by hypothetically sequentializing all the shared variable accesses; in the hypothetical order, accesses by any particular processor retain the order in which they were issued in the execution [2].

In the AEU design, accesses to each shared variable take the form of accumulative computations, which are physically sequentialized as they are issued to the AEU via the bus. This sequential order is further retained by the instruction queue, where instruction completion is monitored by the completion buffer. Consequently, accesses to all shared variables are completed in the order in which they were issued onto the bus (though execution may be performed out of order for performance reasons). Such an order satisfies the requirement of sequential consistency, and thus the AEU design preserves sequential consistency.
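As a software-level illustration of what this guarantee buys (not a description of the hardware itself): if every update is an atomic accumulative operation applied in one global order, the final value of a shared variable is independent of how the threads interleave. A minimal pthreads sketch, using an atomic fetch-and-add as a stand-in for addx:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static long counter = 0;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < N; i++)
            __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);  /* like addx counter,1 */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Atomicity plus a single global order of updates guarantees 2*N. */
        printf("counter = %ld (expected %d)\n", counter, 2 * N);
        return 0;
    }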

5 Performance Evaluation

In this section, the performance of the proposed AEU design is compared against the conventional lock-based mechanism, using the application of a shared task queue as illustrated in Figure 3.

[Figure 3: The sample application of a shared task queue accessed by processors 1 through n. Tasks are generated on the fly, inserted into the queue (enqueue), and then extracted for execution (dequeue). The task queue is shared, as are the queue head and tail pointers.]

To implement a task queue using the conventional lock-based mechanism, one lock is needed to protect the queue head pointer and another for the queue tail pointer. Task extraction can be performed only after a successful acquisition of the lock, after which the processor advances the head pointer and releases the lock. Similarly, task insertion also needs to be protected by locks. The overhead of locking lies in the acquisition and release process. For example, with test-and-set locks, lock acquisition generates excessive bus traffic: each time a test-and-set instruction is executed to check whether the lock is free, it writes to the cache block that contains the lock variable, resulting in a bus transaction that invalidates the block cached by another processor (which wrote to the block when executing its own test-and-set). Therefore, processors generate bus transactions repeatedly while waiting for the lock to become free. Such contention consumes bus bandwidth and slows down lock acquisition as the number of processors increases. Previous studies have shown that the acquisition time of test-and-set locks increases almost linearly with the number of processors. Other improved lock implementations (such as the ticket lock and the load-locked/store-conditional lock) do provide shorter acquisition times, but still exhibit a close-to-linear increase as the number of processors increases [2].

Compared with lock-based schemes, the proposed AEU design can significantly reduce the enqueue and dequeue overhead, and can possibly eliminate it entirely if the processor supports out-of-order execution. The pseudo code is listed below:

    while (head != tail) {
        addr R1, head, 1;    // advance head atomically
        execute task;
        task = dequeue R1;   // get task for next iteration
    }

List 1: Pseudo code for accessing a shared task queue with the AEU. addr is an accumulative instruction, head and tail are the shared queue pointers, and R1 is a register inside the processor core.
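For comparison, a conventional lock-based version of the same dequeue loop is sketched below; the queue layout, the task type, and the use of pthread mutexes are assumptions for illustration, not code from the paper. Every iteration pays a lock acquisition and release on the shared head pointer, which is where the bus traffic and acquisition time discussed above are incurred.

    #include <pthread.h>

    typedef void (*task_t)(void);

    struct task_queue {
        task_t          *slots;      /* circular buffer of tasks      */
        int              capacity;
        int              head;       /* protected by head_lock        */
        int              tail;       /* protected by tail_lock        */
        pthread_mutex_t  head_lock;
        pthread_mutex_t  tail_lock;
    };

    static void run_tasks(struct task_queue *q) {
        for (;;) {
            pthread_mutex_lock(&q->head_lock);      /* contended acquisition */
            if (q->head == q->tail) {               /* queue empty           */
                pthread_mutex_unlock(&q->head_lock);
                break;
            }
            task_t task = q->slots[q->head % q->capacity];
            q->head++;                              /* advance head pointer  */
            pthread_mutex_unlock(&q->head_lock);
            task();                                 /* execute outside lock  */
        }
    }

The AEU version in List 1 avoids this per-iteration acquisition entirely.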
At the beginning of each iteration in List 1 is an accumulative addr instruction that atomically advances the queue head pointer to the next queue entry, thus preparing the dequeue operation for the next iteration. The processor then proceeds with the execution of the current task. Upon completion of the task, the processor loads the task for the next iteration using the task id that was just updated by the addr instruction. The code in List 1 generates only two bus transactions each time the queue head pointer is updated: first an addr request to the AEU and later a response from the AEU, which is the minimum number required to update a shared variable. Processors do not compete for access to the shared queue head pointer, since the AEU sequentializes all the requests and executes each of them atomically. If the processor supports out-of-order execution, then the execution of the current task can start without waiting for the first addr instruction to complete. As long as the AEU sends back the updated head pointer before the completion of the current task, the dequeue operation at the bottom of each iteration can start without any delay. This way, the overhead of accessing the shared queue head pointer is completely hidden behind the execution of the task. If the processor does not support out-of-order execution, the overhead of dequeuing is the cost of the two aforementioned bus transactions, which is still significantly lower than the lock acquisition time.

In the proposed method, software pipelining is used to execute a task while waiting for the AEU's response. This technique cannot be used to effectively reduce the overhead in lock-based schemes because whenever a lock is used, whether at the beginning or the end of an iteration, multiple processors will compete for the lock, resulting in multiple bus transactions and a certain amount of acquisition time, which unfortunately also increases with the number of processors.

In summary, the proposed AEU design outperforms conventional lock-based synchronization methods through its capability of eliminating synchronization overheads. Note that this performance advantage is obtained for the class of applications that can be implemented using accumulative instructions. The identification of such applications is part of future research directions.

6 Discussions

This paper presents the overall logical design that enables the functionality of the AEU. It has been demonstrated that the AEU has the potential to overhaul the synchronization mechanism in multi-core architectures. As part of an on-going research effort to improve concurrency in multi-core architectures, the presented work is by no means the conclusion of the AEU design. To fully explore the potential of the AEU, investigations are being conducted in the following three directions.

First, design details and trade-offs of the hardware need to be investigated. Currently, a cycle-accurate multi-core architecture performance simulator is being developed to support the AEU design. The simulator will provide quantitative insight into the design and thus help answer a wide range of design questions, for example: What is the optimal number of storage registers? Is the optimal replacement policy application specific? Compared with a lock-based scheme, what is the exact amount of reduction in synchronization overhead? Because answers to these questions are generally application dependent, the gcc compiler is being ported to support the AEU design so that benchmark programs can be used in the investigation.

Second, the presented design needs to be extended to support more relaxed consistency models. Instead of implementing a single sequential order for all shared accesses, the AEU can support partial orders or individual orders for each shared variable. Relaxed consistency models are in general more difficult to program but typically lead to faster execution times. These variations of consistency models need to be supported so that further design trade-offs can be explored.

Third, algorithmic studies need to be conducted to identify the class of accumulative instructions that can be supported by the AEU, and to develop new multi-threaded algorithms to accelerate a wide range of applications. The example of the shared task queue demonstrates the superiority of the AEU design over lock-based schemes. More complicated problems need to be studied to substantiate the general applicability of the AEU design. Examples of problems to be solved include graph problems (spanning tree, shortest path, etc.) and achieving consensus among multiple threads (important for determining the termination of multi-threaded algorithms). These problems have been studied for existing architectures; the proposed investigation will focus on the development of new algorithms that can avoid or reduce synchronization overheads with the help of the AEU.

References

[1] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition. Morgan Kaufmann Publishers, 2007.
[2] D. Culler, J. P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998.
[3] A. S. Tanenbaum, Modern Operating Systems, 2nd Edition. Prentice Hall, 2001.
[4] W. Wolf, "The future of multiprocessor systems-on-chips," in DAC '04: Proceedings of the 41st Annual Conference on Design Automation. ACM Press, 2004, pp. 681-685.
[5] L. Spracklen and S. G. Abraham, "Chip multithreading: Opportunities and challenges," in HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture. IEEE Computer Society, 2005, pp. 248-252.
[6] P. Stenstrom, "A survey of cache coherence schemes for multiprocessors," Computer, vol. 23, no. 6, pp. 12-24, 1990.
[7] B.-H. Lim and A. Agarwal, "Waiting algorithms for synchronization in large-scale multiprocessors," ACM Transactions on Computer Systems, vol. 11, no. 3, pp. 253-294, 1993.
[8] T. Knight, "An architecture for mostly functional languages," in LFP '86: Proceedings of the 1986 ACM Conference on LISP and Functional Programming. ACM Press, 1986, pp. 105-112.
[9] N. Shavit and D. Touitou, "Software transactional memory," Distributed Computing, vol. 10, no. 2, pp. 99-116, 1997.
[10] C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie, "Unbounded transactional memory," IEEE Micro, vol. 26, no. 1, pp. 59-69, 2006.