Hardware-based Synchronization Support for Shared Accesses in Multi-core Architectures


Bo Hong, Drexel University, Philadelphia, PA

Abstract

A new hardware-based design is presented to support shared accesses in multi-core processors. In the proposed design, instructions updating shared variables are issued by the processor cores but executed by the proposed hardware unit, which snoops on the bus. After issuing such an instruction, the processor core can proceed immediately with subsequent instructions. The proposed hardware unit assumes the responsibility of buffering and sequentializing the multiple access requests from the processor cores. Design details of the hardware are discussed in this paper, including instruction sequentialization, read operations, and replacement policies. The proposed hardware is shown to be sequentially consistent for accesses to shared variables. Comparison against lock-based synchronization shows that the proposed hardware design can significantly reduce synchronization overheads.

1 Introduction

Multi-core architectures have recently become the focus of intensive research due to the growing interest in thread-level parallelism. This is a natural response to the diminishing performance improvement obtained from exploiting instruction-level parallelism. Given the current trend towards multi-core architectures, the extent to which an application can be multi-threaded to keep the multiple processor cores busy is likely to be one of the greatest constraints on the performance of next-generation computing platforms. However, except for embarrassingly parallel workloads, where no particular effort is needed to segment the problem into a very large number of independent tasks, it is often very challenging for multi-threading to achieve efficiency due to the intrinsic data or control dependencies in applications.

The level and functionality of hardware support has a major impact on how multi-threading needs to be implemented. The existing methodology is to permit maximum concurrent access to shared memory locations (or a shared cache) while maintaining data consistency through cache coherency protocols. The hardware provides atomic operations [1] that can be used to implement synchronization primitives like locks and barriers [2]. Multi-threading under such a model requires programmers to manage thread advancement using barriers and to protect critical regions using locks [3]. Such synchronization mechanisms are necessary for thread control but at the same time cause costly overheads, because the desired synchronization behavior is achieved by stalling (a subset of) the threads waiting on locks. Even worse, the stalled threads continuously check the status of the lock, consuming precious bus bandwidth and further lengthening lock acquisition time as the number of processors increases [2].

This paper presents a new hardware design that aims at reducing such synchronization overheads. The new design is based on the observation that shared accesses are often incremental updates, which may take various forms such as consecutive additions or multiplications, or consecutive insertion and extraction operations on a queue. For such incremental updates, the presented hardware design releases the processor cores from acquiring exclusive ownership before updating shared data.
The update requests are issued by the processors but executed by the proposed hardware, which buffers and sequentializes the update requests, thus providing both atomicity and sequential consistency for shared accesses. Immediately after posting their update requests, the processor cores are able to continue with the execution of subsequent instruction streams. In summary, the presented hardware offloads the synchronization overheads found in software-based locking mechanisms. Comparison against lock-based methods shows that the proposed hardware leads to improved concurrency in multi-threading because the synchronization overhead is greatly reduced.
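As a concrete reference point for these overheads, the following C11 sketch (illustrative, not from the paper) shows the kind of test-and-set spin lock whose busy-waiting the proposed design aims to eliminate:

    #include <stdatomic.h>

    /* A minimal test-and-set spin lock; initialize a spinlock_t with
       { ATOMIC_FLAG_INIT }.  Every failed test-and-set is a write to the
       cache line holding the flag, so each waiting core generates bus
       traffic that grows with the number of spinners. */
    typedef struct { atomic_flag held; } spinlock_t;

    static void spin_lock(spinlock_t *l) {
        while (atomic_flag_test_and_set_explicit(&l->held,
                                                 memory_order_acquire))
            ;  /* busy-wait: the core burns cycles instead of doing work */
    }

    static void spin_unlock(spinlock_t *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }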

The rest of the paper is organized as follows. Section 2 briefly reviews related work. Section 3 describes the proposed hardware. Section 4 discusses the consistency property of the proposed hardware. Performance analysis is presented in Section 5. Discussions and future research directions are presented in Section 6.

2 Related Work

Multi-core Architectures. Multi-core processors are steadily gaining significance and popularity in computing [4]. A multi-core microprocessor combines two or more independent CPU cores into a single package. Each core is independently implemented with optimizations such as pipelining, superscalar execution, out-of-order execution, and multi-threading. Cores in a multi-core processor may share a single coherent cache at the highest on-chip level (e.g., L2 for the Intel Core 2) or may have separate caches (e.g., current AMD dual-core processors). The cores share the same interconnect to the rest of the system, primarily to the shared memory and I/O. The close proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip, as in traditional SMP systems [5].

Cache coherence is the key to parallel processing in multi-core architectures [6]. The coherency problem emerges when multiple processors access the shared memory through their private caches: one processor may see a new value in its cache while others still see the old value. Cache coherency is enforced through cache coherency protocols that either invalidate or update data in the caches [2]. The two most common types of protocols are snooping-based protocols and directory-based protocols.

While cache coherency allows multiple threads to access shared data, it is the synchronization mechanism that ensures the correct execution of concurrent programs. The most frequently used mechanism is the lock [3], which is typically implemented in hardware in the form of one or more atomic instructions such as test-and-set and load-locked/store-conditional. These instructions allow a single process or thread to test whether the lock is free and, if free, acquire it in a single atomic operation. Other synchronization mechanisms, such as barriers, can be implemented in software on top of locks [2].

Locks have overheads. A processor needs to allocate memory space for locks and spend CPU cycles to initialize, acquire, release, and destroy locks [2]. When one processor or thread attempts to acquire a lock held by another processor or thread, either busy-waiting or context-switching is needed [7]. When switching to another computing task is impossible or too expensive, busy-waiting has to be used, but it leads to wasted cycles. Furthermore, the waiting threads continuously check the status of the lock, consuming precious bus bandwidth and further lengthening lock acquisition time as the number of processors increases [2]. The proposed design constitutes an overhaul of the traditional locking mechanism: the proposed hardware automatically sequentializes complex operations on shared memory locations, thus eliminating thread stalls caused by synchronization.

Transactional Memory. Transactional memory is a concurrency control mechanism for controlling accesses to shared memory in concurrent computing. It functions as an alternative to lock-based synchronization and is typically implemented in a lock-free way.
A transaction in this context is a piece of code that executes a series of reads and writes to shared memory locations. These reads and writes logically occur at a single instant in time; intermediate states are not visible to other (successful) transactions. The idea of providing hardware support for transactions originated in [8]. Software-only transactional memory has recently been the focus of intense research, and support for practical implementations is growing [9]. In transactional memory, every thread completes its modifications to shared memory without regard for what other threads might be doing, recording every read and write that it makes in a log. After completing an entire transaction, a thread verifies that other threads have not concurrently made changes to the memory it accessed. This final operation, in which the changes of a transaction are validated and, if validation is successful, made permanent, is called a commit. A transaction may also abort at any time, causing all of its prior changes to be rolled back or undone. If a transaction cannot be committed due to conflicting changes, it is typically aborted and re-executed from the beginning until it succeeds [10]. The proposed design differs significantly from transactional memory in that it is based on hardware sequentialization rather than (software) transactions, which would require considerable resources for logging and possible rollbacks.
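For reference, the commit-or-retry cycle described above can be sketched as follows (tm_begin, tm_read, tm_write, and tm_commit are hypothetical STM calls standing in for a generic interface, not a real library API):

    /* Transactional retry loop: run speculatively while logging accesses,
       validate at commit time, and re-execute from scratch on conflict. */
    void tm_increment(long *x) {
        for (;;) {
            tm_begin();            /* start logging reads and writes */
            long v = tm_read(x);   /* logged read                    */
            tm_write(x, v + 1);    /* buffered write                 */
            if (tm_commit())       /* validate; publish on success   */
                break;
            /* conflict: changes rolled back, loop retries */
        }
    }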

3 Hardware Design of the Accumulative Execution Unit

The proposed design focuses on the set of shared variable accesses where the same associative mathematical operation is consecutively applied to a shared variable. Such operations are denoted as accumulative computations in this paper. Both the Σ and Π notations, Σ_{i=1}^{n} a_i and Π_{j=1}^{n} b_j, are examples of accumulative computations. Logical AND and OR operations are also accumulative computations. Queue insertions and extractions can also be considered generalized accumulative operations. Such accumulative computations represent a large portion of shared variable usage; example applications include shared counters, shared task queues, and process lists in multi-processor operating systems. It is expected that by supporting accumulative computations, the proposed hardware can improve the concurrency of a wide range of multi-threaded applications. The proposed hardware is named the Accumulative Execution Unit (AEU).

There are four categories of accumulative instructions, illustrated in Figure 1: read only, write only, compute only, and compute and read.

[Figure 1. The four categories of accumulative instructions: read only, write only, compute only, and compute and read. When executing these instructions, the memory is accessed only if necessary, i.e., when the AEU does not have the requested variable or when replacement is needed.]

A read only instruction is similar to a normal memory read except that the target address is a shared variable. Similarly, a write only instruction writes to a shared variable. To differentiate accumulative reads from normal reads, the instruction set architecture needs to be augmented. For notational purposes, an accumulative read only instruction is denoted loadx and a normal read instruction is denoted load. For write operations, the notations storex and store are used, respectively.

A compute only instruction is used when the issuing processor does not need the computation result. For example, an instruction may specify "add 1 to shared variable x," denoted addx x, 1. When executing this instruction, the AEU updates x but does not return the updated x to the processor. This instruction is different from the normal add instruction add x, 1 as in 80x86 assembly. Executing add x, 1 involves loading memory location x into the processor core, executing the addition inside the processor core, and finally writing the result back to x. The execution of addx x, 1, however, happens entirely outside the processor core and does not involve a write-back operation. The value of x is maintained by the AEU and is written back to the memory only when replacement occurs. (Details of replacement are discussed later in this section.)

A compute and read instruction performs a computation and returns the result to the processor. For example, an instruction may specify "add 1 to shared variable x and return the result in local register R1," denoted addr R1, x, 1. To ensure atomic execution, the AEU completes the entire instruction (both the computation and the read parts) before executing another instruction. Different from the compute-only instruction addx x, 1, the compute-and-read instruction addr R1, x, 1 needs to return the result to the processor core after the addition.
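Seen from software, the four categories could be exposed along the following lines (the C intrinsic names below are hypothetical; the underlying instructions are the loadx, storex, addx, and addr forms introduced above):

    /* Hypothetical C bindings for the four accumulative categories. */
    long aeu_loadx(volatile long *x);           /* read only        */
    void aeu_storex(volatile long *x, long v);  /* write only       */
    void aeu_addx(volatile long *x, long v);    /* compute only     */
    long aeu_addr(volatile long *x, long v);    /* compute and read */

    /* e.g., aeu_addx(&x, 1) encodes addx x, 1: the core posts the
       request and continues immediately, never acquiring exclusive
       ownership of x; r1 = aeu_addr(&x, 1) encodes addr R1, x, 1 and
       receives the updated value over the bus. */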
Figure 2 illustrates the proposed AEU design. When a processor core encounters an accumulative instruction, the instruction is issued onto the bus for the AEU to capture. By means of its FIFO instruction queue, the AEU retains the order in which the instructions appeared on the bus. Encoded in each instruction are the requested computation and the address of the shared variable. The memory address is used to identify the storage register holding the current value of the shared variable, or to allocate a storage register if this is a new shared variable. This storage register and, if present, the immediate operand encoded in the instruction are taken as inputs by one of the functional units for the execution of the instruction. The write unit is dedicated to the execution of write-only instructions; it does not need input from any storage register but instead writes to a target storage register.

To allow concurrency when executing accumulative instructions, reservation stations are used to buffer instructions for the functional units, similar to the Tomasulo design [1]. With reservation stations, if an instruction in the queue needs a result that is still being computed by a functional unit, the instruction can still be issued; it waits for the result at a reservation station instead of blocking the entire issuing process. Outputs from the functional units are written to the storage registers and, if necessary, forwarded to the reservation stations. Due to the use of reservation stations, instructions may be executed out of order. For shared accesses, however, the order of execution is extremely important to the correctness of the result.

[Figure 2. The proposed Accumulative Execution Unit (AEU) in an illustrative quad-core shared-memory architecture. The AEU comprises an instruction queue, a completion buffer (status, instruction ID, data for read op.), an output buffer (destination, data), storage registers (memory address, status, reference statistics, data), reservation stations, and functional units (a write unit, INT adders, FP adders, and other units).]

To maintain in-order instruction completion in the face of out-of-order execution, a FIFO completion buffer is introduced to track the execution status of all issued instructions. Whenever an instruction is issued to a reservation station, it is also dispatched to the FIFO completion buffer with an initial status of not-completed. The instruction status is updated whenever a functional unit completes an instruction. Once the status of the instruction at the bottom of the buffer (assuming insertion occurs at the top) changes to completed, the instruction is cleared out of the buffer and the completion buffer shifts down by one entry. Internally, the functional units execute instructions out of order, but if a processor core attempts to read the value of a shared variable, whether through a read only or a compute and read instruction, the read operation is not performed until the instruction reaches the bottom of the completion buffer, i.e., until all previous instructions have completed. Thus, from the processors' point of view, instructions complete in program order, although the actual execution is out of order (for improved concurrency). Note that the data for a read operation may become available before the read reaches the bottom of the completion buffer, in which case the data is saved in the "data for read op." field of the completion buffer. When replacement is considered, instruction issue, execution, and completion still follow the same procedure, but a few more details are involved; these are presented later in this section.
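The in-order retirement rule of the completion buffer can be modeled with a short C sketch (entry layout, buffer size, and the bus callback are illustrative, not the paper's specification; entry insertion at the top of the buffer is omitted):

    #include <stdbool.h>

    #define CB_SIZE 16

    typedef struct {
        int  inst_id;
        bool completed;   /* set by a functional unit, possibly out of order */
        bool has_read;    /* instruction must return data to a core          */
        long read_data;   /* the "data for read op." field                   */
    } cb_entry_t;

    static cb_entry_t cb[CB_SIZE];
    static int cb_bottom = 0, cb_count = 0;

    extern void send_to_bus(int inst_id, long data);  /* hypothetical */

    /* Retire strictly from the bottom: even if younger entries finished
       first, their read results are released only after every older
       instruction has drained, so completion appears in program order. */
    static void cb_retire(void) {
        while (cb_count > 0 && cb[cb_bottom].completed) {
            if (cb[cb_bottom].has_read)
                send_to_bus(cb[cb_bottom].inst_id, cb[cb_bottom].read_data);
            cb_bottom = (cb_bottom + 1) % CB_SIZE;
            cb_count--;
        }
    }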

Sequentialization. As illustrated in Figure 2, the AEU has a single interface snooping the bus for accumulative instructions. Because the bus is a shared medium that can accommodate only one request at a time, accumulative instructions are automatically sequentialized in the order in which they appear on the bus. The order is then preserved by the AEU as it tracks instruction completion status and only allows the processor cores to read shared variables in the order in which the instructions were issued. More importantly, this access order is obtained without using locks to control thread advancement at any of the processor cores.

Support for Read Operations. A read operation can be either a read only instruction or the read component of a compute and read instruction. In either case, the read operation is executed (i.e., the data is sent to the processor) when it reaches the bottom of the completion buffer. Two scenarios may occur for read operations.

1. The shared variable is not present in a storage register. This may occur in two cases: (a) this is the first ever access to the shared variable, so it has never been retrieved by the AEU before, or (b) the shared variable was in the AEU but was replaced; it was written back to the memory and hence is absent from the AEU. For read only instructions, upon detecting the absence of the requested variable, the AEU forwards the read request to the memory via the bus. The forwarded request has the same format as a normal memory read request from a processor. The response from the memory is captured by both the requesting processor and the AEU. For the processor, this completes the read. For the AEU, capturing the response triggers the allocation of a storage register for the requested variable, so that future accesses to this variable can be handled within the AEU. For compute and read instructions, the AEU first sends a read request to the memory to load the shared variable into the AEU, replacing an existing storage register if necessary. The AEU then performs the requested computation and sends the result to the requesting processor core via the bus.

2. The shared variable is present in the AEU. In this scenario, the read operation is performed when it reaches the bottom of the completion buffer, i.e., after all preceding instructions have been completed and cleared out of the buffer. For a read only instruction, the value in the "data for read op." field is sent onto the bus; for a compute and read instruction, the compute component may be executed early, but the result is sent to the processor only after the instruction reaches the bottom of the completion buffer.

Replacement. Replacement is initiated during the issue stage when an instruction needs to access a new shared variable but all the storage registers have been allocated to other shared variables. Since the new shared variable is not in any of the storage registers, the AEU sends a memory request to retrieve it from the memory. At the same time, the issue process stalls until a storage register becomes available. First, the replacement policy selects a storage register, which is of course currently allocated to another shared variable. Depending on whether there are pending instructions at the reservation stations that need to update the selected storage register, its contents are sent to the output buffer either immediately or after those instructions complete. As soon as the selected storage register is written to the output buffer, the issue process resumes, sending the stalled instruction to the completion buffer as well as to a proper reservation station. In the meantime, the memory address field of the selected storage register is changed to the new shared variable. Instruction execution starts once the memory responds with the requested data and a functional unit is available.

To select a storage register for replacement, there is a range of policies to choose from, as in the case of cache replacement: Least Recently Used (LRU), Random, First-In First-Out (FIFO), and Pseudo-LRU. In addition to these choices, the Least Outstanding Instructions (LOI) policy is designed around the buffering feature of the AEU. LOI selects the storage register that has the fewest pending instructions, which minimizes the time required to clear the storage register so that new instructions can be dispatched as early as possible.
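Victim selection under LOI reduces to a minimum search over the storage registers (a minimal sketch; the descriptor fields are illustrative):

    /* Storage-register descriptor (fields illustrative). "pending" counts
       instructions waiting at reservation stations that target this
       register; LOI evicts the register with the fewest of them. */
    typedef struct {
        unsigned long addr;  /* address of the cached shared variable */
        int pending;         /* outstanding instructions targeting it */
    } storage_reg_t;

    static int select_victim_loi(const storage_reg_t *regs, int n) {
        int victim = 0;
        for (int i = 1; i < n; i++)
            if (regs[i].pending < regs[victim].pending)
                victim = i;  /* fewest outstanding instructions so far */
        return victim;
    }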
At the time of this paper's submission, a simulator is still being developed to evaluate design parameters. A quantitative study of the replacement policy will be reported in a subsequent paper, with consideration of throughput, response time, and the avoidance of oscillation (where a particular shared variable is replaced again and again).

4 Consistency Property of the AEU

The AEU preserves sequential consistency for accesses to shared variables. Formally, sequential consistency is the property of a multi-processor system that, for the execution of any program, an equivalent result can be obtained by hypothetically sequentializing all the shared variable accesses, where, in the hypothetical order, accesses by any particular processor retain the order in which they were issued in the execution [2]. In the AEU design, accesses to each shared variable take the form of accumulative computations, which are physically sequentialized as they are issued to the AEU via the bus. This sequential order is further retained by the instruction queue, and instruction completion is monitored by the completion buffer. Consequently, accesses to all shared variables are completed in the order in which they were issued onto the bus (though the execution may be performed out of order for performance reasons). Such an order satisfies the requirement of sequential consistency, and thus the AEU design preserves sequential consistency.
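Stated symbolically (a standard rendering of the definition quoted above; the notation is ours, not the paper's), an execution is sequentially consistent if

    \exists \text{ a total order } <_S \text{ on all shared accesses such that: }
    (1)\; a <_{po} b \Rightarrow a <_S b \text{ for each processor's program order } <_{po}, \text{ and }
    (2)\; \text{every read of } x \text{ returns the value of the } <_S\text{-latest write to } x.

In the AEU, <_S is realized physically: it is the order in which accumulative instructions appear on the shared bus.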

5 Performance Evaluation

In this section, the performance of the proposed design is compared against the conventional lock-based mechanism using the application of a shared task queue, as illustrated in Figure 3.

[Figure 3. The sample application of a shared task queue. Tasks are generated on the fly, inserted into the queue (enqueue), then extracted for execution (dequeue). The task queue is shared, as are the queue head and tail pointers.]

To implement a task queue using the conventional lock-based mechanism, a lock is needed to protect the queue head pointer and another lock is needed for the queue tail pointer. Task extraction can be performed only after a successful acquisition of the lock, after which the processor advances the head pointer and releases the lock. Similarly, task insertion also needs to be protected by locks. The overhead of locking lies in the acquisition and release process. For example, with test-and-set locks, lock acquisition generates excessive bus traffic: each time a test-and-set instruction is executed to check whether the lock is free, it writes to the cache block that contains the lock variable, which results in a bus transaction invalidating the block cached by another processor (which wrote to the block when executing its own test-and-set). Therefore, processors generate bus transactions repeatedly while waiting for the lock to become free. Such contention consumes bus bandwidth and slows down lock acquisition as the number of processors increases. Previous studies have shown that the acquisition time of test-and-set locks increases almost linearly with the number of processors. Other improved lock implementations (such as the ticket lock and the load-locked/store-conditional lock) do provide shorter acquisition times, but still exhibit a close-to-linear increase as the number of processors grows [2].

Compared with lock-based schemes, the proposed AEU design can significantly reduce the enqueue and dequeue overhead, and can possibly eliminate it if the processor supports out-of-order execution. The pseudo code is listed below:

    while (head != tail) {
        addr R1, head, 1;   // increase head atomically
        execute task;
        task = dequeue R1;  // get task for next iteration
    }

List 1: Pseudo code for accessing a shared task queue with the AEU. addr is an accumulative instruction, head and tail are the shared pointers to the queue, and R1 is a register inside the processor core.

At the beginning of each iteration is an accumulative addr instruction, which atomically advances the queue head pointer to the next queue entry, thus preparing the dequeue operation for the next iteration. The processor then proceeds with the execution of the current task. Upon completion of the task, the processor loads the task for the next iteration using the task id that was just updated by the addr instruction. The code above generates only two bus transactions each time the queue head pointer is updated: first an addr request to the AEU and later a response from the AEU, which is the minimum number required to update a shared variable. Processors do not compete for access to the shared queue head pointer, as the AEU sequentializes all the requests and executes each of them atomically.
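For contrast with List 1, a lock-based version of the same dequeue loop might look as follows (a POSIX-threads sketch; the queue layout and helper names are illustrative, not from the paper):

    #include <pthread.h>

    #define QSIZE 1024

    extern void execute_task(int task);   /* hypothetical task runner */

    static int queue[QSIZE];
    static int head, tail;                /* shared queue pointers */
    static pthread_mutex_t head_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Every iteration serializes on head_lock, and under contention each
       acquisition costs the bus traffic discussed above. */
    static void worker(void) {
        for (;;) {
            pthread_mutex_lock(&head_lock);
            if (head == tail) {                /* queue drained */
                pthread_mutex_unlock(&head_lock);
                break;
            }
            int task = queue[head % QSIZE];
            head++;                            /* advance shared head */
            pthread_mutex_unlock(&head_lock);
            execute_task(task);                /* work outside the lock */
        }
    }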
If the processor supports out-of-order execution, the execution of the current task can start without waiting for the first addr instruction to complete. As long as the AEU sends back the updated head pointer before the completion of the current task, the dequeue operation at the bottom of each iteration can start without any delay. This way, the overhead of accessing the shared queue head pointer is completely hidden behind the execution of the task. If the processor does not support out-of-order execution, the overhead of dequeuing is the cost of the two aforementioned bus transactions, which is still significantly lower than the lock acquisition time. In the proposed method, software pipelining is used to execute a task while waiting for the AEU's response. This technique cannot be used to effectively reduce the overhead in lock-based schemes: whenever a lock is used, whether at the beginning or the end of an iteration, multiple processors will compete for the lock, resulting in multiple bus transactions and a certain amount of acquisition time, which unfortunately also increases with the number of processors.
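The out-of-order hiding condition above can be stated compactly (the symbols are ours, not the paper's): with t_resp the AEU round-trip latency of the addr request and t_task the execution time of the current task, the dequeue overhead is fully hidden whenever

    t_{\mathrm{resp}} \le t_{\mathrm{task}},

and otherwise only the difference t_resp - t_task is exposed per iteration.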

In summary, the proposed AEU design outperforms conventional lock-based synchronization methods through its capability of eliminating synchronization overheads. Note that this performance advantage is obtained for the class of applications that can be implemented using accumulative instructions; the identification of such applications will be part of future research.

6 Discussions

This paper presents the overall logical design that enables the functionality of the AEU. It has been demonstrated that the AEU has the potential to overhaul the synchronization mechanism in multi-core architectures. As part of an on-going research effort to improve concurrency in multi-core architectures, the presented work is by no means the conclusion of the AEU design. To fully explore the potential of the AEU, investigations are being conducted in the following three directions.

First, the design details and trade-offs of the hardware need to be investigated. A cycle-accurate multi-core architecture simulator is currently being developed to support the AEU design. The simulator will provide quantitative insight into the design and thus help answer a wide range of design questions, for example: What is the optimal number of storage registers? Is the optimal replacement policy application specific? Compared with a lock-based scheme, what is the exact reduction in synchronization overhead? Because the answers to these questions are generally application dependent, the gcc compiler is being ported to support the AEU design so that benchmark programs can be used in the investigation.

Second, the presented design needs to be extended to support more relaxed consistency models. Instead of implementing a single sequential order for all shared accesses, the AEU can support partial orders, or individual orders for each shared variable. Relaxed consistency models are in general more difficult to program but typically lead to faster execution. These variations need to be supported so that further design trade-offs can be explored.

Third, algorithmic studies need to be conducted to identify the class of accumulative instructions that can be supported by the AEU, and to develop new multi-threaded algorithms that accelerate a wide range of applications. The example of the shared task queue demonstrates the advantage of the AEU design over lock-based schemes; more complicated problems need to be studied to substantiate the general applicability of the AEU design. Examples include graph problems (spanning tree, shortest path, etc.) and achieving consensus among multiple threads (important for determining the termination of multi-threaded algorithms). These problems have been studied for existing architectures; the proposed investigation will focus on the development of new algorithms that avoid or reduce synchronization overheads with the help of the AEU.

References

[1] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition. Morgan Kaufmann Publishers, 2007.

[2] D. Culler, J. P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1999.

[3] A. S. Tanenbaum, Modern Operating Systems, 2nd Edition. Prentice Hall, 2001.

[4] W. Wolf, "The future of multiprocessor systems-on-chips," in DAC '04: Proceedings of the 41st Annual Conference on Design Automation. ACM Press, 2004.
[5] L. Spracklen and S. G. Abraham, "Chip multithreading: Opportunities and challenges," in HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture. IEEE Computer Society, 2005.

[6] P. Stenstrom, "A survey of cache coherence schemes for multiprocessors," Computer, vol. 23, no. 6, 1990.

[7] B.-H. Lim and A. Agarwal, "Waiting algorithms for synchronization in large-scale multiprocessors," ACM Transactions on Computer Systems, vol. 11, no. 3, 1993.

[8] T. Knight, "An architecture for mostly functional languages," in LFP '86: Proceedings of the 1986 ACM Conference on LISP and Functional Programming. ACM Press, 1986.

[9] N. Shavit and D. Touitou, "Software transactional memory," Distributed Computing, vol. 10, no. 2, 1997.

[10] C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie, "Unbounded transactional memory," IEEE Micro, vol. 26, no. 1, 2006.


More information

SF-LRU Cache Replacement Algorithm

SF-LRU Cache Replacement Algorithm SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information

Parallel Computer Architecture and Programming Written Assignment 3

Parallel Computer Architecture and Programming Written Assignment 3 Parallel Computer Architecture and Programming Written Assignment 3 50 points total. Due Monday, July 17 at the start of class. Problem 1: Message Passing (6 pts) A. (3 pts) You and your friend liked the

More information

A More Sophisticated Snooping-Based Multi-Processor

A More Sophisticated Snooping-Based Multi-Processor Lecture 16: A More Sophisticated Snooping-Based Multi-Processor Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2014 Tunes The Projects Handsome Boy Modeling School (So... How

More information

Online Course Evaluation. What we will do in the last week?

Online Course Evaluation. What we will do in the last week? Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O 6.823, L21--1 Cache Coherence Protocols: Implementation Issues on SMP s Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Coherence Issue in I/O 6.823, L21--2 Processor Processor

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based

More information

REAL-TIME MULTITASKING KERNEL FOR IBM-BASED MICROCOMPUTERS

REAL-TIME MULTITASKING KERNEL FOR IBM-BASED MICROCOMPUTERS Malaysian Journal of Computer Science, Vol. 9 No. 1, June 1996, pp. 12-17 REAL-TIME MULTITASKING KERNEL FOR IBM-BASED MICROCOMPUTERS Mohammed Samaka School of Computer Science Universiti Sains Malaysia

More information

Distributed Shared Memory and Memory Consistency Models

Distributed Shared Memory and Memory Consistency Models Lectures on distributed systems Distributed Shared Memory and Memory Consistency Models Paul Krzyzanowski Introduction With conventional SMP systems, multiple processors execute instructions in a single

More information

Hardware Accelerators

Hardware Accelerators Hardware Accelerators José Costa Software for Embedded Systems Departamento de Engenharia Informática (DEI) Instituto Superior Técnico 2014-04-08 José Costa (DEI/IST) Hardware Accelerators 1 Outline Hardware

More information

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2 Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time

More information