Using Proxies to Reduce Controller Contention in Large Shared-Memory Multiprocessors

Andrew J. Bennett, Paul H. J. Kelly, Jacob G. Refstrup, Sarah A. M. Talbot
Department of Computing, Imperial College of Science, Technology and Medicine, London SW7 2BZ

Abstract. Some shared-memory applications have execution times linear in the number of processors due to unfortunate allocation of the home and ownership of cache lines. We present a modified coherency protocol which avoids this effect. Read requests are routed via "proxies", randomly-selected intermediate nodes. We present results from execution-driven simulations of a cc-NUMA architecture which show that proxying can yield a large speedup in cases where read contention is extreme, while only causing small slowdowns in other benchmarks. We investigate how many proxies should be used and what effect the scheme has on traffic levels and queuing of requests at node controllers.

1 Introduction

Coherent-cache shared-memory multiprocessors can suffer from disastrous contention effects, especially in large configurations [2]. In some cases, an apparently-parallel application can show execution time proportional to the number of processors used. In this paper we study one potential cause for such behaviour, establish the scale of the effect, and investigate remedies.

Each processor's memory and cache is managed by a "node controller". In addition to local memory references, the controller must handle requests arriving via the network from other processors. These requests concern cache lines owned by this cache (reads, ownership requests), lines of which a copy is held in this cache (invalidations and replacements), and lines whose "home" is this node, i.e. this node holds directory information about the line. It is obviously important that controllers can handle requests at a high rate. This is exacerbated in large configurations where unfortunate ownership migration or home allocation can lead to concentrations of requests at particular nodes.

An interesting alternative is to distribute the workload to other node controllers, essentially using them to act as "proxies" for read requests. When a processor makes a read request, instead of going directly to the cache line's home, we route it first to another node. If the proxy node has the line, it replies directly. If not, it requests the value from the home itself, allocates it in its own cache, and replies. We present results from simulation experiments which evaluate this idea.
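As a minimal sketch of this routing decision (illustrative only: the node count, the proxy-set size, the way the proxy set is derived from the line address, and the page_is_proxied predicate are assumptions, not details taken from the protocol described here), a client might choose the target of a read request as follows:

```python
import random

NODES = 64           # assumed machine size
PROXY_SET_SIZE = 2   # assumed number of proxies per line
LINE_BYTES = 64      # cache-line size used throughout the paper

def proxy_set(line):
    """Nodes acting as proxies for this line, derived from the line address
    so that every requester agrees on the same set (an assumption here)."""
    base = line % NODES
    return [(base + i) % NODES for i in range(PROXY_SET_SIZE)]

def read_request_target(addr, home_node, page_is_proxied):
    """Pick the node a read request is sent to: the home node under the
    basic protocol, or a randomly chosen proxy when the page is proxied."""
    if not page_is_proxied(addr):
        return home_node
    return random.choice(proxy_set(addr // LINE_BYTES))

# Example: route a read for address 0x1040 whose home is node 5,
# with proxying enabled for every page.
print(read_request_target(0x1040, 5, lambda addr: True))
```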

2 Contention in Shared-Memory Multiprocessors

Each node consists of a processor with an integral first-level cache (FLC), a large second-level cache (SLC), some DRAM and a node controller. The SLC, DRAM and controller are interconnected by two decoupled buses. The controller sends messages to, and receives messages from, the network and the processor. We model a cc-NUMA architecture with an invalidation-based coherency protocol which maintains sequential consistency. The identities of nodes which have cached a particular line are maintained using distributed singly-linked lists, using a protocol similar to that outlined in [3]. Each cache line has a "home" node associated with it (at the granularity of a page) which:

- Either holds a valid copy of the line (in SLC, DRAM, or both), or knows the identity of a node which does,
- Has pre-allocated space in DRAM to which the final replacement of the line from cache can take place, and
- Holds directory information for the line (head and state of the sharing list) in DRAM.

When a memory reference cannot be completed on a client node, a request is sent to the home node. When the read request is serviced at the home node, the controller performs a lookup in DRAM to determine the state of the line. Assuming that the state is exclusive, the SLC bus is acquired and a lookup is done in the SLC. If the data is not present, the SLC bus is released, the data is read from DRAM, directory information is updated in DRAM, the MEM bus is released, and the reply message is dispatched. Note that the processor is prevented from accessing the SLC for part of the transaction. In addition, whilst the request is being serviced, other requests may arrive which cannot be serviced until the controller has finished this transaction.

Contention of this form can occur for either homes or owners: ownership of many cache lines with different homes may become concentrated in a single cache because of the application's write behaviour. Conversely, directory traffic for a large set of lines, whose ownership is dispersed, may be concentrated in a single home node due to home allocation.

2.1 The Impact of Node Controller Contention

The severity of controller contention is both application and architecture dependent. Some contention is inevitable and will result in the latency of transactions being elongated. The communications access pattern is non-uniform primarily because of the way homes and ownership are allocated. It is the non-uniform distribution of requests made by the application which causes the variation in contention over the execution time of the program. The characteristics of the architecture determine how effectively the non-uniform distribution of requests can be resolved. If the network is relatively fast and controller occupancy high, requests can arrive at a controller at such a high rate that contention will occur.
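A rough illustration of how long the controller is occupied by a single read serviced at the home (a sketch only, using the latencies later listed in Table 1; the exact ordering of bus operations and the cost of the directory update are approximated, so this is not the simulator's accounting):

```python
# Latencies in 10 ns cycles, taken from Table 1 (Section 4).
LAT = {
    "acq_slc_bus": 2, "rel_slc_bus": 1, "slc_lookup": 6, "slc_line": 18,
    "acq_mem_bus": 3, "rel_mem_bus": 2, "dram_lookup": 20, "dram_line": 24,
    "send_msg": 5,
}

def home_read_occupancy(data_in_slc):
    """Approximate cycles the node controller is busy servicing one read at
    the home, following the sequence described above (directory-update cost
    folded into the DRAM accesses)."""
    cycles = LAT["acq_mem_bus"] + LAT["dram_lookup"]     # directory state lookup
    cycles += LAT["acq_slc_bus"] + LAT["slc_lookup"]     # probe the SLC
    if data_in_slc:
        cycles += LAT["slc_line"]                        # line supplied by the SLC
    else:
        cycles += LAT["dram_line"]                       # fall back to DRAM
    cycles += LAT["rel_slc_bus"] + LAT["rel_mem_bus"] + LAT["send_msg"]
    return cycles

print(home_read_occupancy(True), home_read_occupancy(False))   # prints: 57 63
```

While those tens of cycles elapse, further requests queuing at the same controller cannot be serviced, which is the effect proxying is designed to relieve.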

3 Proxies

Proxies form the basis of a technique designed to reduce the queuing of requests at controllers. Proxies also achieve combining: if multiple requests for the same cache line are sent to the same proxy, only the first requires a request to be made to the home. When the data is supplied to the proxy, it sends it to all the clients which are waiting for it. This is done by using proxy cache entries to form "pending chains", distributed lists of the nodes which have requested a particular line. When the proxy receives the line, it is added to its SLC and sent down the pending list of clients which have requested the data.

3.1 Selective Use of Proxying

The overhead of resolving read requests via proxies can be two extra messages, and therefore, to avoid unnecessary overheads, only data structures which are contended for should be proxied. We control this on a page-by-page basis, so that proxying is enabled only for selected memory regions. We determine which data structures should be proxied by analysing traces from simulations using a trace analysis tool (see our companion paper in these proceedings) to determine which lines and pages are widely shared, and may therefore benefit from proxying. To evaluate the cost of proxying in circumstances when it is not beneficial, we have simulated runs in which all shared data is proxied.

3.2 Selecting the Proxy and the Proxy Set

The requesting processor can select any member of the proxy set, i.e. the nodes which act as proxies for the requested line. In a clustered system it would make sense to allocate a proxy in each cluster. Alternatively, in some network designs it may be possible to choose a proxy to avoid congestion. Perhaps the most interesting possibility is to choose the proxy at random each time. This should reduce network contention and balance load evenly between the elements of the proxy set -- this is the policy we simulate here. In general there is a trade-off in the number of proxies to use: too few proxies and there may still be contention, this time for proxies. Too many, and there will be little combining effect: requests from the proxies will cause contention. Although some benchmarks may benefit from a large number of proxies, in less extreme examples a smaller proxy set should give better results because more combining will happen.
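A minimal sketch of the combining behaviour at one proxy node (hypothetical data structures, not the authors' controller implementation; the send_to_home and reply callbacks stand in for the simulated message system):

```python
class ProxyController:
    """Read combining at a proxy: the first request for a line goes to the
    home, later requests for the same line join a pending chain and are
    answered when the data arrives."""

    def __init__(self, node_id, send_to_home):
        self.node_id = node_id
        self.send_to_home = send_to_home   # callback: request a line from its home
        self.cache = {}                    # line -> data the proxy already holds
        self.pending = {}                  # line -> clients waiting for that line

    def read_request(self, line, client, reply):
        if line in self.cache:             # proxy hit: answer directly
            reply(client, self.cache[line])
        elif line in self.pending:         # proxy hit: combine with an earlier request
            self.pending[line].append(client)
        else:                              # proxy miss: one request to the home
            self.pending[line] = [client]
            self.send_to_home(line, self.node_id)

    def data_from_home(self, line, data, reply):
        self.cache[line] = data            # allocate the line in the proxy's cache
        for client in self.pending.pop(line, []):
            reply(client, data)            # forward down the pending chain
```

In the protocol itself the pending chain is held in the proxy's cache entries rather than in a separate table, and eviction or invalidation of proxy copies introduces further cases not shown in this sketch.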

3.3 Potential Costs and Benefits

Among the benefits we should look for are:

- Queuing to gain copies of data served by a single controller, due to unfortunate data migration: contention from this source should be reduced, since the controller managing the data need only service requests from members of the proxy set, while other requests will be handled by proxies.
- Queuing for directory information served by a single node controller, due to unfortunate allocation of "homes": contention from this source should also be reduced.
- Blocking in the interconnection network, due to unfortunate communications patterns: this should be reduced since, although the overall traffic level is somewhat higher, the distribution should be more uniform.

The potential costs are:

- Every load (for addresses subject to proxying) must now go via a proxy, whereas with the basic protocol no indirection would be involved. For example, a simple client-home-client round trip can now take five messages instead of three.
- Cache pollution: allocating space in the cache for proxying may displace another line, and lead to a later cache miss. Extra invalidations are required also.
- Hardware complexity: controllers must be able to deal with multiple outstanding requests (as required for multithreading) and be able to represent the pending chains required for combining.

Our goal in this paper is to study the costs and benefits of the proxy scheme, and to try to quantify the above effects.

4 Simulated Architecture and Benchmarks

The simulated system consists of a set of nodes interconnected by a crossbar network. First-level caches are write-through, direct-mapped, and 4 KB in size. Instruction accesses are assumed to be dealt with by a separate, perfect memory system. Second-level caches are write-back, direct-mapped, and 1 MB in size. The line size is 64 bytes throughout. The clock speed of the system (other than processors) is 100 MHz. Latencies for various operations, expressed in terms of 10 ns clock cycles, are shown in Table 1. We have used a network bandwidth of 160 MB/s. Messages consist of a header (containing a type, and source, destination, requester and home identifiers) and possibly a 64-byte payload.

Table 1. Latencies of the Most Important Node Actions (100 MHz clock)

  Operation               Time (cycles)
  Acquire SLC bus               2
  Release SLC bus               1
  SLC lookup                    6
  SLC line access              18
  Acquire MEM bus               3
  Release MEM bus               2
  DRAM lookup                  20
  DRAM line access             24
  Initiate message send         5

ge is a simple Gaussian elimination program, similar to that used by Bianchini and LeBlanc in their study of proxies [1]. At the end of each iteration a single processor updates a row of the matrix which is designated as the pivot row. Following a barrier, all processors read this row and use it to update a set of rows which they maintain. It is immediately clear that this will cause contention since the cache lines holding the pivot row will all reside in a single cache. The entire 256x256 matrix is annotated so that proxying is used.

Trace analysis did not indicate that any particular data structure used by barnes [4] would cause contention problems, and as a result all shared data has been marked for proxying. Three iterations with particles are used.

fmm [4] was run for a two-cluster Plummer distribution with cost zones partitioning, and the precision set at 1e-6. Trace results of a 32-node system showed that queues of length 31 were occurring for access to elements of the f array which forms part of the G_Memory data structure. The queuing occurs immediately after a barrier, when all the processors read all the elements of the f array. This was the most significant case of read contention detected in fmm, and it scales with the number of processors. It is independent of the number of particles, so simulations were run for a small problem size of 4096 particles and three iterations to reduce simulation time.
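As a back-of-the-envelope illustration of the ge read contention described above (assuming 8-byte matrix elements, which is not stated in the paper), the number of read requests converging on the node that owns the pivot row after each barrier grows linearly with the processor count:

```python
def pivot_row_requests(processors=64, cols=256, elem_bytes=8, line_bytes=64):
    """Reads of the pivot row that arrive at its owning node per iteration:
    every other processor fetches every cache line of the row."""
    lines_in_row = (cols * elem_bytes) // line_bytes
    return (processors - 1) * lines_in_row

print(pivot_row_requests())   # 63 processors x 32 lines = 2016 requests
```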

5 Simulation Results

Our simulations have shown that contention only becomes an important issue when more than a few tens of nodes are used. For this reason the results presented below are from simulations of 64-node machines. The graph showing the variation in execution time with the number of proxies for ge (Figure 1) indicates that using a single proxy results in a reduction in execution time of 35%, but higher numbers of proxies do not produce further reductions. Execution time worsens slightly when more than five proxies are used. Further instrumentation has been used to explain this in the following sections.

5.1 Queuing at Controllers

The total number of cycles for which messages are waiting to be serviced is counted during simulation, and used to determine the relative buffer delay, i.e. the buffer delay observed for a particular number of proxies divided by that observed without proxying. For ge, buffer delay is more than halved when proxying is used. This is a result of the write behaviour of the program, which tends to concentrate ownership of individual lines on particular nodes. When proxying is used, messages are directed to nodes which do not act as home for any shared data, resulting in a more uniform message distribution and reduced queuing time. Interestingly, increasing the number of proxies beyond 1 offers little benefit.
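In symbols (a restatement of the definition above, not notation used in the paper), with W(p) the total cycles messages spend queued at controllers when p proxies are used per location:

```latex
\mathrm{relative\ buffer\ delay}(p) \;=\; \frac{W(p)}{W(0)},
\qquad W(0) = \text{total queuing cycles without proxying.}
```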

[Figure 1. Relative measurements as a function of the number of proxies: four panels plotting relative execution time, relative buffer delay, relative number of messages sent, and proxy hit rate against the number of proxies, for the barnes, fmm and ge benchmarks.]

5.2 Variation in Message Traffic

Proxying is designed to reduce queuing, but does so at the expense of increased network traffic. Counts of the total number of messages sent during simulations were used to produce the graphs in which the variation in the relative number of messages is plotted against the number of proxies. In the case of ge, the overhead is clearly visible: 27% more messages are required when proxies are used. However, since the distribution of these messages is more uniform, the resulting contention is less severe. Note that little variation in message counts is observed as the number of proxies is increased beyond one. Using proxies results in higher network traffic because read requests are routed via the proxy (rather than directly to the home), and sharing lists are longer, resulting in more messages being required for invalidations and replacements.

5.3 Proxy Hit Rate

This graph provides a measure of the effectiveness of proxies at moving network traffic away from (possibly congested) home nodes. We define a proxy hit to be a read request served by a proxy which either returns data directly, or adds the requester to a pending list. A proxy miss requires the data to be fetched from the home node.

The graph for ge shows a constant proxy hit rate because every node accesses each line of the pivot row, and therefore a high rate of combining is achieved.

5.4 fmm and barnes

fmm shows only a minor change in overall execution time when proxies are used, although message queuing is reduced by about 30%. Together, the graphs showing the variation in network traffic and proxy hit rate show that proxying is effective at keeping messages away from congested home nodes. However, the data structure marked in this case is small and accesses to it form a relatively small part of overall execution time.

barnes is an interesting test of proxying since all its shared data structures have been marked for proxying. Despite the overhead that this could introduce, it has a relatively minor adverse effect on execution time. Network traffic is again increased and buffer delay reduced, resulting in only a small increase in load and store delays.

The three benchmark programs exhibit a range of behaviours. Although only ge shows a substantial improvement in execution time, the results have demonstrated that more random message delivery can result in reduced buffering of messages, despite the inevitable increase in network traffic. fmm showed little change in execution time because the marked data structure is only used in a small part of the program. The performance of barnes was not adversely affected, despite all data being proxied. For all programs, one or two proxies tend to realise most of the advantage that can be achieved. This is a positive result, since using more proxies will result in increased cache pollution, invalidations and replacements.

6 Related Work

Holt et al. investigated the effect of varying network latency and occupancy on the performance of similar benchmarks, and concluded that controller occupancy has a large impact on performance [2]. Increasing problem size to maintain parallel efficiency was shown to be ineffective in many cases since it leads to unreasonably large problem sizes. Our results also demonstrate that occupancy is of critical importance. In addition, we have found that proxying can be highly effective at alleviating contention in programs, such as ge, which suffer particularly badly.

Our proxy scheme is similar to Bianchini and LeBlanc's "eager combining" idea [1]. The important difference is that in eager combining, when a dirty value is first read, the value is broadcast to all proxies. In our protocol, only the requesting proxy gets the value. Although we have not simulated eager combining, we appear to get most, if not all, of the advantage of eager combining on ge. The difference matters since eager combining would incur a large overhead on other benchmarks. We have shown that our protocol leads to only modest overheads.

7 Conclusions

This paper has presented the proxy technique, discussed the design and implementation of a proxying cache coherence protocol, and presented some preliminary simulation results. We have shown that proxies benefit some applications immensely, as expected, while for other benchmarks they do not lead to a substantial slowdown.

We believe that the reason proxies do not substantially increase execution time, despite increasing the overall number of messages needed, lies in the effect of proxying on the distribution of network traffic. This needs further analysis, but it would appear that performance is influenced more by anomalous contention effects than by the overall average network traffic level.

Much if not all of the benefit of proxying is realised with just one proxy per location, although there is some evidence that two proxies may be justified for some applications. We believe the reason for this is that severe read contention must involve data structures of considerable size, and the proxied copies are then spread uniformly across the machine.

In summary, we have shown that proxies can be added to a cc-NUMA cache coherence protocol fairly easily, that proxying can improve some applications' performance (a 50% speedup for ge) while appearing to risk only a small slowdown for other applications, and that one proxy per location is likely to be enough.

Acknowledgements. This work was funded by the U.K. Engineering and Physical Sciences Research Council through project GR/J (Combining Randomisation and Mixed-policy Caching for Bounded-contention Shared Memory). Enormous thanks are due to Ashley Saulsbury for allowing us to use his simulator.

References

1. Ricardo Bianchini and Thomas J. LeBlanc. Eager combining: a coherency protocol for increasing effective network and memory bandwidth in shared-memory multiprocessors. In 6th IEEE Symposium on Parallel and Distributed Processing, Dallas, October, pages 204-213.
2. Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John Hennessy. The effects of latency, occupancy and bandwidth in distributed shared memory multiprocessors. Technical Report CSL-TR-660, Computer Systems Laboratory, Stanford University, January.
3. Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, Michael Parkin, Bill Radke, and Sanjay Vishin. The S3.mp scalable shared memory multiprocessor. In International Conference on Parallel Processing, Pennsylvania State University, August, volume 1, pages 1-10.
4. Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In 22nd Annual International Symposium on Computer Architecture, in Computer Architecture News, pages 24-36, June.

This article was processed using the LaTeX macro package with LLNCS style.
