Cache Simulation Based on Runtime Instrumentation for OpenMP Applications


Jie Tao (Institut für Rechnerentwurf und Fehlertoleranz, Universität Karlsruhe, Karlsruhe, Germany)
Josef Weidendorfer (Lehrstuhl für Rechnertechnik und Rechnerorganisation, Institut für Informatik, Technische Universität München, Garching, Germany)

Abstract

To enable optimizations of the memory access behavior of high performance applications, cache monitoring is a crucial process. Simulation of cache hardware is needed, on the one hand, to allow research on cache architectures that do not yet exist and, on the other hand, to get insight into metrics not measured by the hardware counters of existing processors. One focus of EP-Cache, a project investigating efficient programming on cache architectures, is the development of cache monitoring hardware that gives precise information about the cache behavior of OpenMP applications on SMP machines. As this hardware is still in an early state of development, gaining experience with the monitoring software infrastructure to be built for use in real applications requires cache simulation. Two techniques are used to drive the cache simulation engine: instrumentation integrated at the source level, and instrumentation integrated at runtime by rewriting code on the fly. In this paper, we mainly describe the second technique together with a sample code, showing the advantages and feasibility of this approach. In addition, to allow a comparison, we briefly describe our experience with the source instrumentation technique.

1 Introduction

EP-Cache is a project aiming at providing a hybrid monitoring infrastructure, consisting of hardware and software components, for optimizing the cache access behavior of realistic OpenMP applications. This layered infrastructure comprises performance tools at the top level, hardware monitors at the bottom, and multilevel interfaces in the middle. Performance tools, such as an automatic performance analyzer and a performance data visualizer, are used for tuning cache performance, while the hardware monitors trace the memory traffic and supply the information these tools need to do their work. The middleware between them provides an interface for requesting monitoring data.

Since the core of this infrastructure, the hardware monitor, is still under development, a simulation environment is essential. It must be capable of offering performance data as accurate as that of the hardware monitors, allowing the early development of performance tools and optimization techniques.

A critical issue concerning simulation is the instrumentation technique. In the case of our source instrumentation with Augmint [12], the assembler code generated by the compiler for a program is parsed and modified: for each memory reference, a simulator call is inserted to switch control between target application execution and simulation processing. However, this approach is limited to specific compilers and architectures, due to the assembler format which has to be parsed. Additionally, as the runtime part of Augmint uses its own thread model, different from POSIX threads, we cannot use the OpenMP functionality of compilers which rely on POSIX threads, but need our own OpenMP implementation in the form of the source-to-source translator ADAPTOR [1]. Augmint is also not capable of instrumenting existing library code; it has to be assumed that cache pollution caused by library calls has only minimal influence on the simulation results.
In contrast, runtime instrumentation covers all the code being executed and does not depend on the compiler. The Valgrind runtime instrumentation framework [11] enables us to integrate this approach easily. Together with Valgrind's support for POSIX threads, the OpenMP implementations of different compilers can be compared. Another benefit is the support of the multimedia instruction extensions found in the Pentium III/IV (SSE/SSE2), which Augmint currently cannot handle.

In the next section, we give some details on the two instrumentation techniques used within the EP-Cache project. Afterwards, the cache simulator and the adaptations needed for runtime instrumentation are presented. This is followed by a description of the software infrastructure for processing the performance data acquired by the simulator. In Section 5, experimental results for a sample benchmark code are illustrated. Section 6 discusses related work. Finally, several directions for future research regarding the monitoring infrastructure are laid out.

2 Instrumentation Techniques

Performance characteristics of applications can only partly be observed from the outside, especially if access to the hardware is not given. Even then, any observed data has to be attributed to a source-level code position or data structure to be useful for the application developer. Therefore, instrumentation for performance analysis usually produces application-related metrics while the application is running, optionally relates them to measurements from monitoring hardware, does some preprocessing of the observed data, and dumps the data to the outside world. Instrumentation has to be present when executing the program, but can be added at different phases, namely:

- manually to the application source (by the application programmer),
- manually to library code the application uses,
- automatically by rewriting source code,
- automatically by rewriting compiled code,
- automatically by patching compiled code at runtime, directly before execution.

The methods changing object code are platform dependent, in contrast to source code rewriting, since the instruction stream for a given processor has to be parsed and extended.

2.1 Source Level Instrumentation

Source instrumentation is usually done within the application source code and is commonly used for acquiring performance data. A special, architecture-dependent variant of source instrumentation is done within compiler-generated assembly code and is generally used in simulation research for generating the events of interest.

Augmint [12] is a well-known example of source-level instrumentation. Augmint is a fast execution-driven simulation toolkit for Intel x86 multiprocessor architectures. To generate the memory references and synchronization events needed for modeling the target systems, it deploys a source-level instrumentor called Doctor. Doctor inserts additional instructions at the memory reference points to create memory access events: it takes as input assembly code with the required semantics and outputs an augmented assembly language file. Since compilers use different conventions when generating assembly code, Doctor does not work on top of all compilers. This indicates that source-level instrumentation relies on the compiler and cannot be used for generally applicable simulation tools.
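To illustrate the idea, the following minimal sketch shows the transformation at the C level, although Doctor actually performs it on assembly code: every memory reference reports its address and size to the simulator before the access takes place. The counting stubs for sim_read()/sim_write() are stand-ins for the simulator entry points described in Section 3.

```c
#include <stddef.h>

/* Stand-ins for the cache simulator entry points (see Section 3). */
static unsigned long reads, writes;
static void sim_read(unsigned long addr, size_t size)  { (void)addr; (void)size; reads++;  }
static void sim_write(unsigned long addr, size_t size) { (void)addr; (void)size; writes++; }

double a[1024];
double sum;

/* Instrumented form of "for (i = 0; i < n; i++) sum += a[i];":
 * a simulator call precedes every load and store. */
void instrumented_loop(int n)
{
    for (int i = 0; i < n; i++) {
        sim_read((unsigned long)&a[i], sizeof a[i]); /* load of a[i] */
        sim_read((unsigned long)&sum, sizeof sum);   /* load of sum  */
        sum += a[i];
        sim_write((unsigned long)&sum, sizeof sum);  /* store to sum */
    }
}
```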
2.2 Runtime Instrumentation

Runtime instrumentation always produces some overhead while running an application, and it destroys the original partitioning of wallclock time over the execution of single code ranges of the original application. However, it is adequate for driving hardware simulations, where wallclock time is not used at all. In addition, runtime instrumentation works directly on unmodified executables and hence does not rely on compilers and, in some cases, not on the programming model either. An example of such a runtime instrumentation tool is Valgrind.

Valgrind [11] is a CPU emulator, running mostly in a main loop which executes instrumented versions of code ranges from the target program to be supervised. The code range granularity is the basic block, i.e. control flow changes only happen at the end of these code pieces. The exact instrumentation depends on the use case and can be specified in custom plugins, called skins. When a code range of the target program is to be executed next and no instrumented version exists in Valgrind's instrumentation cache, it goes through the following steps:

1. parse the IA-32 instructions and translate them into a virtual RISC instruction set,
2. let the custom instrumentation function insert additional instructions into the RISC instruction sequence (e.g. calls to C functions),
3. retranslate the sequence into IA-32 code and put it into the instrumentation cache.

After this, the newly instrumented code range is executed immediately. The RISC instruction set is chosen for its comfortable instrumentation possibilities: there is no limit on the number of virtual registers available, and special treatment of memory accesses needs checking of only a few instructions, in contrast to the large number of IA-32 instructions which can involve memory accesses.
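The sketch below shows the general shape of such an instrumentation pass in step 2. All IR types and helpers here (insn_t, block_t, insert_call_before) are invented for illustration and do not correspond to Valgrind's actual skin API; the point is only that, on the RISC-like form, memory accesses can be identified by inspecting a handful of opcodes.

```c
#include <stddef.h>

/* Hypothetical RISC-like IR, invented for illustration. */
typedef enum { OP_LOAD, OP_STORE, OP_ALU, OP_BRANCH } opcode_t;

typedef struct {
    opcode_t op;
    int      addr_reg;  /* virtual register holding the effective address */
    int      size;      /* access size in bytes */
} insn_t;

typedef struct {
    insn_t *insns;
    int     count;
} block_t;

/* Handlers supplied by the cache simulator. */
extern void sim_read(unsigned long addr, size_t size);
extern void sim_write(unsigned long addr, size_t size);

/* Hypothetical framework helper: emit, in front of instruction i, a call
 * to fn with the runtime value of register reg and the constant size as
 * arguments. Shifts later instructions and updates bb->count. */
extern void insert_call_before(block_t *bb, int i,
                               void (*fn)(unsigned long, size_t),
                               int reg, int size);

/* Instrumentation pass: walk one translated basic block and insert a
 * simulator call in front of every memory access. */
void instrument_block(block_t *bb)
{
    for (int i = 0; i < bb->count; i++) {
        insn_t *ins = &bb->insns[i];
        if (ins->op == OP_LOAD || ins->op == OP_STORE) {
            void (*fn)(unsigned long, size_t) =
                (ins->op == OP_LOAD) ? sim_read : sim_write;
            insert_call_before(bb, i, fn, ins->addr_reg, ins->size);
            i++;  /* skip past the original access, now shifted by one */
        }
    }
}
```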

3 Flexible Cache Simulation for OpenMP

As described above, an OpenMP simulation and monitoring environment is needed for the EP-Cache project. Our first approach is based on source-level instrumentation, together with an existing multiprocessor simulation tool built on Augmint. The second approach is based on Valgrind and therefore does not rely on compilers; it is hence the one currently used within this project for acquiring performance data.

3.1 The Initial Approach

The first version of the cache simulator is built on top of a self-developed simulation system called SIMT [14]. SIMT is a multiprocessor simulator modeling the parallel execution of applications on shared memory machines. As it focuses on research in the area of the memory system, it contains a detailed simulator modeling the complete memory hierarchy, including caches with an arbitrary number of levels and various cache coherence protocols. In addition, it simulates hardware monitors that can be connected to any level of the memory hierarchy and that are capable of tracing full access histograms. Based on these monitoring facilities, SIMT provides accurate information about runtime accesses, enabling users to analyze the memory and cache access behavior of applications and to optimize them for better performance.

SIMT uses Augmint as its frontend for generating memory references and hence relies on Augmint's mechanisms, which introduces some limitations. One limitation is caused by Augmint's instrumenter Doctor, which depends on the compilers that generate the assembly code. Another limitation comes from Augmint's thread modeling mechanism: Augmint does not simulate standard thread libraries, such as the POSIX threads used by most shared memory programming models. Rather, it uses its own thread structure to model user-level threads. Correspondingly, applications have to use specific macros to express parallelism within the source code. This limitation also exists within SIMT, since SIMT directly applies the thread structure of Augmint. Hence, SIMT depends on the programming model, and OpenMP directives cannot be simulated by SIMT directly. To tackle this problem, we had to modify the OpenMP library of the project compiler ADAPTOR [1], a source-to-source Fortran OpenMP compiler, to translate the OpenMP parallel pragmas into Augmint-specific macros. In addition, the mechanisms for thread scheduling and synchronization had to be modified as well. In this way, an OpenMP simulation environment has been established. The cache simulator based on Augmint is capable of providing valuable performance data about the runtime memory accesses of OpenMP programs. However, due to Doctor, it is specific to ADAPTOR and the Intel Fortran compiler ifc [9].

3.2 The Independent Simulator

In order to provide a general-purpose OpenMP cache simulator, we chose Valgrind in the second step of this research work. Valgrind itself actually contains a cache simulator, which supports two cache hierarchy levels with fixed write-back behavior and an LRU replacement policy. However, this simulator models uniprocessor systems, not SMPs. To enable SMP simulation, we had to integrate the memory hierarchy simulator of SIMT into Valgrind's supervision process. Fortunately, SIMT provides a simple interface between its frontend, the memory reference generator, and the memory hierarchy simulator. This interface consists of the functions that implement the detailed operations of a memory access in the caches and main memory. These functions are called by the frontend whenever an event is generated: for example, when a read event is generated, the function sim_read() is called, and when a write event is generated, sim_write() is called.
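As a rough illustration of what happens behind such an interface, the following minimal sketch simulates one level of a set-associative cache with LRU replacement. The structure and all names are our own simplification, not SIMT's actual code; coherence, write-back handling, and further cache levels are omitted.

```c
#include <stdbool.h>

#define SETS      128  /* e.g. 16 KB / (64-byte line * 2 ways) = 128 sets */
#define WAYS      2
#define LINE_BITS 6    /* 64-byte cache lines */

typedef struct {
    unsigned long tag;
    bool          valid;
    unsigned      age;   /* higher value = used longer ago */
} line_t;

static line_t cache[SETS][WAYS];
static unsigned long hits, misses;

/* Look up one access; on a miss, evict the least recently used way. */
void sim_access(unsigned long addr)
{
    unsigned long line = addr >> LINE_BITS;
    unsigned set       = line % SETS;
    unsigned long tag  = line / SETS;
    int victim = 0;

    for (int w = 0; w < WAYS; w++) {
        line_t *l = &cache[set][w];
        if (l->valid && l->tag == tag) {      /* hit */
            hits++;
            for (int v = 0; v < WAYS; v++)
                cache[set][v].age++;          /* age the whole set...   */
            l->age = 0;                       /* ...but this way is MRU */
            return;
        }
        if (cache[set][w].age > cache[set][victim].age)
            victim = w;                       /* remember the LRU way */
    }
    misses++;                                 /* miss: replace LRU way */
    cache[set][victim].valid = true;
    cache[set][victim].tag   = tag;
    for (int v = 0; v < WAYS; v++)
        cache[set][v].age++;
    cache[set][victim].age = 0;
}
```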
To switch the frontend from Augmint to Valgrind, these functions have been rewritten to form a pure interface that does not rely on the runtime environment of Augmint. On the Valgrind side, the instrumentation plugin MemAccess has been developed as a bridge to the SIMT cache simulator. MemAccess traps all memory accesses performed during execution and then calls application-defined handlers, in our use case the SIMT functions. The handlers are linked into the target application, but are not themselves subject to instrumentation, as they are registered with the Valgrind layer to be called directly when memory access events happen. The communication mechanism between a target application and the Valgrind core is called a Client Request: by using a C macro, a special no-op instruction sequence is compiled into the target code, which is detected by the Valgrind instrumentor and leads to a call into a skin-supplied function. Thus, if an application is executed without MemAccess, no handlers will be called, i.e. no events are generated for the cache simulator. The basic client request of MemAccess is the registration of the handler functions to be called on memory accesses. Actual notification has to be explicitly switched on or off by another client request. The handler functions, which include the cache simulator, fill up event buffers depending on the monitoring configuration. For these buffers to be read by the upper monitoring layer, which is subject to instrumentation, the notification has to be switched off temporarily.
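The sketch below shows how a target application might use this mechanism. The macro names MEMACCESS_REGISTER and MEMACCESS_NOTIFY are hypothetical stand-ins for MemAccess's actual client requests; real Valgrind client requests are issued through a similar C macro that compiles to a special no-op sequence, so the program runs unchanged when the plugin is absent.

```c
#include <stddef.h>

/* SIMT interface functions; stubbed here so the sketch is self-contained. */
void sim_read(unsigned long addr, size_t size)  { (void)addr; (void)size; }
void sim_write(unsigned long addr, size_t size) { (void)addr; (void)size; }

/* Hypothetical client-request macros, stand-ins for the real MemAccess
 * requests. Without the plugin they are no-ops, so the application runs
 * normally and no events are generated. */
#define MEMACCESS_REGISTER(rd, wr) ((void)(rd), (void)(wr))
#define MEMACCESS_NOTIFY(on)       ((void)(on))

int main(void)
{
    /* Register the cache simulator entry points as access handlers. */
    MEMACCESS_REGISTER(sim_read, sim_write);

    MEMACCESS_NOTIFY(1);  /* start generating events                    */
    /* ... region of interest: every load/store is now reported ...     */
    MEMACCESS_NOTIFY(0);  /* stop, e.g. while reading the event buffers */

    return 0;
}
```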

Figure 1. Software infrastructure of data processing (from top to bottom: performance visualisation tools and user application; MRI; epapi; Monitor Control Component with histogram chain and ring buffers; below the software/hardware boundary, the low level monitor API and monitors 0..n attached to CPUs 0..n).

4 Data Processing

Cache simulation usually involves a large amount of overhead for any application, even of medium size, since every memory access has to be handled. This overhead increases strongly if the simulator additionally generates unmanageably huge amounts of data. The situation is especially critical for this work, because hardware monitors capable of tracing every single event are used to collect the memory access information. These monitors can be selectively connected to each location of the memory hierarchy and hence can create a significant amount of data when working in full tracing mode. In addition, the raw monitoring data is based on single events and physical addresses, and can therefore not be used directly by performance tools and applications for performance tuning.

In order to solve this problem, a multilayer software infrastructure has been designed within EP-Cache to transform the original performance data into a high-level abstraction with respect to data structures. This not only results in smaller amounts of data, but also simplifies its use at the high level. The software infrastructure of the data processing layers is shown in Figure 1. As illustrated in this figure, the monitoring data is first delivered from the hardware monitor to a ring buffer. This is done through the low level driver and API of the hardware monitors. From there, the data is sorted by the Monitor Control Component and stored into a histogram chain that is ordered by the addresses of the corresponding memory blocks of cache line size, so-called memory lines. In addition, the Monitor Control Component combines the monitoring data from different monitors and possibly different processors, and translates physical to virtual addresses. On top of this component, the epapi library further processes the monitoring data into statistical forms such as total counts of single events and access histograms. Finally, the Monitoring Request Interface (MRI) maps the virtual addresses onto data structures, using the context table provided by some compilers or using the debugging information. Also within this component, the final high-level data abstraction is delivered to performance tools and applications for performance analysis.

As the primary data processing component, the epapi library provides a set of functions not only for data processing but also for configuring the monitoring devices. Similar to the PAPI library [4] for hardware counters, epapi is capable of generating statistical numbers for individual events like cache hits and misses. Besides this, epapi generates access histograms recording the occurrence of single events over the complete working set at the granularity of cache lines. These histograms can be used to find critical regions where individual metrics, like cache locality, show poor performance.

As an information request interface, the main function of MRI is to allow tools and applications to specify runtime requests that hold definitions of the required information, such as the information type, the source code region, and the name of the data structure concerned. The runtime information type can be individual events or access histograms, in combination with program regions.
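To make the histogram idea concrete, the sketch below counts events per memory line over a fixed address window. epapi's real interface and data layout are certainly richer; the names and the three fixed event classes here are our own simplification.

```c
#include <stdio.h>

#define LINE_BITS 6    /* 64-byte memory lines */
#define NLINES    100  /* observed window: 100 memory lines */

static unsigned long base;             /* start address of the window */
static unsigned long l1_hits[NLINES];  /* per-line event counters     */
static unsigned long l2_hits[NLINES];
static unsigned long mem_accesses[NLINES];

/* Called by the simulator for every classified access;
 * level is 1 (L1 hit), 2 (L2 hit), or 3 (served by local memory). */
void histogram_count(unsigned long addr, int level)
{
    unsigned long line = (addr - base) >> LINE_BITS;
    if (line >= NLINES)
        return;                        /* outside the observed window */
    if (level == 1)      l1_hits[line]++;
    else if (level == 2) l2_hits[line]++;
    else                 mem_accesses[line]++;
}

/* Dump the histogram, one memory line per row (cf. Figure 2). */
void histogram_print(void)
{
    for (int i = 0; i < NLINES; i++)
        printf("line %3d: L1 %8lu  L2 %8lu  MEM %8lu\n",
               i, l1_hits[i], l2_hits[i], mem_accesses[i]);
}
```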
Typical program regions are program units, loops, parallel regions, and function call sites. According to the requests, the MRI layer calls the appropriate epapi functions and delivers the information to the consumer via a push and/or pull interface.

Figure 1 shows the final design with hardware cache monitor(s). To arrive at the runtime simulation approach described in this paper, two changes are needed:

1. the Low Level Monitor API is replaced by the cache simulator, which now writes the event data to the ring buffers instead of the hardware monitors, and
2. the hardware monitors are replaced by the Valgrind layer with the MemAccess instrumentation plugin, which now generates the events instead of signals on cache chips.

In the simulation approach, there is no need for a real SMP system, as multiple cache hierarchies are simulated as well.

5 Results

Based on the OpenMP simulation environment described above, we have studied a set of applications from various benchmark suites. As a case study, the MG code from the NAS OpenMP benchmark suite [6] is chosen in this section to show a few sample uses of the simulation tool. MG uses a multigrid algorithm to compute the solution of the three-dimensional scalar Poisson equation. The algorithm works on a set of grids ranging from coarse to fine resolution, with the finest grid dominating the computation time. For this study, MG is simulated with 32x32x32 as the finest grid resolution. The simulated target architecture is a 4-node symmetric multiprocessor (SMP) system, where each processor node has a 16 KB, 2-way L1 cache and a 512 KB, 4-way L2 cache. The caches are kept coherent using the MESI protocol, which invalidates the cache copies in all other processors at each write operation.

First, the simulation environment provides results about various events for a single simulated run. These events include hits, misses, or total accesses, with respect to reads, writes, or both. For example, LC1_READ_HIT is an event that gives the number of read operations that hit in the level 1 cache. Events can be specified for the complete program or restricted to a single code segment or memory area. This allows studying critical regions and arrays in the working set that cause performance problems.

For the C version of the MG code, compiled with omcc (from the Omni OpenMP compiler suite, using GNU gcc as backend), Table 1 shows the absolute numbers for the events listed in its first column.

Table 1. Results on single events of the C version (per-processor counts for L1 ACCESS TOTAL, L1 ACCESS MISS, L1 INVALIDATION, L2 ACCESS MISS, and L2 INVALIDATION on Proc. 0 to Proc. 3).

It can be seen that only a small number of L1 misses happen on the whole system, with miss rates of 2.4%, 2.2%, 2.6%, and 2.3% for the four processors. It can also be seen that processor 1 performs fewer invalidations and fewer L1 misses than the other processors. This indicates that the cache coherence behavior correlates with the locality characteristics of the code. For the second level cache, a significant number of misses can be observed, with an average miss rate of 11% across all processors. Again, cache line invalidation is a critical reason for these misses; the numbers in the last line of Table 1 clearly show this.

Besides individual events such as those listed in Table 1, the OpenMP simulation environment provides information in the form of memory access histograms at the granularity of memory lines. Figure 2 illustrates such a histogram, recording the access locations of the first 100 memory lines of array v in the MG code.

Figure 2. Access histogram of array v (first 100 memory lines).

As illustrated in Figure 2, the access histogram depicts the accesses to the working set at the granularity of the cache line size. The x-axis shows the accessed locations, while the corresponding numbers of accesses to L1, L2, and the local memory are presented on the y-axis. Since it clearly exhibits the different behavior of each location in the memory hierarchy, this histogram is capable of directing the user towards an optimized placement of the data structures within the source code.
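The per-processor invalidation counts discussed above come from the write-invalidate step that a MESI-style simulation performs on every store. The following sketch shows only that step, drastically simplified: a single "present" bit per line stands in for the full Modified/Exclusive/Shared/Invalid states, and the per-set cache lookup is omitted; all names are our own.

```c
#include <stdbool.h>

#define NPROCS 4
#define NLINES 8192  /* simplified: caches indexed directly by line number */

static bool present[NPROCS][NLINES];
static unsigned long invalidations[NPROCS];

/* A store by processor p invalidates the copies in all other caches. */
void sim_write_line(int p, unsigned long line)
{
    for (int q = 0; q < NPROCS; q++) {
        if (q == p)
            continue;
        if (present[q][line]) {
            present[q][line] = false;  /* remote copy invalidated...      */
            invalidations[q]++;        /* ...and counted; the next access
                                          by q to this line will miss     */
        }
    }
    present[p][line] = true;           /* the writer keeps/gets the line  */
}
```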
Finally, the simulation environment allows studying the impact of different compilers and programming languages on an individual application. As an example, we give the simulation results for MG in both a C version and a Fortran version, compiled using the Omni OpenMP compilers omcc and omf77 [10]. The Fortran version was also compiled with the Intel compiler ifc. Table 2 presents the real execution times of both versions with the different compilers, run on a 1.5 GHz Athlon, together with simulation results.

                   execution time
  C (omcc)         0.06 s
  Fortran (ifc)    0.03 s
  Fortran (omf77)  0.09 s

Table 2. Real execution times and simulation results (total accesses, total L1 misses) for the different programming languages and compilers.

The real execution times have to be small to keep the simulation times reasonable; currently, the slowdown factor is unfortunately around 1000. The overhead comes mainly from the simulation of the hardware monitors, since all memory references have to be traced and counter arrays updated depending on the configuration. Still, it can be seen that the Intel Fortran version runs three times as fast as the Omni Fortran version, with the Omni C version in between. This can be explained by the low number of memory references issued by the Intel compiler, which is only 30% of the number issued by the Omni C compiler and 22% of the number issued by the Omni Fortran compiler. Obviously, memory access characteristics correlate quite strongly with real execution times.

6 Related Work

The basic concept of runtime instrumentation has been used by several tools; an example is Shade [5]. Other techniques for seamless instrumentation of binaries are ATOM [7], which does binary rewriting, and DynInst [3], which can patch binaries while they are running. Runtime instrumentation is most often used for debugging purposes and for analysis tools in software development. Other usages are processor migration and virtualization, hardware fault injection, or even runtime optimization.

A range of research work has been reported on cache simulators. They are used either as a standalone tool, such as Valgrind, or combined with a complete architecture simulator. Examples are RSIM and MICA. RSIM [13], the Rice Simulator for ILP Multiprocessors, simulates shared memory multiprocessors (and uniprocessors) built from processors that aggressively exploit instruction-level parallelism (ILP). RSIM models a two-level cache hierarchy, but supports only a hardware directory-based cache coherence protocol. MICA [8] is a memory and network simulation environment for cache-based PC platforms. MICA extends Limes, a multiprocessor simulator, with the simulation of L2 caches and uses it to create a trace file including all memory references. However, MICA's cache simulator is quite simple and does not enable the modeling of various cache organizations. In addition, most of the multiprocessor simulators use special macros for parallelization, and none of them supports OpenMP execution, in contrast to our simulator.

Many simulation systems have been developed over the last years. Prominent systems include the comprehensive simulation tool SimOS from Stanford University, the memory-oriented SIMICS from the Swedish Institute of Computer Science, and the Wisconsin Wind Tunnel (WWT) from the University of Wisconsin. These larger simulators model the computer hardware in high detail, allowing complete parallel architectures to be evaluated. All of these systems, however, are designed for specific architectures and purposes, and cannot be used straightforwardly for our research on memory systems. Besides this, none of them supports the execution of OpenMP programs.

Modern processors are equipped with hardware counters that can also supply information about memory accesses. Platform-independent libraries like PAPI [4] and PCL [2] are available for reading these counters. However, hardware counters can only provide limited access to special events. In contrast, our approach offers more insight, e.g. through the access histogram, which allows detecting the regions and reasons causing locality problems.

7 Summary and Future Research

In this paper we have introduced a flexible, general cache simulator for understanding the cache access behavior of OpenMP applications. This work is based on a runtime instrumentation tool and a multiprocessor simulator: we have modified SIMT's cache simulator to be driven by events generated using the Valgrind framework, and have thus established a compiler-independent simulation environment for OpenMP.

This environment is capable of providing detailed information about the runtime cache access behavior. In combination with the data processing infrastructure, this information allows detecting critical code segments, data structures, and memory regions that cause cache locality problems.

However, this work is currently still in its initial phase. The original Augmint approach has one benefit: it is easy to extend the instrumentation to deliver additional source information, such as the bounds of the data structures used. This data is needed to handle data-symbol oriented monitoring requests, which is a crucial feature of the MRI layer of our monitoring infrastructure. Within Valgrind, however, this information has to be fetched from the debug information included in the binary. This is especially tricky for dynamically allocated memory.

Further improvement concerns the simulation infrastructure. One missing piece is an intelligent mapping of running threads to processors, as we simulate separate cache hierarchies per processor. Currently, only a static round-robin mapping is supported. This gives useful results only when there are as many processors as threads, and all threads have enough work all the time. If these assumptions do not hold, there will be a large discrepancy between reality and the simulation results.
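A minimal sketch of this static round-robin mapping, under the assumption that each simulated processor owns its own cache state (the names are ours, not SIMT's):

```c
#include <stddef.h>

#define NPROCS 4

/* Per-processor cache state; details omitted in this sketch. */
typedef struct { unsigned long accesses; } cache_state_t;

static cache_state_t proc_cache[NPROCS];

/* Static round-robin: thread t is pinned to processor t mod NPROCS, so
 * all of its memory events go into that processor's simulated hierarchy.
 * Only meaningful while #threads <= #processors and all threads are busy. */
static cache_state_t *cache_for_thread(int thread_id)
{
    return &proc_cache[thread_id % NPROCS];
}

/* Route one memory event from a thread to "its" processor's cache. */
void sim_read_threaded(int thread_id, unsigned long addr, size_t size)
{
    cache_state_t *c = cache_for_thread(thread_id);
    c->accesses++;    /* here: the cache lookup of addr, as sketched above */
    (void)addr; (void)size;
}
```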
Another issue which has to be resolved for practical reasons is the current slowdown of the implemented simulator. There is some potential for improvement: one possibility, for example, seems to be parallelizing the simulation process itself.

Acknowledgements

We would like to thank Julian Seward for his excellent runtime instrumentation framework Valgrind, which made the work described in this paper possible.

References

[1] ADAPTOR. High Performance Fortran Compilation System.
[2] R. Berrendorf and B. Mohr. PCL - The Performance Counter Library: A Common Interface to Access Hardware Performance Counters on Microprocessors. Version 2.2, online PCL documentation.
[3] B. P. Miller, M. Callaghan, J. M. Cargille, et al. The Paradyn Parallel Performance Measurement Tool. IEEE Computer, 28(11):37-46, November 1995.
[4] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. A Portable Programming Interface for Performance Evaluation on Modern Processors. The International Journal of High Performance Computing Applications, 14(3), Fall 2000.
[5] B. Cmelik and D. Keppel. Shade: A Fast Instruction Set Simulator for Execution Profiling. In SIGMETRICS, Nashville, TN, USA, 1994.
[6] D. Bailey et al. The NAS Parallel Benchmarks. Technical Report RNR, Department of Mathematics and Computer Science, Emory University, March 1994.
[7] A. Eustace and A. Srivastava. ATOM: A Flexible Interface for Building High Performance Program Analysis Tools.
[8] H. C. Hsiao and C. T. King. MICA: A Memory and Interconnect Simulation Environment for Cache-based Architectures. In Proceedings of the 33rd IEEE Annual Simulation Symposium (SS 2000), April 2000.
[9] Intel Corporation. Intel Fortran Compiler for Linux. Available at compilers/flin/.
[10] K. Kusano, S. Satoh, and M. Sato. Performance Evaluation of the Omni OpenMP Compiler. In Proceedings of the International Workshop on OpenMP: Experiences and Implementations (WOMPEI), volume 1940 of LNCS, 2000.
[11] N. Nethercote and J. Seward. Valgrind: A Program Supervision Framework. In Proceedings of the Third Workshop on Runtime Verification (RV 03), Boulder, Colorado, USA, July 2003.
[12] A.-T. Nguyen, M. Michael, A. Sharma, and J. Torrellas. The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures. In Proceedings of the 1996 International Conference on Computer Design, October 1996.
[13] V. S. Pai, P. Ranganathan, S. V. Adve, and T. Harton. An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 12-23, October 1996.
[14] J. Tao, M. Schulz, and W. Karl. A Simulation Tool for Evaluating Shared Memory Systems. In Proceedings of the 36th ACM Annual Simulation Symposium, Orlando, Florida, April 2003.


Main Memory. Electrical and Computer Engineering Stephen Kim ECE/IUPUI RTOS & APPS 1

Main Memory. Electrical and Computer Engineering Stephen Kim ECE/IUPUI RTOS & APPS 1 Main Memory Electrical and Computer Engineering Stephen Kim (dskim@iupui.edu) ECE/IUPUI RTOS & APPS 1 Main Memory Background Swapping Contiguous allocation Paging Segmentation Segmentation with paging

More information

Summary: Issues / Open Questions:

Summary: Issues / Open Questions: Summary: The paper introduces Transitional Locking II (TL2), a Software Transactional Memory (STM) algorithm, which tries to overcomes most of the safety and performance issues of former STM implementations.

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE Michael Repplinger 1,2, Martin Beyer 1, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken,

More information

CHAPTER 8: MEMORY MANAGEMENT. By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

CHAPTER 8: MEMORY MANAGEMENT. By I-Chen Lin Textbook: Operating System Concepts 9th Ed. CHAPTER 8: MEMORY MANAGEMENT By I-Chen Lin Textbook: Operating System Concepts 9th Ed. Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the

More information

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of

More information

Postprint. This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden.

Postprint.   This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden. http://www.diva-portal.org Postprint This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden. Citation for the original published paper: Ceballos, G., Black-Schaffer,

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and 64-bit Architectures Example:

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Handout 3 Multiprocessor and thread level parallelism

Handout 3 Multiprocessor and thread level parallelism Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed

More information

A Test Suite for High-Performance Parallel Java

A Test Suite for High-Performance Parallel Java page 1 A Test Suite for High-Performance Parallel Java Jochem Häuser, Thorsten Ludewig, Roy D. Williams, Ralf Winkelmann, Torsten Gollnick, Sharon Brunett, Jean Muylaert presented at 5th National Symposium

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information