Low-Level Monitoring and High-Level Tuning of UPC on CC-NUMA Architectures
Ahmed S. Mohamed
Department of Electrical and Computer Engineering
The George Washington University
Washington, DC

Abstract: We experiment with various techniques for monitoring and tuning UPC programs while porting the NAS NPB benchmarks with the recently developed GCC-SGI UPC compiler on the Origin O3800 NUMA machine. The performance of the NAS NPB in the SGI NUMA environment is compared to previous NAS NPB statistics on a Compaq multiprocessor. The SGI NUMA environment has provided new opportunities for UPC. For example, the spectrum of performance-analysis and profiling tools within the SGI NUMA environment made it possible to develop new monitoring and tuning strategies aimed at improving the efficiency of parallel UPC applications. Our objective is to project the physically monitored parameters back to the data structures and high-level program constructs within the source code. This increases a programmer's ability to effectively understand, develop, and optimize programs, enabling an exact analysis of a program's data and code layouts. Using this visualized information, programmers can further optimize UPC programs with better data and thread layouts, potentially resulting in significant performance improvements. Furthermore, the SGI CC-NUMA environment provides memory-consistency optimizations to mask the latency of remote accesses, convert aggregate accesses into more efficient bulk operations, and cache data locally. UPC allows programmers to specify memory accesses with "relaxed" consistency semantics. These explicit consistency "hints" are exploited very effectively by the CC-NUMA environment to hide latency and further reduce coherence overheads by allowing, for example, two or more processors to modify their local copies of shared data concurrently and merging the modifications at synchronization operations.
This characteristic alleviates the effect of false sharing.

Key Words: UPC, NAS, Latency, Privatization

1- Introduction

Unified Parallel C (UPC) is an explicit parallel extension of the ANSI C programming language designed for high-performance computing on large-scale parallel machines [1]. Its two primary advantages are that programmers can get very close to the hardware, so they can readily attain optimized performance, and that a large body of existing C code can be parallelized. The language provides a uniform programming model for both shared- and distributed-memory hardware, since it is based on a distributed shared-memory programming model. Its communication model is based on the idea of a shared, partitioned address space, where variables may be directly read and written by multiple processors, but each variable is physically associated with a single processor. UPC uses a Single Program Multiple Data (SPMD) model of computation in which the amount of parallelism is fixed at program startup time, typically with a single thread of execution per processor. The philosophy behind UPC is to start with C, keeping all its powerful concepts and features, and then add parallelism, capitalizing on the experience gained from previous and current parallel C dialects such as Split-C [2], Cilk [3], and AC [4]. The performance of a UPC parallel program depends on five main factors: (i) whether the parallelism within a program is explored and exposed to its full potential. It is always important that a parallel developer expose parallelism in a program to the extreme, so that a multiprocessor system can be highly utilized. Parallelism can be indicated explicitly by the UPC programmer, recognized by a parallelizing UPC compiler, or discovered automatically by the UPC runtime system. (ii) whether the tasks to be performed are evenly allocated among processors. Load balancing is one of the key factors that affect system performance.
The motivation of load balancing is to reduce the average completion time of processes and improve the utilization of processors. (iii) whether the UPC shared data used by all parallel processes/threads is rationally distributed among the memories of the processors. If a process/thread accesses a remote memory very frequently, the proportion of communication overhead to execution time will be relatively high, and parallel performance worsens. (iv) whether the running tasks are cooperating as a team rather than creating excessive synchronization overhead. If processes/threads spend too much time waiting for each other (at barriers) or waiting on data locks, we need to restructure our computation/data
layout. (v) whether UPC I/O and the file system present a performance bottleneck. Many parallel applications require high-performance I/O to avoid negating some or all of the benefit derived from parallelizing their computation. The UPC I/O interface should enable efficient use of an underlying parallel file system by allowing parallel applications to describe complex I/O requests at a high level. UPC programs therefore need extensive tuning with respect to the above five factors. This requires, however, that monitoring information about a UPC application's behavior be collected. While the performance of message-passing codes can easily be assessed and optimized using standard instrumentation tools (e.g. MPI-trace), the same task for the distributed shared-memory UPC environment is much more difficult. This stems from the fact that distributed shared-memory communication is performed at runtime through transparently issued store and load operations to remote data locations and is handled completely by the NUMA hardware. In addition, distributed shared-memory communication is very fine-grained, making code instrumentation that records each global memory operation practically infeasible: it would slow down the execution significantly and thereby distort the final monitoring to the point where it is unusable for an accurate performance analysis. In this paper we propose a monitoring and tuning strategy that aims at improving the efficiency of parallel UPC applications. Our objective is to project the physically monitored parameters back to the data structures and high-level program constructs within the source code. This increases a programmer's ability to effectively understand, develop, and optimize programs, enabling an exact analysis of a program's data and code layouts. Using this visualized information, programmers can detect communication bottlenecks and further optimize programs with better data and code layouts, potentially resulting in significant performance improvements.
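The fine-grained, implicit nature of this communication can be seen in a minimal UPC sketch (hypothetical array and loop, not taken from the benchmarks): an ordinary-looking array access may compile into a remote load or store that only the NUMA hardware ever observes as communication.

```c
#include <upc_relaxed.h>   /* relaxed consistency for all shared accesses */

#define N 1024
shared double a[N];        /* default cyclic layout: a[i] has affinity to thread i % THREADS */
double local_sum;

int main(void) {
    int i;

    /* Each iteration of upc_forall runs on the thread with affinity
       to a[i], so these reads stay in local memory. */
    upc_forall (i = 0; i < N; i++; &a[i])
        local_sum += a[i];

    /* This plain loop, by contrast, makes most accesses remote: each
       store is a transparent hardware-level transfer that ordinary
       instrumentation never sees as a "message". */
    for (i = MYTHREAD; i < N; i += THREADS)
        a[(i + 1) % N] = local_sum;

    upc_barrier;
    return 0;
}
```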
The problem is that all five factors affecting the performance of parallel programs interact in complex ways, which makes tuning at the high level difficult. Analyzing low-level hardware counters to diagnose high-level parallel code is not easy either. Efficient distributed shared-memory programming is one of the main motivations of the UPC project. To achieve this goal, both the low-level hardware and the high-level mechanisms have to be taken into consideration. Figure 1 shows the various layers involved in both UPC program development and execution. Our objective is to collect low-level monitoring information from the lower layers and, after analyzing this information, tune UPC programs at the higher layers. In this work, we make use of SGI NUMA monitoring facilities (e.g. program profilers, IRIX system calls and services) to obtain performance counters such as the number of cache misses, process and thread management statistics, lock and synchronization information (such as mean and peak lock time and peak barrier wait time), low-level data transfer statistics, detailed access counts for pages and memory regions, communication hot-spot statistics, and access-behavior histograms on virtual pages. Inappropriate data allocation can, for example, be easily detected via the differing heights of the paging histogram's columns. For example, page 300 (virtual number) may be located in processor 1's memory but accessed only by processor 2; it is therefore incorrectly allocated and should be placed in processor 2's memory. Monitoring is a direct and efficient way to exhibit where the program spends most of its execution time: time spent resolving cache misses, time spent waiting at barriers, time spent waiting on remote accesses, time spent waiting for I/O, and time spent waiting for locks, etc.
In UPC, we assume that the programmer can specify the correct code/data layout for his application. A UPC programmer is always given a list of possible optimization hints, such as: a- Space privatization: use private pointers instead of shared pointers when dealing with local shared data (through casting and assignments). b- Block moves: use block copies instead of copying elements one by one in a loop, through string operations or structures. c- Latency hiding: overlap remote accesses with local processing using split-phase barriers. In this work we would like to verify such optimization hints and come up with new, better-justified ones. We do this while reporting on our experience in porting the NAS NPB benchmarks to the recently developed GCC-SGI UPC compiler. The performance of the NAS NPB in the SGI NUMA environment is also compared to previous NAS NPB statistics on a Compaq multiprocessor.

2- Experimental work

The test-bed machine on which we ran the experimental work is an O3800 NUMA system with 32 MIPS R14000 processors at 500 MHz, each with 2-way set associative 32 KB instruction and 32 KB data caches and an 8 MB secondary cache, 8 GB SDRAM, and 3.2 GB/s total memory bandwidth, running IRIX 6.5. For the Compaq comparison we used a Compaq AlphaServer SC with 32 Alpha EV68 processors at 833 MHz, each with a 64 KB instruction cache, a 64 KB data cache, and 8 MB of secondary cache per processor, 24 GB SDRAM, and 5.2 GB/s total memory bandwidth, running Tru64 Unix.
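Hint (a), space privatization, can be illustrated with a short UPC sketch (hypothetical array name; a sketch under the assumption of a static-THREADS compilation, not code from the benchmarks): a shared pointer to data with local affinity is cast once to an ordinary C pointer, so later accesses skip shared-pointer arithmetic entirely.

```c
#include <upc_relaxed.h>

#define N 4096
/* blocked layout: each thread owns one contiguous block of N/THREADS
   elements (assumes THREADS is fixed at compile time) */
shared [N/THREADS] double v[N];

double thread_local_sum(void) {
    int i;
    double s = 0.0;

    /* Privatization: the block with affinity to MYTHREAD is plain local
       memory, so cast its shared address to a private pointer once ... */
    double *p = (double *)&v[MYTHREAD * (N / THREADS)];

    /* ... and index it as ordinary C data, instead of paying the cost
       of shared-address translation on every element access. */
    for (i = 0; i < N / THREADS; i++)
        s += p[i];
    return s;
}
```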
2.1 Porting NAS NPB-2.3 with the GCC-SGI UPC V1.10 Compiler

The GCC-SGI UPC V1.10 compiler implementation effort was undertaken recently [3]. UPC had two previous compilers: the Compaq UPC compiler and the Cray UPC compiler. The Origin UPC compiler is based on a modified version of the gcc C compiler that calls specific functions for each basic UPC operation. The gcc UPC toolset provides a compilation and execution environment for UPC programs on the Origin NUMA machine. The current version extends the capabilities of the GNU gcc (version ) compiler. The gcc UPC compiler is implemented as a C language dialect translator, in a fashion similar to the implementation of the GNU Objective-C compiler. This release was developed exclusively for use on SGI workstations and servers running the MIPS instruction set, the IRIX operating system (release 6.5 or higher), and the MIPS2 32-bit ABI. Supported systems include Origin 2000 super-servers and Octane workstations. By default, this release of the gcc UPC compiler supports systems with as many as 256 independent processing units. We used this compiler in the experimental section. Figure 2 in the appendix shows the execution time of the NAS NPB-2.3 benchmarks on the Origin 3800 testbed using the GCC-SGI UPC compiler. The benchmarks were tested on 1, 4, 9, 16, 25, and 32 processors. The first number in each experiment represents the execution time of the given program, and the second number represents the time spent in the communication routines. For the kernels (CG, EP, FT, IS, and MG) only collective functions take place in this communication layer. For the applications (BT, LU, and SP), the second number is the time spent in the upc_memget() and upc_memput() routines. Figure 3 shows the same benchmarks on the Compaq machine, with the same two numbers computed for each experiment.
The execution times on one processor using plain C code on both the Origin and Compaq machines are used as the reference for the speedups drawn in figures 2 and 3. Speedups in figure 2 are scaled with respect to the single-processor execution times: for NP processors, speedup is computed as T(ref)/T(NP), where T(ref) is taken from the first-column values. The speedups in figure 3 are computed as T(ref)/T(NP) for an NP-processor run, with T(ref) taken analogously from the single-processor execution time. The SGI UPC shows better performance than the Compaq UPC.

2.2 Low-level Monitoring and High-level Tuning of UPC programs

Figure 4 shows the execution time of three micro-kernel benchmarks: Matrix Multiplication, Sobel Edge Detection, and N-Queens, on 1, 2, 4, 8, and 16 processors of the NUMA O3800 testbed with and without the following performance tunings and adjustments: a- Space privatization: use private pointers instead of shared pointers when dealing with local shared data (through casting and assignments); we run two experiments, one before the optimization and one after. b- Block moves: use block copies instead of copying elements one by one in a loop, through string operations or structures; two experiments are reported, one before and one after the optimization. c- Latency hiding: overlap remote accesses with local processing using split-phase barriers; two experiments are reported, one before and one after the optimization. d- Block pre-fetching: use block gets and puts; two experiments are reported, one before and one after the optimization. Figure 4 plots the total runtime as our measure of the performance enhancement due to the various optimizations. The figure shows that both the fully optimized and the pointer-optimized versions outperform the unoptimized versions.
Table 3 breaks these execution times into contributions from user time (which includes all local consistency time), time spent waiting at barriers, non-overlapped time spent waiting on faults, and non-overlapped time spent waiting for locks. Table 3 also displays hardware counters that were collected with a number of system profiling tools found on the Origin machine. These profiling tools are: prof, which profiles the time spent in both user and runtime-system routines; SpeedShop (earlier SGI Pixie), which collects data from hardware counters with very little overhead for every processor and can be used to detect possible load imbalance or extreme losses due to parallel overhead; perfex, which reports the number of events that occur during the execution of a program, such as TLB misses, cache misses, etc.; and Co-Pilot, a graphical interface that displays the activity in the various parts of the system during the execution of a parallel program. The NAS Parallel Benchmarks (NPB) were developed by the Numerical Aerodynamic Simulation (NAS) program at NASA Ames Research Center for the performance evaluation of parallel supercomputers. The NPB mimic the computation and data-movement characteristics of large-scale computational fluid dynamics (CFD) applications. The NPB come in two flavors, NPB 1 and NPB 2. NPB 1 are the original "pencil and paper" benchmarks: vendors and others implement the detailed specifications in the NPB 1 report, using algorithms and programming models appropriate to their different machines. NPB 2, on the other hand, are MPI-based source-code implementations written and distributed by NAS; they are intended to be run with little or no tuning. Another implementation is NPB 2-serial: single-processor (serial) source-code implementations derived from NPB 2 by removing all parallelism [NPB]. We have therefore used NPB 2 in our MPI execution-time measurements. NPB 2-serial was used to provide the uniprocessor performance when
reporting on the scalability of MPI. The NPB suite consists of five kernel (EP, MG, FT, CG, IS) and three pseudo-application (LU, SP, BT) programs. The bulk of the computation is integer arithmetic in IS; the other benchmarks are floating-point intensive. A brief description of each workload is presented in this section. BT (Block Tri-diagonal) is a simulated CFD application that uses an implicit algorithm to solve the 3-dimensional (3-D) compressible Navier-Stokes equations. The finite-difference solution to the problem is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y and z dimensions. The resulting systems are block-tridiagonal with 5x5 blocks and are solved sequentially along each dimension. BT uses coarse-grain communications. SP (Scalar Penta-diagonal) is a simulated CFD application with a structure similar to BT. The finite-difference solution to the problem is based on a Beam-Warming approximate factorization that decouples the x, y and z dimensions. The resulting system has scalar pentadiagonal bands of linear equations that are solved sequentially along each dimension. SP uses coarse-grain communications. LU (Block Lower Triangular) is a simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven-block-diagonal system resulting from a finite-difference discretization of the Navier-Stokes equations in 3-D, by splitting it into block lower and upper triangular systems. LU performs a large number of small communications (five words each). FT (Fast Fourier Transform) solves a 3-D partial differential equation using an FFT-based spectral method, also requiring long-range communication. FT performs three one-dimensional (1-D) FFTs, one for each dimension. MG (MultiGrid) uses a V-cycle multigrid method to compute the solution of the 3-D scalar Poisson equation.
It performs both short- and long-range communications that are highly structured. CG (Conjugate Gradient) computes an approximation to the smallest eigenvalue of a symmetric positive definite matrix. This kernel features unstructured grid computations requiring irregular long-range communications. EP (Embarrassingly Parallel) can run on any number of processors with little communication. It estimates the upper achievable limits for floating-point performance of a parallel computer by generating pairs of Gaussian random deviates according to a specific scheme and tabulating the number of pairs in successive annuli. IS (Integer Sorting) is a parallel sorting program based on bucket sort; it requires a large amount of all-to-all exchange communication. There are different versions/classes of the NPB, such as Sample, Class A, Class B and Class C; these classes differ mainly in the size of the problem. Tables 1 and 2 give the problem sizes for each of the eight benchmarks for the Class A and Class B problem sets (the NPB reports also give operation counts and performance rates, measured in Mflop/s, on a single-processor Cray Y-MP). The following tables show the NPB problem sizes:

(a) Table 1: Class A workloads (smaller version):

Benchmark  Size
EP         2^28
MG         256^3
CG         14,000
FT         256 x 256 x 128
IS         2^23 x 2^19
LU         64^3
SP         64^3
BT         64^3

(b) Table 2: Class B workloads (bigger version):

Benchmark  Size
EP         2^30
MG         256^3
CG         75,000
FT         512 x 256 x 256
IS         2^25 x 2^21
LU         102^3
SP         102^3
BT         102^3

We have a number of results in table 3. First, the hardware counters for the unoptimized version are compared against the counters for the fully optimized version. We show the case of the matrix multiplication for NP=1 and NP=4; the statistics for NP=4 represent the counters of one of the 4 processors. Careful analysis of these counters reveals a number of important observations. For example, for the NP=1 matrix multiplication, the numbers of loads and stores are lower by two orders of magnitude in the optimized version.
That means the optimization hints have led to fewer memory references. This is also reflected in the cache behavior: although the numbers of cache misses are of the same order of magnitude, the number of quadwords written back from the secondary cache is one order of magnitude lower in the optimized version, and the number of prefetch primary data cache misses is four orders of magnitude lower. The total number of decoded instructions is one order of magnitude lower in the optimized version, meaning the processor had fewer instructions to execute. The number of executed prefetch instructions is four orders of magnitude larger in the optimized version, meaning far more of the memory latency was overlapped with computation. The number of mispredicted branches is one order of magnitude lower in the optimized version, meaning the pipelines were less often exposed to pipeline hazards. For the NP=4 matrix multiplication, the numbers of loads and stores are lower by
an order of magnitude in the four-processor case. This is also reflected in the cache misses at both the level 1 and level 2 caches.

3- Conclusion

In this paper we have reported on our experience in porting the NAS NPB benchmarks with the recently developed GCC-SGI UPC compiler on an Origin O3800 NUMA machine. The performance of the NAS NPB in the SGI NUMA environment was compared to previous NAS NPB statistics on a Compaq multiprocessor. We are currently working on a UPC-I/O reference implementation on top of ROMIO (the MPI-I/O implementation on the SGI NUMA) and a UPC parallel I/O test suite for testing the reference implementation. The test suite is a set of I/O performance tests written using UPC-I/O routines; it contains a low-level class of tests and the matrix tests in the kernel class.

4- References

[1] T. A. El-Ghazawi, W. W. Carlson, and J. M. Draper, UPC Language Specifications V1.0, February.
[2] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick, Introduction to Split-C, University of California, Berkeley.
[3] R. C. Miller, A Type-Checking Preprocessor for Cilk 2, a Multi-threaded C Language, Master's Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
[4] W. W. Carlson and J. M. Draper, Distributed Data Access in AC, Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Santa Barbara, CA, July 19-21, 1995.

Appendix

Figures 2 and 3 (SGI GCC-UPC v1.10 vs COMPAQ UPC v1.7): per-benchmark plots for EP-A, BT-A, CG-B, SP-A, FT-A, and IS-A, each showing the SGI curve, the COMPAQ curve, and a SEQ-LINEAR (ideal) reference.

Figure 4: Execution time (sec) of the Matrix Multiplication and Sobel Edge Detection micro-kernels with NO OPT., PTR OPT., and FULL OPT.
Table 3: Low-level monitoring for the Matrix Multiplication micro-kernel: No Optimization / All-Optimization at NP=1 and NP=4, based on a 600 MHz IP0 MIPS R12000/R14000 CPU.

Event counters (columns: No Opt. NP=1, All-Opt NP=1, No Opt. NP=4, All-Opt NP=4):
Cycles; executed prefetch instructions; decoded instructions; decoded loads; decoded stores; miss handling table occupancy; failed store conditionals; resolved conditional branches; quadwords written back from scache; correctable scache data array ECC errors; primary instruction cache misses; secondary instruction cache misses; instruction misprediction from scache way prediction table; external interventions; external invalidations; ALU/FPU progress cycles; graduated instructions; prefetch primary data cache misses; graduated loads; graduated stores; graduated store conditionals; graduated floating point instructions; quadwords written back from primary data cache; TLB misses; mispredicted branches; primary data cache misses; secondary data cache misses; data misprediction from scache way prediction table; state of intervention hits in scache; state of invalidation hits in scache; store/prefetch exclusive to clean block in scache; store/prefetch exclusive to shared block in scache.

Derived statistics (same four columns):
Graduated instructions/cycle; graduated floating point instructions/cycle; graduated loads & stores/cycle; graduated loads & stores/floating point instruction; mispredicted branches/resolved conditional branches; graduated loads/decoded loads (and prefetches); graduated stores/decoded stores; data mispredict/data scache hits; instruction mispredict/instruction scache hits; L1 cache line reuse; L2 cache line reuse; L1 data cache hit rate; L2 data cache hit rate; L1-L2 bandwidth used (MB/s, average per process); memory bandwidth used (MB/s, average per process); MFLOPS (average per process); cache misses in flight per cycle (average); prefetch cache miss rate (inf).
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationPrinciples of Parallel Algorithm Design: Concurrency and Mapping
Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 17 January 2017 Last Thursday
More informationMultiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed
Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationWHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
More informationMain Points of the Computer Organization and System Software Module
Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMultiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationWhatÕs New in the Message-Passing Toolkit
WhatÕs New in the Message-Passing Toolkit Karl Feind, Message-passing Toolkit Engineering Team, SGI ABSTRACT: SGI message-passing software has been enhanced in the past year to support larger Origin 2
More informationMartin Kruliš, v
Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal
More informationWhat are Clusters? Why Clusters? - a Short History
What are Clusters? Our definition : A parallel machine built of commodity components and running commodity software Cluster consists of nodes with one or more processors (CPUs), memory that is shared by
More informationProgramming as Successive Refinement. Partitioning for Performance
Programming as Successive Refinement Not all issues dealt with up front Partitioning often independent of architecture, and done first View machine as a collection of communicating processors balancing
More informationLecture 13: March 25
CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging
More informationBenchmarking CPU Performance
Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance, since it is designed
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationApproaches to Performance Evaluation On Shared Memory and Cluster Architectures
Approaches to Performance Evaluation On Shared Memory and Cluster Architectures Peter Strazdins (and the CC-NUMA Team), CC-NUMA Project, Department of Computer Science, The Australian National University
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationBuilding MPI for Multi-Programming Systems using Implicit Information
Building MPI for Multi-Programming Systems using Implicit Information Frederick C. Wong 1, Andrea C. Arpaci-Dusseau 2, and David E. Culler 1 1 Computer Science Division, University of California, Berkeley
More informationA Local-View Array Library for Partitioned Global Address Space C++ Programs
Lawrence Berkeley National Laboratory A Local-View Array Library for Partitioned Global Address Space C++ Programs Amir Kamil, Yili Zheng, and Katherine Yelick Lawrence Berkeley Lab Berkeley, CA, USA June
More informationPerformance analysis basics
Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis
More informationEvaluating Titanium SPMD Programs on the Tera MTA
Evaluating Titanium SPMD Programs on the Tera MTA Carleton Miyamoto, Chang Lin {miyamoto,cjlin}@cs.berkeley.edu EECS Computer Science Division University of California, Berkeley Abstract Coarse grained
More informationUNIT I (Two Marks Questions & Answers)
UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-
More informationUnit 9 : Fundamentals of Parallel Processing
Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationCost-Performance Evaluation of SMP Clusters
Cost-Performance Evaluation of SMP Clusters Darshan Thaker, Vipin Chaudhary, Guy Edjlali, and Sumit Roy Parallel and Distributed Computing Laboratory Wayne State University Department of Electrical and
More informationCS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
More informationCo-array Fortran Performance and Potential: an NPB Experimental Study. Department of Computer Science Rice University
Co-array Fortran Performance and Potential: an NPB Experimental Study Cristian Coarfa Jason Lee Eckhardt Yuri Dotsenko John Mellor-Crummey Department of Computer Science Rice University Parallel Programming
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationChapter 18 Parallel Processing
Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD
More informationAdaptive Prefetching Technique for Shared Virtual Memory
Adaptive Prefetching Technique for Shared Virtual Memory Sang-Kwon Lee Hee-Chul Yun Joonwon Lee Seungryoul Maeng Computer Architecture Laboratory Korea Advanced Institute of Science and Technology 373-1
More informationAnalyzing the Performance of IWAVE on a Cluster using HPCToolkit
Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationIssues in Multiprocessors
Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel
More informationMultiprocessors & Thread Level Parallelism
Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction
More informationGivy, a software DSM runtime using raw pointers
Givy, a software DSM runtime using raw pointers François Gindraud UJF/Inria Compilation days (21/09/2015) F. Gindraud (UJF/Inria) Givy 21/09/2015 1 / 22 Outline 1 Distributed shared memory systems Examples
More informationParallel Computer Architecture and Programming Written Assignment 3
Parallel Computer Architecture and Programming Written Assignment 3 50 points total. Due Monday, July 17 at the start of class. Problem 1: Message Passing (6 pts) A. (3 pts) You and your friend liked the
More informationIssues in Multiprocessors
Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores SPARCCenter, SGI Challenge, Cray T3D, Convex Exemplar, KSR-1&2, today s CMPs message
More informationMulti-core processors are here, but how do you resolve data bottlenecks in native code?
Multi-core processors are here, but how do you resolve data bottlenecks in native code? hint: it s all about locality Michael Wall October, 2008 part I of II: System memory 2 PDC 2008 October 2008 Session
More informationCache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance
6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationVirtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])
EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,
More informationOnline Course Evaluation. What we will do in the last week?
Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do
More informationCS 1013 Advance Computer Architecture UNIT I
CS 1013 Advance Computer Architecture UNIT I 1. What are embedded computers? List their characteristics. Embedded computers are computers that are lodged into other devices where the presence of the computer
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationDynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle
Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation
More informationModule 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:
The Lecture Contains: Data Access and Communication Data Access Artifactual Comm. Capacity Problem Temporal Locality Spatial Locality 2D to 4D Conversion Transfer Granularity Worse: False Sharing Contention
More informationChapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!
Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationCS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011
CS4961 Parallel Programming Lecture 3: Introduction to Parallel Architectures Administrative UPDATE Nikhil office hours: - Monday, 2-3 PM, MEB 3115 Desk #12 - Lab hours on Tuesday afternoons during programming
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationPrinciples of Parallel Algorithm Design: Concurrency and Mapping
Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction
More informationIssues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance
More informationParallel Algorithm Design. CS595, Fall 2010
Parallel Algorithm Design CS595, Fall 2010 1 Programming Models The programming model o determines the basic concepts of the parallel implementation and o abstracts from the hardware as well as from the
More informationAn Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language
An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language Martin C. Rinard (martin@cs.ucsb.edu) Department of Computer Science University
More informationPerformance Tools for Technical Computing
Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationLecture 15: More Iterative Ideas
Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!
More informationCommunication Characteristics in the NAS Parallel Benchmarks
Communication Characteristics in the NAS Parallel Benchmarks Ahmad Faraj Xin Yuan Department of Computer Science, Florida State University, Tallahassee, FL 32306 {faraj, xyuan}@cs.fsu.edu Abstract In this
More informationPipelining and Vector Processing
Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline
More informationS WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018
S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationEN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)
EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationChapter 18. Parallel Processing. Yonsei University
Chapter 18 Parallel Processing Contents Multiple Processor Organizations Symmetric Multiprocessors Cache Coherence and the MESI Protocol Clusters Nonuniform Memory Access Vector Computation 18-2 Types
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More information