Low-Level Monitoring and High-Level Tuning of UPC on CC-NUMA Architectures


Low-Level Monitoring and High-Level Tuning of UPC on CC-NUMA Architectures

Ahmed S. Mohamed
Department of Electrical and Computer Engineering
The George Washington University
Washington, DC

Abstract: We experiment with various techniques for monitoring and tuning UPC programs while porting the NAS NPB benchmark, using the recently developed GCC-SGI UPC compiler on the Origin O3800 NUMA machine. The performance of the NAS NPB in the SGI NUMA environment is compared to previous NAS NPB statistics on a Compaq multiprocessor. The SGI NUMA environment has provided new opportunities for UPC. For example, the spectrum of performance analysis and profiling tools within the SGI NUMA environment made it possible to develop new monitoring and tuning strategies that aim at improving the efficiency of parallel UPC applications. Our objective is to be able to project the physically monitored parameters back to the data structures and high-level program constructs within the source code. This increases a programmer's ability to effectively understand, develop, and optimize programs, enabling an exact analysis of a program's data and code layouts. Using this visualized information, programmers are able to further optimize UPC programs with better data and thread layouts, potentially resulting in significant performance improvements. Furthermore, the SGI CC-NUMA environment provided memory consistency optimizations to mask the latency of remote accesses, convert aggregate accesses into more efficient bulk operations, and cache data locally. UPC allows programmers to specify memory accesses with "relaxed" consistency semantics. These explicit consistency "hints" are exploited very effectively by the CC-NUMA environment to hide latency and further reduce coherence overheads by allowing, for example, two or more processors to modify their local copies of shared data concurrently and merging the modifications at synchronization operations. This characteristic alleviates the effect of false sharing.

Key Words: UPC, NAS, Latency, Privatization

1- Introduction

Unified Parallel C (UPC) is an explicit parallel extension of the ANSI C programming language designed for high performance computing on large-scale parallel machines [1]. Its two primary advantages are that programmers can get very close to the hardware, so that they can easily attain optimized performance, and that there is a large body of existing C code that can be parallelized. The language provides a uniform programming model for both shared and distributed memory hardware, since it is based on a distributed shared memory programming model. Its communication model is based on the idea of a shared, partitioned address space, where variables may be directly read and written by multiple processors, but each variable is physically associated with a single processor. UPC uses a Single Program Multiple Data (SPMD) model of computation in which the amount of parallelism is fixed at program startup time, typically with a single thread of execution per processor. The philosophy behind UPC is to start with C and keep all of its powerful concepts and features, then add parallelism, capitalizing on the experience gained from previous and current parallel C dialects such as Split-C [2], Cilk [3], and AC [4]. The performance of a UPC parallel program depends on five main factors: (i) whether the parallelism within a program is explored/exposed efficiently to its potential.
It is always important that a parallel developer explores/exposes parallelism in a program to the extreme so that a multiprocessor system can be highly utilized. Parallelism can be indicated explicitly by the UPC programmer, recognized by a parallel UPC compiler, or discovered automatically by the UPC runtime system. (ii) whether the tasks to be performed are evenly allocated among processors. Load balancing is one of the key factors that affect system performance. The motivation for load balancing is to reduce the average completion time of processes and improve the utilization of processors. (iii) whether the UPC shared data used by all parallel processes/threads is rationally distributed among the memories of the processors. If a process/thread accesses remote memory very frequently, the proportion of communication overhead to execution time will be relatively high and parallel performance suffers. (iv) whether the running tasks are not cooperating as a team and are thus creating much synchronization overhead. If processes/threads spend too much time waiting for each other (barriers) or waiting on data locks, we need to restructure our computation/data

layout. (v) whether UPC I/O and the file system present a performance bottleneck. Many parallel applications require high-performance I/O to avoid negating some or all of the benefit derived from parallelizing their computation. The UPC I/O interface should enable efficient use of an underlying parallel file system by allowing parallel applications to describe complex I/O requests at a high level. UPC programs therefore need extensive tuning with respect to the above five factors. This requires, however, that monitoring information about a UPC application's behavior be collected. While the performance of message-passing codes can easily be assessed and optimized using standard instrumentation tools (e.g. MPI-trace), the same task is much more difficult for the distributed shared memory UPC environment. This stems from the fact that distributed shared memory communication is performed at runtime through transparently issued store and load operations to remote data locations and is handled completely by the NUMA hardware. In addition, distributed shared memory communication is very fine grain, making code instrumentation that records each global memory operation practically infeasible, as it would slow down the execution significantly and thereby distort the final monitoring to the point where it is unusable for an accurate performance analysis. In this paper we propose a monitoring and tuning strategy that aims at improving the efficiency of parallel UPC applications. Our objective is to be able to project the physically monitored parameters back to the data structures and high-level program constructs within the source code. This increases a programmer's ability to effectively understand, develop, and optimize programs, enabling an exact analysis of a program's data and code layouts. Using this visualized information, programmers are able to detect the communication bottlenecks and further optimize programs with better data and code layouts, potentially resulting in significant performance improvements. The problem is that all five factors affecting the performance of parallel programs interact in complex ways, which makes tuning at the high level difficult. Analyzing low-level hardware counters to diagnose high-level parallel code is also not easy. Efficient distributed shared memory programming is one of the main motivations of the UPC project. In order to achieve this goal, both the low-level hardware and the high-level mechanisms have to be taken into consideration. Figure 1 shows the various layers involved in both UPC program development and execution. Our objective is to collect low-level monitoring information from the lower layers and, after analyzing this information, be able to tune UPC programs at the higher layers. In this work, we make use of SGI NUMA monitoring facilities (e.g. program profilers, IRIX system calls or services) to obtain performance counters such as the number of cache misses; process and thread management statistics; lock and synchronization information such as mean and peak lock time and peak barrier wait time; low-level data transfer statistics; detailed access counts for pages and memory regions; communication hot-spot statistics; and access-behavior histograms on virtual pages. Inappropriate data allocation can, for example, be easily detected via the different heights of the paging histogram's columns. For example, page 300 (virtual number) is located in processor 1's memory but accessed only by processor 2.
It is therefore incorrectly allocated and should be placed in processor 2's memory. Monitoring is a direct and efficient way to show where the program spends most of its execution time: time spent resolving cache misses, time spent waiting at barriers, time spent waiting on remote accesses, time spent waiting for I/O, time spent waiting for locks, etc. In UPC, we assume that the programmer can specify the correct code/data layout for his application. A UPC programmer is always given a list of possible optimization hints such as:

a- Space privatization: use private pointers instead of shared pointers when dealing with local shared data (through casting and assignments).
b- Block moves: use block copies instead of copying elements one by one with a loop, through string operations or structures.
c- Latency hiding: overlap remote accesses with local processing using split-phase barriers.

In this work we would like to verify such optimization hints and come up with new, more justifiable ones. We do this while reporting on our experience in porting the NAS NPB benchmark to the recently developed GCC-SGI UPC compiler. The performance of the NAS NPB in the SGI NUMA environment is also compared to previous NAS NPB statistics on a Compaq multiprocessor.

2- Experimental work

The test-bed machine on which we ran the experimental work is an O3800 NUMA system with 32 MIPS R14000 processors at 500 MHz, each with 2-way set-associative 32 KB instruction and 32 KB data caches and an 8 MB secondary cache, 8 GB of SDRAM, and 3.2 GB/s total memory bandwidth, running IRIX 6.5. For the Compaq comparison we used a Compaq AlphaServer SC with 32 Alpha EV68 processors at 833 MHz, each with a 64 KB instruction cache, a 64 KB data cache, and 8 MB of secondary cache per processor, 24 GB of SDRAM, and 5.2 GB/s total memory bandwidth, running Tru64 Unix.
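Before turning to the ported benchmarks, the following minimal UPC sketch illustrates the kind of data-layout decision that the paging histograms discussed above are meant to expose; the array names and the block size BLK are hypothetical, and a UPC 1.x compiler is assumed. The block size declared for a shared array determines which thread's memory each element has affinity to, so a histogram showing one thread repeatedly touching pages that live in another thread's memory points at a declaration whose layout does not match the access pattern.

#include <upc_relaxed.h>          /* relaxed consistency for all shared accesses */

#define BLK 128                   /* hypothetical block size (elements per block) */

/* Default (cyclic) layout: element i has affinity to thread i % THREADS, so a
   thread sweeping a contiguous index range touches mostly remote memory. */
shared double cyclic_a[BLK * THREADS];

/* Blocked layout: each thread owns contiguous chunks of BLK elements, so the
   loop below touches mostly local memory. */
shared [BLK] double blocked_a[BLK * THREADS];

int main(void)
{
    int i;
    /* The affinity expression &blocked_a[i] makes iteration i execute on the
       thread in whose memory blocked_a[i] resides. */
    upc_forall (i = 0; i < BLK * THREADS; i++; &blocked_a[i])
        blocked_a[i] = 2.0 * i;
    upc_barrier;
    return 0;
}

With the blocked declaration and the matching affinity expression, each thread's page-access histogram is dominated by pages in its own memory; with the default cyclic declaration, most of a thread's accesses in the same loop would be remote.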

2.1 Porting NAS NPB-2.3 on the GCC-SGI UPC V1.10 Compiler

The GCC-SGI UPC V1.10 compiler implementation effort has been undertaken recently [3]. UPC had two previous compilers: the Compaq UPC compiler and the Cray UPC compiler. The Origin UPC compiler is based on a modified version of the gcc C compiler that calls specific functions for each basic UPC operation. The gcc UPC toolset provides a compilation and execution environment for UPC programs on the Origin NUMA machine. The current version extends the capabilities of the GNU gcc compiler. The gcc UPC compiler is implemented as a C language dialect translator, in a fashion similar to the implementation of the GNU Objective C compiler. This release was developed exclusively for use on SGI workstations and servers running the MIPS instruction set, the IRIX (release 6.5 or higher) operating system, and the MIPS2 32-bit ABI. Supported systems include Origin 2000 super-servers and Octane workstations. By default, this release of the gcc UPC compiler supports systems with as many as 256 independent processing units. We made use of this compiler in the experimental section.

Figure 2 in the appendix shows the execution time of the NAS NPB-2.3 benchmark on the Origin 3800 test-bed using the GCC-SGI UPC compiler. The benchmark was tested on 1, 4, 9, 16, 25, and 32 processors. The first number in each experiment represents the execution time of the given program, and the second number represents the time spent in the communication routines. For the kernels (CG, EP, FT, IS, and MG) only collective functions take place in this communication layer. For the applications (BT, LU, and SP), the second number is the time spent in the upc_memget() and upc_memput() routines. Figure 3 shows the same benchmark on the Compaq machine. The same two numbers computed in each Origin experiment were also computed here. The execution times on one processor using plain C code on both the Origin and Compaq machines are used to measure the speedups drawn in figures 2 and 3. Speedups in figure 2 are scaled with respect to the single-processor execution times. For NP processors, speedup is computed as T(ref)/T(NP), where T(ref) is taken from the first-column values. The speedups in figure 3 are likewise computed as T(ref)/T(NP) for an NP-processor run, with T(ref) again the single-processor execution time. The SGI UPC shows better performance than the Compaq UPC.

2.2 Low-level Monitoring and High-level Tuning of UPC programs

Figure 4 shows the execution time of three micro-kernel benchmarks, Matrix Multiplication, Sobel Edge Detection, and N-Queens, on 1, 2, 4, 8, and 16 processors of the NUMA O3800 test-bed with and without the following performance tuning adjustments (a UPC sketch of these hints is given below):

a- Space privatization: use private pointers instead of shared pointers when dealing with local shared data (through casting and assignments). Here we run two experiments, one before the optimization and another after.
b- Block moves: use block copies instead of copying elements one by one with a loop, through string operations or structures. Two experiments are demonstrated, one before and another after the optimization.
c- Latency hiding: overlap remote accesses with local processing using split-phase barriers. Two experiments are demonstrated, one before and another after the optimization.
d- Block pre-fetching: use block gets and puts. Two experiments are demonstrated, one before and another after the optimization.

Figure 4 plots the total runtime as our measure of the performance enhancement due to the various optimizations in the above benchmarking.
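As a concrete illustration of hints (a) through (d), the sketch below shows what they look like in UPC source. It is a minimal example rather than the benchmark code itself: the array a, its per-thread block size N, and the neighbor exchange are hypothetical, and a UPC 1.x compiler is assumed.

#include <upc_relaxed.h>   /* relaxed consistency: lets the runtime reorder and merge remote accesses */

#define N 256
shared [N] double a[N * THREADS];   /* one contiguous block of N elements per thread */
double local_buf[N];                /* private (per-thread) buffer for bulk transfers */

void tuned_step(void)
{
    int i, neighbor;
    double local_sum = 0.0;

    /* (a) Space privatization: a pointer-to-shared whose target has affinity to
       this thread can be cast to an ordinary C pointer, avoiding the overhead
       of shared-pointer arithmetic on every access. */
    double *mine = (double *)&a[MYTHREAD * N];
    for (i = 0; i < N; i++)
        mine[i] = (double)i;

    /* (c) Latency hiding with a split-phase barrier: signal arrival, overlap
       independent local work with other threads' progress, then wait. */
    upc_notify;
    for (i = 0; i < N; i++)
        local_sum += mine[i];
    upc_wait;

    /* (b)+(d) Block move / block prefetch: fetch a neighbor's whole block with
       a single bulk upc_memget instead of N fine-grain remote reads. */
    neighbor = (MYTHREAD + 1) % THREADS;
    upc_memget(local_buf, &a[neighbor * N], N * sizeof(double));

    (void)local_sum;   /* silence unused-variable warnings in this sketch */
}

The unoptimized counterpart of the same step would use fine-grain accesses through pointers-to-shared and a single upc_barrier, which roughly corresponds to the "before optimization" runs described above.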
The figure shows that both the fully optimized and the pointer-optimized versions outperform the unoptimized versions. Table 3 breaks these execution times into contributions from user time (which includes all local consistency time), time spent waiting at barriers, non-overlapped time spent waiting on faults, and non-overlapped time spent waiting for locks. Table 3 displays hardware counters that were collected from a number of system profiling tools found on the Origin machine. These profiling tools are:

Prof: a profiling tool for the time spent in both user and runtime system routines.
Speedshop (formerly SGI Pixie): collects data from the hardware counters with very little overhead for every processor; it can be used to detect possible load imbalance or extreme losses due to parallel overhead.
Perfex: reports the number of events that occur during the execution of a program, such as TLB misses, cache misses, etc.
Co-Pilot: a graphical interface that displays the activity in the various parts of the system during the execution of a parallel program.

The NAS parallel benchmarks (NPB) were developed by the Numerical Aerodynamic Simulation (NAS) program at NASA Ames Research Center for the performance evaluation of parallel supercomputers. The NPB mimics the computation and data movement characteristics of large-scale computational fluid dynamics (CFD) applications. The NPB comes in two flavors, NPB 1 and NPB 2. NPB 1 comprises the original "pencil and paper" benchmarks: vendors and others implement the detailed specifications in the NPB 1 report, using algorithms and programming models appropriate to their different machines. NPB 2, on the other hand, consists of MPI-based source-code implementations written and distributed by NAS; they are intended to be run with little or no tuning. Another implementation of NPB 2 is NPB 2-serial: single-processor (serial) source-code implementations derived from NPB 2 by removing all parallelism [NPB]. We have therefore used NPB 2 in our MPI execution time measurements. NPB 2-serial was used to provide the uniprocessor performance when

reporting on the scalability of MPI. The NPB suite consists of five kernels (EP, MG, FT, CG, IS) and three pseudo-application (LU, SP, BT) programs. The bulk of the computation in IS is integer arithmetic; the other benchmarks are floating-point intensive. A brief description of each workload is presented in this section.

BT (Block Tri-diagonal) is a simulated CFD application that uses an implicit algorithm to solve the 3-dimensional (3-D) compressible Navier-Stokes equations. The finite-difference solution to the problem is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y and z dimensions. The resulting systems are block-tridiagonal with 5x5 blocks and are solved sequentially along each dimension. BT uses coarse-grain communications.

SP (Scalar Penta-diagonal) is a simulated CFD application with a structure similar to BT. The finite-difference solution to the problem is based on a Beam-Warming approximate factorization that decouples the x, y and z dimensions. The resulting system has scalar penta-diagonal bands of linear equations that are solved sequentially along each dimension. SP uses coarse-grain communications.

LU (Block Lower Triangular) is a simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven-block-diagonal system, resulting from finite-difference discretization of the Navier-Stokes equations in 3-D, by splitting it into block lower and upper triangular systems. LU performs a large number of small communications of five words each.

FT (Fast Fourier Transform): this benchmark solves a 3-D partial differential equation using an FFT-based spectral method, also requiring long-range communication. FT performs three one-dimensional (1-D) FFTs, one for each dimension.

MG (MultiGrid): the MG benchmark uses a V-cycle multigrid method to compute the solution of the 3-D scalar Poisson equation. It performs both short- and long-range communications that are highly structured.

CG (Conjugate Gradient): this benchmark computes an approximation to the smallest eigenvalue of a symmetric positive definite matrix. This kernel features unstructured grid computations requiring irregular long-range communications.

EP (Embarrassingly Parallel): this benchmark can run on any number of processors with little communication. It estimates the upper achievable limits for the floating-point performance of a parallel computer. It generates pairs of Gaussian random deviates according to a specific scheme and tabulates the number of pairs in successive annuli.

IS (Integer Sorting): this benchmark is a parallel sorting program based on bucket sort. It requires a lot of total-exchange communication.

There are different versions/classes of the NPB, such as Sample, Class A, Class B, and Class C. These classes differ mainly in the size of the problem. Tables 1 and 2 give the problem sizes and performance rates (measured in Mflop/s) for each of the eight benchmarks, for the Class A and Class B problem sets, on a single-processor Cray YMP:

(a) Table 1: Class A workloads (smaller version): benchmark name, problem size, operation count, and Mflop/s rate for EP, MG, CG, FT, IS, LU, SP, and BT.
(b) Table 2: Class B workloads (bigger version): benchmark name, problem size, operation count, and Mflop/s rate for EP, MG, CG, FT, IS, LU, SP, and BT.

Table 3 contains a number of results. First, the hardware counters of the unoptimized version are compared against the counters of the fully optimized version.
We show the case of matrix multiplication for NP=1 and NP=4. The statistics for NP=4 represent the counters of one of the 4 processors. Careful analysis of these counters reveals a number of important observations. For the NP=1 matrix multiplication, for example, the numbers of loads and stores are two orders of magnitude lower in the optimized version, which means the optimization hints have led to fewer memory references. This is also reflected in the cache behavior: although the numbers of cache misses are of the same order of magnitude, the number of quadwords written back from the cache is one order of magnitude lower in the optimized version, and the number of prefetch primary data cache misses is four orders of magnitude lower. The total number of decoded instructions is one order of magnitude lower in the optimized version, which means the processor had fewer instructions to execute. The number of executed prefetch instructions is four orders of magnitude larger in the optimized version, which means far more memory accesses were prefetched and therefore stalled the processor for less time. The number of mispredicted branches is one order of magnitude lower in the optimized version, which means the pipelines were less often exposed to pipeline hazards. For the NP=4 matrix multiplication, the numbers of loads and stores are lower by

an order of magnitude in the four-processor case. This is also reflected in the cache misses at both the level 1 and level 2 caches.

3- Conclusion

In this paper we have reported on our experience in porting the NAS NPB benchmark using the recently developed GCC-SGI UPC compiler on an Origin O3800 NUMA machine. The performance of the NAS NPB in the SGI NUMA environment is compared to previous NAS NPB statistics on a Compaq multiprocessor. We are currently working on a UPC-I/O reference implementation on top of ROMIO (the MPI-I/O implementation on the SGI NUMA system) and a UPC parallel I/O test suite for testing the reference implementation. The test suite is a set of I/O performance tests written using UPC-I/O routines; it includes a low-level class of tests and the matrix tests in the kernel class.

4- References

[1] T. A. El-Ghazawi, W. W. Carlson, and J. M. Draper. UPC Language Specifications V1.0.
[2] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. Von Eicken, and K. Yelick. Introduction to Split-C. University of California, Berkeley.
[3] R. C. Miller. A Type-Checking Preprocessor for Cilk 2, a Multi-threaded C Language. Master's Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
[4] W. W. Carlson and J. M. Draper. Distributed Data Access in AC. Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), Santa Barbara, CA, July 19-21, 1995.

Appendix

Figures 2 and 3 (SGI GCC-UPC v1.10 vs. Compaq UPC v1.7): per-benchmark plots of the SGI and Compaq runs (EP-A, BT-A, CG-B, SP-A, FT-A, IS-A), each shown against the SEQ-LINEAR reference.

Figure 4: execution time (sec) of the Matrix Multiplication and Sobel Edge Detection micro-kernels with no optimization (NO OPT.), pointer optimization (PTR OPT.), and full optimization (FULL OPT.).

Table 3: Low-level monitoring for the Matrix Multiplication micro-kernel, No-Optimization vs. All-Optimization, for NP=1 and NP=4 (based on a 600 MHz IP0 MIPS R12000/R14000 CPU). Columns: event counter name, No Opt. NP=1, All Opt. NP=1, No Opt. NP=4, All Opt. NP=4. The counters reported are: cycles; executed prefetch instructions; decoded instructions, loads, and stores; miss handling table occupancy; failed store conditionals; resolved conditional branches; quadwords written back from the scache; correctable scache data array ECC errors; primary and secondary instruction cache misses; instruction misprediction from the scache way-prediction table; external interventions and invalidations; ALU/FPU progress cycles; graduated instructions, loads, stores, store conditionals, and floating-point instructions; prefetch primary data cache misses; quadwords written back from the primary data cache; TLB misses; mispredicted branches; primary and secondary data cache misses; data misprediction from the scache way-prediction table; state of intervention and invalidation hits in the scache; and store/prefetch exclusive to clean and to shared blocks in the scache. Derived statistics for the same four configurations: graduated instructions per cycle; graduated floating-point instructions per cycle; graduated loads and stores per cycle; graduated loads and stores per floating-point instruction; mispredicted branches per resolved conditional branch; graduated loads per decoded load (and prefetches); graduated stores per decoded store; data and instruction mispredictions per scache hit; L1 and L2 cache line reuse; L1 and L2 data cache hit rates; L1-L2 bandwidth used (MB/s, average per process); memory bandwidth used (MB/s, average per process); MFLOPS (average per process); cache misses in flight per cycle (average); and prefetch cache miss rate.
