Sparse Matrix Operations on Multi-core Architectures


Carsten Trinitis (1), Tilman Küstner (1), Josef Weidendorfer (1), and Jasmin Smajic (2)

(1) Lehrstuhl für Rechnertechnik und Rechnerorganisation, Institut für Informatik, Technische Universität München, Germany
{Carsten.Trinitis,Tilman.Kuestner,Josef.Weidendorfer}@in.tum.de
(2) ABB Corporate Research Center, Baden-Daettwil, Switzerland
Jasmin.Smajic@ch.abb.com

Abstract. This paper compares various contemporary multi-core based microprocessor architectures with different memory interconnects regarding performance, speedup, and parallel efficiency. Sparse matrix operations are used as a benchmark application from the area of electrical engineering. Within this context, thread to core pinning and cache optimization are two important aspects which are investigated in more detail.

Keywords: Multi-core, pinning, cache optimization, performance optimization, sparse matrices.

1 Introduction

Sparse matrix operations can be among the most difficult applications in numerical simulation. Within this research field, an application from electrical engineering has been analyzed, utilizing various tools developed at Technische Universität München, namely within the Munich Multicore Initiative (MMI). An OpenMP [6] based parallel version of the code was investigated with regard to possible performance improvements using MMI's tools. Because of a non-disclosure agreement with the project partner, all simulations were conducted with synthetic model data.

Within recent years, a trend towards multi-core architectures, with currently four cores for a standard x86 based architecture, can be observed. In order to fully utilize the potential of such machines, applications must be parallelized and analyzed with regard to runtime, speedup, and parallel efficiency.
With multicore architectures becoming more and more complex, it is crucial to compare different hardware architectures with regard to how well they suit the application under investigation. Thus, six contemporary x86 based architectures have been investigated for a sparse matrix solver.

The remainder of this paper is organized as follows: Section 2 gives a brief introduction to sparse matrix problems, section 3 gives an overview of the hardware which was used for benchmarking, and section 4 presents and analyzes runtimes, speedup, and parallel efficiency obtained on these systems. Section 5 provides an analysis of the program's cache behavior, and section 6 concludes and gives an outlook on future work.

V. Malyshkin (Ed.): PaCT 2009, LNCS 5698, pp , © Springer-Verlag Berlin Heidelberg 2009

2 Sparse Matrix Operations

Making a sparse matrix application scale is a fairly difficult task. In the case discussed in this paper, less than one per cent of the entries in the matrix are nonzero, and they are distributed according to a pattern given by the electrical engineering application. This makes it difficult to access the entries during the calculations in a cache friendly way.

The first code for this particular application was developed in the mid eighties [8], when the primary focus in the code structure was to minimize the matrix bandwidth and thus the number of floating point operations, i.e. to keep the number of fill-ins to a minimum [2]. Fill-ins are additional matrix entries created during Gauss elimination or LDU factorization that were zero in the original matrix. While it is still important to avoid unnecessary floating point operations, memory performance has improved at a much slower pace than CPU performance since the initial code was written. This means that the number of floating point operations may no longer be the biggest bottleneck when it comes to application scalability.

This application uses the Markowitz criterion [4] to minimize the number of fill-ins during the LDU factorization [8]. It performs a large number of sparse matrix operations that are totally independent of each other, i.e. it is not the LDU operations themselves that have been parallelized; it is the independent calculations that execute in parallel.

3 Hardware Environment

Comprehensive benchmark tests were carried out on six different systems with different hardware architectures. In the list below, the nickname of each system is given, followed by the processor type and the amount of main memory.

- Nehalem: 2 × Intel Xeon X5570 (Gainestown, quad-core (8 logical cores with HyperThreading enabled), 2.93 GHz, 8 MB shared L3 cache, 2 × QPI), 12 GB DDR3 RAM
- Dunnington: 4 × Intel Xeon X7460 (Dunnington, hexa-core, 2.66 GHz, 16 MB shared L3 cache), 1066 MHz FSB, 32 GB DDR2 RAM
- Shanghai: 2 × AMD Opteron 2376 (Shanghai, quad-core, 2.41 GHz, 6 MB shared L3 cache, 2 × HyperTransport 3.0), 32 GB DDR3 RAM

- Barcelona: 2 × AMD Opteron 2352 (Barcelona, quad-core, 2.11 GHz, 2 MB shared L3 cache, 2 × HyperTransport 3.0), 16 GB DDR2 RAM
- X4600: Sun Fire X4600 M2 with 8 × AMD Opteron 8218 (Santa Rosa, 2.60 GHz, dual-core, 1 MB L3 cache per core, 3 × HyperTransport), 64 GB DDR2 RAM
- Clovertown: 2 × Intel Xeon X5355 (Clovertown, quad-core, 2.66 GHz, 4 MB L2 cache shared across two cores), 1333 MHz FSB, 8 GB DDR2 RAM

As can be seen from this list, the processor architectures comprise the latest Intel architecture, codenamed Nehalem, as well as the latest AMD processor, Shanghai, and some older processor types by both Intel and AMD. These systems represent two different architectures:

- A front side bus (FSB) based system, as depicted in figure 1, represented by the Clovertown and Dunnington systems.
- A NUMA like system, as depicted in figure 2, represented by the Nehalem, Barcelona, Shanghai, and X4600 systems.

Fig. 1. Front Side Bus based system
Fig. 2. NUMA like system

4 Measurements

As reported in previous research work carried out by the Munich Multicore Initiative at LRR-TUM, thread to core pinning has a non-negligible impact on parallel program performance on multicore architectures. The optimal pinning can vary significantly, depending on the processor, on the overall system architecture, and on the cache hierarchy [7], [5]. With the autopin tool developed by MMI, thread to core pinning on all available architectures was thoroughly tested for our sparse matrix operations. To pin a thread to a specific processor core, autopin makes use of the system call sched_setaffinity. This prevents threads from moving between cores, which would result in poor cache usage. In some cases it is desirable not to use all cores on a chip, e.g. in order to avoid pinning threads to cores which share the same cache or which are located on the same chip. For details on cache usage see section 5.

The first set of measurements focused on total program runtime. These measurements were carried out on all six architectures with eight parallel threads. The pinning order used here was the optimal pinning order as determined by autopin, i.e. 0, 4, 1, 5, 2, 6, 3, 7, with the core numbers denoting the cores as depicted in figures 1 and 2. As can be seen from figure 3a, Intel's Nehalem architecture shows the best performance, followed by AMD Shanghai and Barcelona. Here, all available cores were utilized, i.e. one thread was pinned to each core. This also applies to the Clovertown system. For the Dunnington and X4600 systems (with 24 and 16 available cores, respectively), the optimal pinning for eight threads was determined with MMI's autopin tool. It turned out to use one core of each dual-core chip on the X4600 architecture, and two cores of each hexa-core chip on the Dunnington architecture, chosen such that they do not share a common L2 cache.

Fig. 3. Total time (a) and speedup (b). (a) Total time (in seconds) with one (gray) and eight (black) threads. (b) Speedup with eight threads.

Next, parallel efficiency was investigated by determining the speedup factors on all six architectures with the same pinning. For eight threads, most systems showed a speedup between 5.0 and 5.5. The poorest parallel efficiency was measured on the Clovertown system with a speedup of only 2.6, whereas the Dunnington system performed best with a speedup of 6.3.
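The relation between the speedup figures quoted above and parallel efficiency can be made explicit with a few lines of arithmetic. The following sketch is not part of the original study. It assumes the usual definitions (speedup S = T1 / Tp, parallel efficiency E = S / p), is written in Python for brevity, and uses hypothetical function names. Only the speedup values 2.6 and 6.3 and the thread count of eight are taken from the text.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_1 / T_p (standard definition, assumed here)."""
    return t_serial / t_parallel


def parallel_efficiency(s, n_threads):
    """Parallel efficiency E = S / p, as a fraction between 0 and 1."""
    return s / n_threads


# Speedups with eight threads as reported in the text.
REPORTED_SPEEDUP = {"Clovertown": 2.6, "Dunnington": 6.3}

if __name__ == "__main__":
    for system, s in REPORTED_SPEEDUP.items():
        e = parallel_efficiency(s, 8)
        print(f"{system}: speedup {s:.1f}, efficiency {e:.1%}")
```

Under these definitions, the Clovertown result corresponds to an efficiency of roughly one third, and the Dunnington result to an efficiency of a little under 80 per cent.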
The Dunnington system's good performance is attributed to its large 16 MB last-level cache. It must be noted, however, that only 8 threads were run on a 4 × 6 core system, i.e. one third of the cores was utilized on each socket, leaving 8 MB of cache per active core. When one thread is pinned to each of the 24 cores, parallel efficiency drops to 16%, which is due to the high load on the memory link. Hence, autopin showed that the application scales best on the Dunnington system when two cores per socket are utilized. This also applies to the overall runtime: on the 24-core Dunnington system, the best runtime was achieved with 8 threads at 2 threads per socket. Therefore, it is advisable to investigate the target architecture with regard to optimal pinning before utilizing all available cores, as additional cores do not deliver additional performance in certain cases.

Fig. 4. Total runtime, parallel efficiency, and speedup for Shanghai

Figure 3 depicts the total runtime for one and eight threads as well as the speedup for eight threads on all six architectures. Figures 4 and 5 compare total runtime, speedup, and parallel efficiency for the Shanghai and Nehalem systems in more detail. In these figures, the respective optimal pinning, starting from one core, has been used. The pinning order is depicted in the diagrams. Thus, e.g. for the Shanghai system (see fig. 4), 1 thread was pinned to core #0, 2 threads were pinned to cores #0 and #4, 3 threads to cores #0, #4, and #5, etc. For these two latest AMD and Intel systems, parallel efficiency is above 90 per cent for two cores and above 80 per cent for up to four cores, which is due to the fact that threads are always pinned to cores in such a way that the available cache is used optimally.
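The pinning mechanism used throughout this section can be illustrated with a minimal sketch. It is not the autopin tool and does not reproduce its search for an optimal order. It only shows how a given pinning order maps thread counts to cores and how a single pin can be requested. Python's os.sched_setaffinity, which is available on Linux, wraps the system call named in the text. The helper names cores_for and pin_to_core are hypothetical, and the default order is the eight-thread order quoted above.

```python
import os

# Eight-thread pinning order quoted in the text (core IDs as in Figs. 1 and 2).
PINNING_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]


def cores_for(n_threads, order=PINNING_ORDER):
    """Return the cores used when n_threads threads follow the given order."""
    if not 1 <= n_threads <= len(order):
        raise ValueError("n_threads must be between 1 and len(order)")
    return order[:n_threads]


def pin_to_core(core_id):
    """Try to pin the caller to one core; return True on success.

    Returns False when the platform lacks sched_setaffinity (non-Linux)
    or when core_id is not among the cores the caller may run on.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    if core_id not in os.sched_getaffinity(0):
        return False
    os.sched_setaffinity(0, {core_id})  # pid 0 means "the caller"
    return True


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "threads ->", cores_for(n))
```

In a threaded program, each worker would call pin_to_core with its own entry from cores_for before starting its share of the independent calculations. A different order, such as the one shown for the Shanghai system in figure 4, can be passed through the order parameter.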

Fig. 5. Total runtime, parallel efficiency, and speedup for Nehalem

5 Analysis of Cache Behavior

In addition to the measurements presented in the previous section, the application's cache behavior was analyzed in order to spot further bottlenecks and obtain possible performance improvements. For simulating the application's cache usage, the tools Callgrind/KCachegrind [10], [9], which were developed in the DiME [1] project and extended by MMI, were used. Callgrind is part of the open-source project Valgrind [3], which consists of tools for correctness checking and profiling built on an infrastructure for dynamic runtime instrumentation. Three test cases were examined:

- A single thread running on a core with 4 MB L2 cache.
- Four threads running on a quad-core processor with 4 MB shared L2 cache (as on the systems investigated in previous sections).
- Four threads running on a quad-core processor with 16 MB shared L2 cache.

The main objective of these test cases was to determine the data sharing characteristics of the given OpenMP parallelization. The test cases realistically approximate the behavior on multi-core architectures with shared last-level caches. The main difference between reality and the simulations carried out with MMI's cache analysis tools is that a shared last-level cache is normally realized as a third cache level, but this has no impact on the simulation results.

The simulation showed almost exactly the same number of instruction fetches and data references in all test cases when aggregated over all threads. This comes as no surprise, as the same input data was used in all cases, but it also proves the comparability of the L2 cache misses, which are displayed in table 1.

Table 1. Simulated L2 cache misses

           1 thread, 4 MB L2   4 threads, 4 MB L2   4 threads, 16 MB L2
Total      987,490,834         1,380,396,029        57,413,638
airflowb   174,203,            ,674,                ,145
lqdflowb   184,610,            ,806,                ,186

The first row of table 1, which gives the cache misses of the entire program, shows a major drop when moving to the larger cache size of 16 MB: with four threads, the number of misses falls by a factor of about 24. This means that this amount of cache is capable of holding the input data of the examined bus model. It also correlates with the good performance numbers on the Dunnington system. When moving from one to four threads sharing the smaller 4 MB cache, the increase in cache misses is much less pronounced (about 40%). This is because a considerable amount of data can be shared across the threads. The data also shows that the parallelization does not increase the total memory space requirement compared with the sequential version. Clearly, shared caches are beneficial for the given application.

Regarding potential performance bottlenecks in the code, the functions airflowb and lqdflowb were identified as the ones which caused the most cache misses. In test cases one and two (small L2 cache) these two functions contribute nearly a quarter (22% to 24%) of all cache misses. In the simulations carried out for the 16 MB cache, however, they account for only 5.6% of the total misses.

6 Conclusions and Future Work

Six contemporary multicore architectures were compared using a parallel reference application for sparse matrix solvers. The application was parallelized with a shared memory model under OpenMP. Thread to core pinning and cache optimization were investigated with regard to the application under consideration.
The investigations showed that, depending on the processor architecture as well as on the memory interconnect, it is not always advisable to utilize all available cores in a system. With regard to cache optimization, it has been shown that potential bottlenecks can be easily detected with MMI's simulation tools.

References

1. DiME DFG Project, Web Page,
2. Cuthill, E., McKee, J.: Reducing the bandwidth of sparse symmetric matrices. In: ACM Annual Conference/Annual Meeting, Proceedings of the 24th national conference, pp (1969)
3. The Valgrind Developers. Valgrind Web Page,
4. Duff, I.S., Erisman, A.M., Reid, J.K.: Direct Methods for Sparse Matrices. Oxford University Press, Oxford (1986)
5. Klug, T., Ott, M., Weidendorfer, J., Trinitis, C.: autopin - automated optimization of thread-to-core pinning on multicore system 3 (2008)
6. OpenMP.org. The OpenMP API specification for parallel programming,
7. Ott, M., Klug, T., Weidendorfer, J., Trinitis, C.: autopin - automated optimization of thread-to-core pinning on multicore systems. In: First Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG). Workshop proceedings, 1st Multiprog workshop, Gothenburg, Sweden (January 2008)
8. Tinney, W.F., Brandwajn, V., Chan, S.M.: Sparse vector methods. IEEE Transactions on Power Apparatus and Systems PAS-104(2) (February 1985)
9. Weidendorfer, J.: KCachegrind Web Page,
10. Weidendorfer, J., Kowarschik, M., Trinitis, C.: A tool suite for simulation based analysis of memory access behavior. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS LNCS, vol. 3038, pp Springer, Heidelberg (2004)
