Sparse Matrix Operations on Multi-core Architectures
Carsten Trinitis 1, Tilman Küstner 1, Josef Weidendorfer 1, and Jasmin Smajic 2

1 Lehrstuhl für Rechnertechnik und Rechnerorganisation, Institut für Informatik, Technische Universität München, Germany
{Carsten.Trinitis,Tilman.Kuestner,Josef.Weidendorfer}@in.tum.de
2 ABB Corporate Research Center, Baden-Daettwil, Switzerland
Jasmin.Smajic@ch.abb.com

Abstract. This paper compares various contemporary multi-core based microprocessor architectures with different memory interconnects regarding performance, speedup, and parallel efficiency. Sparse matrix operations are used as a benchmark application from the area of electrical engineering. Within this context, thread-to-core pinning and cache optimization are two important aspects which are investigated in more detail.

Keywords: Multi-core, pinning, cache optimization, performance optimization, sparse matrices.

1 Introduction

Sparse matrix operations can be among the most difficult applications in numerical simulation. Within this research field, an application from electrical engineering has been analyzed, utilizing various tools developed at Technische Universität München, namely within the Munich Multicore Initiative (MMI). An OpenMP [6] based parallel version of the code was investigated with regard to possible performance improvements using MMI's tools. Because of a non-disclosure agreement with the project partner, all simulations were conducted with synthetic model data. In recent years, a trend towards multi-core architectures can be observed, with four cores currently being standard for x86-based architectures. In order to fully utilize the potential of such machines, applications must be parallelized and analyzed with regard to runtime, speedup, and parallel efficiency.
With multi-core architectures becoming more and more complex, it is crucial to compare how well different hardware architectures suit the application under investigation. Thus, six contemporary x86-based architectures have been investigated for a sparse matrix solver. The remainder of this paper is organized as follows: Section 2 gives a brief introduction to sparse matrix problems, Section 3 gives an overview of the hardware used for benchmarking, and Section 4 presents and analyzes the runtimes, speedup, and parallel efficiency obtained on these systems. Section 5 analyzes the program's cache behavior, and Section 6 concludes and gives an outlook on future work.

V. Malyshkin (Ed.): PaCT 2009, LNCS 5698, pp. , © Springer-Verlag Berlin Heidelberg 2009

2 Sparse Matrix Operations

Making a sparse matrix application scale is a fairly difficult task. In the case discussed in this paper, fewer than one per cent of the matrix entries are nonzero, and they are distributed according to a pattern given by the electrical engineering application. This makes it difficult to access the entries during the calculations in a cache-friendly way. The first code for this particular application was developed in the mid-eighties [8], when the primary focus of the code structure was to minimize the matrix bandwidth and thus the number of floating point operations, i.e. to keep the number of fill-ins to a minimum [2]. Here, "fill-ins" are additional matrix entries created during Gaussian elimination or LDU factorization that were zero in the original matrix.

While it is still important to avoid unnecessary floating point operations, memory performance has improved at a much slower pace than CPU performance since the initial code was written for this application. This means that the number of floating point operations may no longer be the biggest bottleneck when it comes to application scalability. This application uses the Markowitz criterion [4] to minimize the number of fill-ins during the LDU factorization [8]. It performs a large number of sparse matrix operations that are totally independent of each other, i.e.
it is not the LDU operations that have been parallelized; it is the independent calculations that execute in parallel.

3 Hardware Environment

Comprehensive benchmark tests were carried out on six systems with different hardware architectures. The list below gives the nickname of each system, followed by the processor type and the amount of main memory.

Nehalem: 2 × Intel Xeon X5570 (Gainestown, quad-core (8 cores with HyperThreading enabled), 2.93 GHz, 8 MB shared L3 cache, 2 QPI), 12 GB DDR3 RAM
Dunnington: 4 × Intel Xeon X7460 (Dunnington, hexa-core, 2.66 GHz, 16 MB shared L3 cache), 1066 MHz FSB, 32 GB DDR2 RAM
Shanghai: 2 × AMD Opteron 2376 (Shanghai, quad-core, 2.41 GHz, 6 MB shared L3 cache, 2 HyperTransport 3.0), 32 GB DDR3 RAM
Barcelona: 2 × AMD Opteron 2352 (Barcelona, quad-core, 2.11 GHz, 2 MB shared L3 cache, 2 HyperTransport 3.0), 16 GB DDR2 RAM
X4600: Sun Fire X4600 M2, 8 × AMD Opteron 8218 (Santa Rosa, 2.60 GHz, dual-core, 1 MB L3 cache per core, 3 HyperTransport), 64 GB DDR2 RAM
Clovertown: 2 × Intel Xeon X5355 (Clovertown, quad-core, 2.66 GHz, 4 MB L2 cache shared across two cores), 1333 MHz FSB, 8 GB DDR2 RAM

Fig. 1. Front Side Bus based system
Fig. 2. NUMA like system

As can be seen from this list, the processor architectures comprise the latest Intel architecture, codenamed Nehalem, as well as the latest AMD processor, Shanghai, and some older processor types by both Intel and AMD. These systems represent two different architectures:

A front side bus (FSB) based system, as depicted in figure 1, represented by the Clovertown and Dunnington systems.
A NUMA-like system, as depicted in figure 2, represented by the Nehalem, Barcelona, Shanghai, and X4600 systems.

4 Measurements

As reported in previous research work carried out by the Munich Multicore Initiative at LRR-TUM, thread-to-core pinning has a non-negligible impact on parallel program performance on multi-core architectures. The optimal pinning can vary significantly, depending on the processor, the overall system architecture, and the cache hierarchy [7], [5]. With the autopin tool developed
by MMI, thread-to-core pinning on all available architectures was thoroughly tested for our sparse matrix operations. To pin a thread to a specific processor core, autopin makes use of the system call sched_setaffinity. This prevents threads from moving between cores, which would result in poor cache usage. In some cases it is desirable not to use all cores on a chip, e.g. in order to avoid pinning to cores which share the same cache or which are located on the same chip. For details on cache usage see section 5.

The first set of measurements focused on total program runtime. These were carried out on all six architectures with eight parallel threads. The pinning order used here was the optimal pinning order as determined by autopin, i.e. 0, 4, 1, 5, 2, 6, 3, 7, with the core numbers denoting the cores as depicted in figures 1 and 2. As can be seen from figure 3a, Intel's Nehalem architecture shows the best performance, followed by AMD Shanghai and Barcelona. Here, all available cores were utilized, i.e. one thread was pinned to each core. This also applies to the Clovertown system. For the Dunnington and X4600 systems (with 24 and 16 available cores, respectively), the optimal pinning for eight threads was determined with MMI's autopin tool. This turned out to be one core of each dual-core chip on the X4600 architecture, and two cores of each hexa-core chip on the Dunnington architecture, chosen such that they do not share a common L2 cache.

Fig. 3. Total time (a) and speedup (b). (a) Total time (in seconds) with one (gray) and eight (black) threads. (b) Speedup with eight threads.

Next, parallel efficiency was investigated by determining the speedup factors on all six architectures with the same pinning. For eight threads, most systems showed an average speedup of 5.0 to 5.5. The poorest parallel efficiency was measured on the Clovertown system with a speedup of only 2.6, whereas the Dunnington system performed best with a speedup of 6.3.
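
As a worked illustration of the figures above, the following minimal sketch computes parallel efficiency from the reported eight-thread speedups. It assumes the usual definitions (speedup is the single-thread runtime divided by the parallel runtime, efficiency is the speedup divided by the thread count), which the text does not spell out. The function names are hypothetical, and only the speedup values quoted above are used.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_1 / T_p (assumed standard definition)."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("runtimes must be positive")
    return t_serial / t_parallel


def parallel_efficiency(s, threads):
    """Parallel efficiency E = S / p for p threads (assumed standard definition)."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return s / threads


# Eight-thread speedups quoted in the text; no other measurements are assumed.
REPORTED_SPEEDUPS = {"Clovertown": 2.6, "Dunnington": 6.3}

if __name__ == "__main__":
    for system, s in REPORTED_SPEEDUPS.items():
        eff = parallel_efficiency(s, 8)
        print(f"{system}: speedup {s}, efficiency {eff:.1%}")
```

Under these definitions, a speedup of 6.3 on eight threads corresponds to an efficiency of 78.75 per cent, and a speedup of 2.6 to 32.5 per cent.
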
Dunnington's good performance is attributed to its large 16 MB last-level cache. It must be noted, however, that only 8 threads were run on a 4 × 6 core system, i.e. one third of the cores was utilized on each socket, allowing 8 MB of cache per core. When pinning one thread to each of the 24 cores, the efficiency drops to 16%, which is due to the high load on the memory link. Hence, autopin showed that the application scales best on the Dunnington system when utilizing two cores per socket. This also holds for the overall runtime: on the 24-core Dunnington system, optimal performance (i.e. runtime) was achieved with 8 threads at 2 threads per socket. Therefore, it is advisable to investigate the target architecture an application is supposed to run on with regard to optimal pinning before utilizing all available cores, as additional cores do not deliver additional performance in certain cases.

Fig. 4. Total runtime, parallel efficiency, and speedup for Shanghai

Figure 3 depicts the total runtime for one and eight threads as well as the speedup for eight threads on all six architectures. Figures 4 and 5 compare total runtime, speedup, and parallel efficiency for the Shanghai and Nehalem systems in more detail. In these figures, the respective optimal pinning, starting from 1 core, has been used. The pinning order is depicted in the diagrams. Thus, e.g. for the Shanghai system (see fig. 4), 1 thread was pinned to core #0, 2 threads were pinned to cores #0 and #4, 3 threads to cores #0, #4, and #5, etc. For these two latest AMD and Intel systems, parallel efficiency is above 90 per cent for two cores and above 80 per cent for up to four cores, which is due to the fact that threads are always pinned to cores in such a way that the available cache is used optimally.
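
The pinning mechanism described in this section can be sketched as follows. This is not autopin's implementation: it is a minimal illustration that uses Python's os.sched_setaffinity, the standard-library wrapper around the Linux system call named in the text. The helper names cores_for and pin_calling_thread are hypothetical, and the only data taken from the paper is the eight-thread pinning order 0, 4, 1, 5, 2, 6, 3, 7.

```python
import os

# Eight-thread pinning order reported in the text
# (core numbers as depicted in figures 1 and 2).
PINNING_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]


def cores_for(num_threads, order=PINNING_ORDER):
    """Return the cores used by the first num_threads threads of a pinning order."""
    if not 1 <= num_threads <= len(order):
        raise ValueError("num_threads must be between 1 and len(order)")
    return list(order[:num_threads])


def pin_calling_thread(core):
    """Try to restrict the calling thread to a single core.

    Returns True on success, and False if the platform lacks
    os.sched_setaffinity (it is Linux-specific) or the core is not
    in the set of cores currently available to the caller.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    if core not in os.sched_getaffinity(0):
        return False
    os.sched_setaffinity(0, {core})  # pid 0 refers to the caller
    return True


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "threads ->", cores_for(n))
```

In a threaded program, each worker would call pin_calling_thread with its own entry from cores_for(n) before starting its computation, so that it no longer migrates between cores.
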
Fig. 5. Total runtime, parallel efficiency, and speedup for Nehalem

5 Analysis of Cache Behavior

In addition to the measurements presented in the previous section, the application's cache behavior was analyzed in order to spot further bottlenecks and obtain possible performance improvements. To simulate the application's cache usage, the tools Callgrind/KCachegrind [10],[9], which were developed in the DIME [1] project and extended by MMI, were used. Callgrind is part of the open-source project Valgrind [3], which consists of tools for correctness checking and profiling built on an infrastructure for dynamic runtime instrumentation.

Three test cases were examined:

A single thread running on a core with 4 MB L2 cache.
Four threads running on a quad-core processor with 4 MB shared L2 cache (as on the systems investigated in previous sections).
Four threads running on a quad-core processor with 16 MB shared L2 cache.

The main objective of these test cases was to determine the data sharing characteristics of the given OpenMP parallelization. The test cases realistically approximate the behavior on multi-core architectures with shared last-level caches. The main difference between reality and the simulations carried out with MMI's cache analysis tools is that a shared last-level cache is normally realized as a third cache level, but this has no impact on the simulation results.

The simulation showed almost exactly the same number of instruction fetches and data references in all test cases, when aggregated over all threads. This comes as no surprise as the same input data was used in all cases, but it also proves the comparability of the L2 cache misses, which are displayed in table 1.

Table 1. Simulated L2 cache misses

            1 thread, 4 MB L2    4 threads, 4 MB L2    4 threads, 16 MB L2
Total       987,490,834          1,380,396,029         57,413,638
airflowb    174,203,             ,674,                 ,145
lqdflowb    184,610,             ,806,                 ,186

Taking a look at the first row of the table above, which denotes the cache misses of the entire program, it can be noticed that there is a major drop when moving to the larger cache size (i.e. 16 MB). This means that this amount of cache is capable of holding the input data of the examined bus model. This also correlates with the good performance numbers on the Dunnington system. When moving from one to four threads sharing the smaller 4 MB cache, the increase in cache misses is comparatively moderate. This is because a considerable amount of data can be shared across the threads. The data also shows that the parallelization does not increase the total memory space requirement compared to the sequential version. Clearly, shared caches are beneficial for the given application.

Regarding potential performance bottlenecks in the code, the functions airflowb and lqdflowb were identified as the ones which caused the most cache misses. In test cases one and two (small L2 cache), these two functions account for nearly a quarter (22% to 24%) of all cache misses. In the simulations carried out for the 16 MB cache, however, they only account for 5.6% of the total misses.

6 Conclusions and Future Work

Six contemporary multi-core architectures were compared with a parallel reference application for sparse matrix solvers. The application was parallelized with a shared memory model under OpenMP. Thread-to-core pinning and cache optimization were investigated with regard to the application under consideration.
The investigations showed that, depending on the processor architecture as well as on the memory interconnect, it is not always advisable to utilize all available cores in a system. With regard to cache optimization, it has been shown that potential bottlenecks can be easily detected with MMI's simulation tools.

References

1. DiME DFG Project, Web Page
2. Cuthill, E., McKee, J.: Reducing the bandwidth of sparse symmetric matrices. In: ACM Annual Conference/Annual Meeting, Proceedings of the 24th national conference, pp. (1969)
3. The Valgrind Developers: Valgrind Web Page
4. Duff, I.S., Erisman, A.M., Reid, J.K.: Direct Methods for Sparse Matrices. Oxford University Press, Oxford (1986)
5. Klug, T., Ott, M., Weidendorfer, J., Trinitis, C.: autopin - automated optimization of thread-to-core pinning on multicore system 3 (2008)
6. OpenMP.org: The OpenMP API specification for parallel programming
7. Ott, M., Klug, T., Weidendorfer, J., Trinitis, C.: autopin - automated optimization of thread-to-core pinning on multicore systems. In: First Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG). Workshop proceedings, 1st Multiprog workshop, Gothenburg, Sweden (January 2008)
8. Tinney, W.F., Brandwajn, V., Chan, S.M.: Sparse vector methods. IEEE Transactions on Power Apparatus and Systems PAS-104(2) (February 1985)
9. Weidendorfer, J.: KCachegrind Web Page
10. Weidendorfer, J., Kowarschik, M., Trinitis, C.: A tool suite for simulation based analysis of memory access behavior. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS LNCS, vol. 3038, pp. Springer, Heidelberg (2004)
First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May
More informationThe Future of Computing: AMD Vision
The Future of Computing: AMD Vision Tommy Toles AMD Business Development Executive thomas.toles@amd.com 512-327-5389 Agenda Celebrating Momentum Years of Leadership & Innovation Current Opportunity To
More informationParallelizing Inline Data Reduction Operations for Primary Storage Systems
Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr
More informationCache Performance Analysis with Callgrind and KCachegrind
Cache Performance Analysis with Callgrind and KCachegrind Parallel Performance Analysis Course, 31 October, 2010 King Abdullah University of Science and Technology, Saudi Arabia Josef Weidendorfer Computer
More informationLecture 13: March 25
CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging
More informationShared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP
Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor
More informationMiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces
MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces Hye-Churn Jang Hyun-Wook (Jin) Jin Department of Computer Science and Engineering Konkuk University Seoul, Korea {comfact,
More informationCache Performance Analysis with Callgrind and KCachegrind
Cache Performance Analysis with Callgrind and KCachegrind 21 th VI-HPS Tuning Workshop April 2016, Garching Josef Weidendorfer Computer Architecture I-10, Department of Informatics Technische Universität
More informationThread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications
Thread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark Motivation Hardware Trends Put more cores
More informationParallel Performance Studies for a Clustering Algorithm
Parallel Performance Studies for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland,
More informationPerformance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA
Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to
More informationDetection and Analysis of Iterative Behavior in Parallel Applications
Detection and Analysis of Iterative Behavior in Parallel Applications Karl Fürlinger and Shirley Moore Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University
More informationPerformance of Variant Memory Configurations for Cray XT Systems
Performance of Variant Memory Configurations for Cray XT Systems presented by Wayne Joubert Motivation Design trends are leading to non-power of 2 core counts for multicore processors, due to layout constraints
More information2008 International ANSYS Conference
28 International ANSYS Conference Maximizing Performance for Large Scale Analysis on Multi-core Processor Systems Don Mize Technical Consultant Hewlett Packard 28 ANSYS, Inc. All rights reserved. 1 ANSYS,
More informationI/O Channels. RAM size. Chipsets. Cluster Computing Paul A. Farrell 9/8/2011. Memory (RAM) Dept of Computer Science Kent State University 1
Memory (RAM) Standard Industry Memory Module (SIMM) RDRAM and SDRAM Access to RAM is extremely slow compared to the speed of the processor Memory busses (front side busses FSB) run at 100MHz to 800MHz
More informationHeuristics Core Mapping in On-Chip Networks for Parallel Stream-Based Applications
Heuristics Core Mapping in On-Chip Networks for Parallel Stream-Based Applications Piotr Dziurzanski and Tomasz Maka Szczecin University of Technology, ul. Zolnierska 49, 71-210 Szczecin, Poland {pdziurzanski,tmaka}@wi.ps.pl
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationPerformance Benefits of OpenVMS V8.4 Running on BL8x0c i2 Server Blades
Performance Benefits of OpenVMS V8.4 Running on BL8xc i2 Server Blades A detailed review of performance features and test results for OpenVMS V8.4. March 211 211, TechWise Research. All Rights Reserved
More informationPerformance of the AMD Opteron LS21 for IBM BladeCenter
August 26 Performance Analysis Performance of the AMD Opteron LS21 for IBM BladeCenter Douglas M. Pase and Matthew A. Eckl IBM Systems and Technology Group Page 2 Abstract In this paper we examine the
More information( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture
( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More informationScheduling the Intel Core i7
Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne
More informationCache Simulation Based on Runtime Instrumentation for OpenMP Applications
Cache Simulation Based on Runtime Instrumentation for OpenMP Applications Jie Tao and Josef Weidendorfer Institut für Rechnerentwurf und Fehlertoleranz Lehrstuhl für Rechnertechnik und Rechnerorganisation
More informationDeveloping a Data Driven System for Computational Neuroscience
Developing a Data Driven System for Computational Neuroscience Ross Snider and Yongming Zhu Montana State University, Bozeman MT 59717, USA Abstract. A data driven system implies the need to integrate
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationAnalysis and Optimization of the Memory Access Behavior of Applications
Analysis and Optimization of the Memory Access Behavior of Applications October 8th, 2013 David Büttner (Josef Weidendorfer) Chair for Computer Architecture (LRR) TUM, Munich, Germany My Background Chair
More informationUNDERSTANDING THE IMPACT OF MULTI-CORE ARCHITECTURE IN CLUSTER COMPUTING: A CASE STUDY WITH INTEL DUAL-CORE SYSTEM
UNDERSTANDING THE IMPACT OF MULTI-CORE ARCHITECTURE IN CLUSTER COMPUTING: A CASE STUDY WITH INTEL DUAL-CORE SYSTEM Sweety Sen, Sonali Samanta B.Tech, Information Technology, Dronacharya College of Engineering,
More informationAMD HyperTransport Technology-Based System Architecture
AMD Technology-Based ADVANCED MICRO DEVICES, INC. One AMD Place Sunnyvale, CA 94088 Page 1 AMD Technology-Based May 2002 Table of Contents Introduction... 3 AMD-8000 Series of Chipset Components Product
More informationOP2 FOR MANY-CORE ARCHITECTURES
OP2 FOR MANY-CORE ARCHITECTURES G.R. Mudalige, M.B. Giles, Oxford e-research Centre, University of Oxford gihan.mudalige@oerc.ox.ac.uk 27 th Jan 2012 1 AGENDA OP2 Current Progress Future work for OP2 EPSRC
More informationTechnische Universitat Munchen. Institut fur Informatik. D Munchen.
Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl
More informationParallel Implementation of the NIST Statistical Test Suite
Parallel Implementation of the NIST Statistical Test Suite Alin Suciu, Iszabela Nagy, Kinga Marton, Ioana Pinca Computer Science Department Technical University of Cluj-Napoca Cluj-Napoca, Romania Alin.Suciu@cs.utcluj.ro,
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design Edited by Mansour Al Zuair 1 Introduction Programmers want unlimited amounts of memory with low latency Fast
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationdoctor augmented assembly code x86 assembler link Linker link Executable
A Cache Simulation Environment for OpenMP Jie Tao 1, Thomas Brandes 2,andMichael Gerndt 1 1 Lehrstuhl für Rechnertechnik und Rechnerorganisation 2 Fraunhofer-Institute for Algorithms Institut für Informatik,
More informationWHITE PAPER FUJITSU PRIMERGY SERVERS MEMORY PERFORMANCE OF XEON 7500 (NEHALEM-EX) BASED SYSTEMS
WHITE PAPER MEMORY PERFORMANCE OF XEON 7500 (NEHALEM-EX) BASED SYSTEMS WHITE PAPER FUJITSU PRIMERGY SERVERS MEMORY PERFORMANCE OF XEON 7500 (NEHALEM-EX) BASED SYSTEMS An extraordinary, architecture-related
More informationA Comprehensive Study on the Performance of Implicit LS-DYNA
12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four
More informationDell PowerEdge 11 th Generation Servers: R810, R910, and M910 Memory Guidance
Dell PowerEdge 11 th Generation Servers: R810, R910, and M910 Memory Guidance A Dell Technical White Paper Dell Product Group Armando Acosta and James Pledge THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More information