Benchmark runs of pcmalib on Nehalem and Shanghai nodes
MOSAIC Group, Institute of Theoretical Computer Science, Department of Computer Science

Benchmark runs of pcmalib on Nehalem and Shanghai nodes

Christian Lorenz Müller, April 9

Address: Institute for Theoretical Computer Science and Swiss Institute of Bioinformatics, ETH Zurich, Zurich, Switzerland. christian.mueller@inf.ethz.ch

1 Introduction

This document summarizes parallel benchmark results of pcmalib on different processors. pcmalib implements the Evolution Strategy with Covariance Matrix Adaptation (CMA-ES) and its particle swarm extension (PS-CMA-ES). The next section describes the computer-experimental set-up and its technical realization on the different platforms. Section 3 describes in detail the results obtained on the MOSAIC cluster, together with the best hardware/software combination of the newly conducted tests. Section 4 presents the wall-clock times for all conducted tests, and the last section summarizes and discusses the results.

2 Experimental Set-up

We assess the parallel efficiency of our implementations of CMA-ES and PS-CMA-ES on a distributed-memory computer cluster on the constrained random fitness landscape f_rand(x) = Y, where x is defined in a bounded subset [,]^n of R^n. For any x, Y is drawn from the uniform distribution U(,). Each algorithm evaluates the fitness function 5 times (corresponding to drawing 5 uniformly distributed random numbers) on p processor cores (for the newly tested processors we only go up to the cores of a single node). The number of CMA-ES instances, i.e. the swarm size in PS-CMA-ES, is always chosen equal to p in order to avoid cache and memory congestion effects. Distributing a problem of fixed size onto an increasing number of processors measures the strong scaling of the algorithms: the workload per processor decreases while the communication overhead increases. The random landscape f_rand ensures several properties that are indispensable for an unbiased assessment of the parallel scaling.
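For illustration, the random landscape described above can be sketched in a few lines (a hedged sketch, not pcmalib's Fortran implementation; the interval U(0, 1) and the name f_rand are assumptions made for the example):

```python
import random

def f_rand(x):
    """Random fitness landscape: for any query point x in the bounded
    domain, return a freshly drawn uniform random value. The cost of one
    evaluation is independent of the dimension of x and of the path the
    optimizer has taken."""
    del x                              # the point itself never enters the value
    return random.uniform(0.0, 1.0)    # U(0, 1) assumed for illustration

# Every evaluation reduces to drawing one pseudo-random number:
y = f_rand([0.2, 0.7, 0.5])            # a 3-dimensional query point
assert 0.0 <= y <= 1.0
```

Because the returned value ignores x, every CMA-ES instance sees statistically the same landscape, and the per-evaluation cost is a fixed constant, which is exactly what an unbiased scaling benchmark needs.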
First, the computational cost of evaluating the objective function is independent of the search dimension and of the specific optimization path. Second, the random landscape guarantees that all CMA-ES instances experience the same search space. We perform three benchmarks with varying values of the communication interval I_c in order to disentangle the influence of the covariance matrix eigendecomposition and of the MPI communication in PS-CMA-ES. The first set-up considers the standard parallel CMA-ES without any swarm communication. The second benchmark evaluates the performance of the standard PS-CMA-ES with a fixed I_c. Since I_c is given in units of generations, and increasing the swarm size S also increases the number of function evaluations per generation, the total number of MPI communications performed during the fixed 5 function evaluations decreases. Therefore, the third set-up considers PS-CMA-ES with a constant number of MPI communication steps, independent of the swarm size S. This is achieved by setting I_c inversely proportional to S. All three benchmarks are conducted in n = ,3,5, dimensions. Each benchmark is repeated R times. For each repetition r = 1,...,R, we measure the elapsed wall-clock time
t_{i,r} on each processor core i = 1,...,p. The overall run time t(p) of the algorithm on p processors is given by the maximum time over all processes, averaged over the R independent runs:

t(p) = mean_{r=1,...,R} max_{i=1,...,p} t_{i,r}.   (1)

From this, the parallel speedup s and the parallel efficiency e are defined as:

s(p) = t(1) / t(p),   e(p) = s(p) / p.   (2)

2.1 Reference set-up of pcmalib on the MOSAIC cluster

The Fortran library is compiled with the Intel Fortran compiler at optimization level O3 and linked against OpenMPI. The tests are performed on a Gentoo Linux cluster. Each node contains Intel Xeon quad-core processors, with GB of RAM per core. The nodes are connected by a dedicated Gigabit Ethernet network that is entirely reserved for MPI communication (there is a second, identical network for system communication). TORQUE and Maui are used as resource manager and queuing system, respectively. In order to assess the influence of intra- vs. inter-node MPI communication, the scheduler is instructed to assign a fixed number of MPI processes per node.

2.2 Set-up on the Nehalem node

Enda O'Brien from DALCO used a Nehalem node (Xeon 557, .93 GHz) running SUSE Linux. Three different software combinations have been tested in order to disentangle the influence of the processor and of the numerical libraries on the speed-up:

1. The first and fastest software combination consists of the Intel compilers, the Intel MPI, and the LAPACK that is included with the MKL libraries.

2. The second software combination consists of the Intel compilers, the Intel MPI, and the LAPACK that is included with the AMD Core Math Library (ACML).

3. The third combination uses the Pathscale compilers, ACML, and OpenMPI (also built with Pathscale).

2.3 Set-up on the Shanghai node

Enda O'Brien also tested an AMD 37 ("Shanghai") node with quad-core processors.
Again, the same three software combinations have been tested on this platform:

1. The first and again fastest software combination consists of the Intel compilers, the Intel MPI, and the LAPACK that is included with the MKL libraries.

2. The second software combination consists of the Intel compilers, the Intel MPI, and the LAPACK that is included with the AMD Core Math Library (ACML).

3. The third (and, for AMD, natural) combination uses the Pathscale compilers, ACML, and OpenMPI (also built with Pathscale).

3 Results on the MOSAIC cluster and the Nehalem node with Intel software

We first present and analyze the results obtained on the MOSAIC cluster (identical to the ones in Mueller9b.pdf). The second subsection contains the detailed results of the Intel compiler/MPI/MKL test on the Nehalem node.
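Before turning to the results, the run-time, speedup, and efficiency metrics from the experimental set-up can be sketched as follows (an illustrative Python sketch with made-up timings; the helper names are ours and are not part of pcmalib):

```python
# Illustrative computation of t(p), s(p) = t(1)/t(p), e(p) = s(p)/p
# from per-core wall-clock times. All timings below are invented.

def run_time(times_per_run):
    """t(p): mean over the R repetitions of the maximum time over all
    p processes -- the slowest process determines each run's time."""
    maxima = [max(run) for run in times_per_run]   # slowest process per run
    return sum(maxima) / len(maxima)               # average over R runs

def speedup(t1, tp):
    """s(p) = t(1) / t(p)."""
    return t1 / tp

def efficiency(s, p):
    """e(p) = s(p) / p."""
    return s / p

# R = 2 repetitions on p = 1 core, then on p = 4 cores.
t_serial = run_time([[100.0], [104.0]])            # -> 102.0 s
t_par = run_time([[26.0, 27.0, 25.5, 25.0],
                  [25.0, 24.0, 26.5, 25.5]])       # -> 26.75 s
s = speedup(t_serial, t_par)                       # ~3.81
e = efficiency(s, 4)                               # ~0.95: near-linear scaling
print(round(s, 2), round(e, 2))
```

Taking the maximum over processes before averaging is the conservative choice: it charges each run with its slowest process, which is what the user actually waits for.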
3.1 Results on the MOSAIC cluster

The measured maximum wall-clock times for all 3 benchmarks on the MOSAIC cluster are reported in Fig. 1, the speedups in Fig. 2, and the parallel efficiencies in Fig. 3.

Figure 1: Overall run time t(p) in seconds for the parallel CMA-ES, and for PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3, dimensions. The number of processor cores p is varied. Each point is averaged from R = 5 runs. The standard deviations are close to zero (data not shown).
Figure 2: Parallel speedup s(p) of the parallel CMA-ES, and of PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3, dimensions. The number of processor cores p is varied. Each point is averaged from R = 5 runs. The standard deviations are close to zero (data not shown).
Figure 3: Parallel efficiency e(p) of the parallel CMA-ES, and of PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3, dimensions. The number of processor cores p is varied. Each point is averaged from R = 5 runs. The standard deviations are close to zero (data not shown).

In n = dimensions, there are no noticeable differences between the three test set-ups. Up to a full single node, the wall-clock time decreases from .5 s to below s. The speedup increases throughout, while the efficiency decreases to .7. When using two nodes, the wall-clock time increases again, and speedup and efficiency drop considerably. This is expected, as the network latency becomes the limiting factor for such a small test problem. The situation changes in higher dimensions. For n = 3, the wall-clock time of parallel CMA-ES decreases from s on a single core to below s on the maximum number of cores. The two PS-CMA-ES tests need around 7 s on a single core due to the additional construction of the rotation matrix. The PS-CMA-ES with a constant number of MPI communications shows a similar scaling as the parallel CMA-ES, with an offset of about 5 seconds, corresponding to the constant communication overhead. The PS-CMA-ES with a decreasing number of communications approaches the behavior of the standard parallel CMA-ES since, with increasing p, the MPI communication overhead and the rotations become negligible compared to the computational cost of CMA-ES. This is also reflected in the parallel speedup and efficiency. The standard PS-CMA-ES achieves the best efficiency (due to its higher computational cost on a single core), closely followed by the parallel CMA-ES. The same qualitative behaviour is observed in n = 5
(figures not shown), but, due to the higher computational cost, the parallel efficiency increases further. The computational costs of the basic CMA-ES operations and of the matrix rotations now dominate, and the communication overhead becomes less apparent. On a single core, parallel CMA-ES needs s and the two PS-CMA-ES variants around s. While the wall-clock time of the standard PS-CMA-ES rapidly approaches that of CMA-ES with increasing p, the PS-CMA-ES with a constant number of MPI communications shows an offset of around 5 s due to the communication overhead and the matrix rotation. The speedups of the parallel CMA-ES and of the standard PS-CMA-ES at the largest core count correspond to parallel efficiencies of 5 and .75, respectively. For n = , the parallel scaling further improves. The efficiency of the standard CMA-ES is .7 on the maximum number of cores, while the standard PS-CMA-ES achieves a super-linear efficiency (due to the decreasing number of MPI communications).

3.2 Results of the Intel compiler/Intel MKL/Intel MPI test on the Nehalem node

The measured maximum wall-clock times for all 3 benchmarks on the Nehalem node are reported in Fig. 4, the speedups in Fig. 5, and the parallel efficiencies in Fig. 6.
Figure 4: Overall run time t(p) in seconds for the parallel CMA-ES, and for PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3, dimensions. The number of processor cores p is varied. Each point is averaged over R runs. The standard deviations are close to zero (data not shown).
Figure 5: Parallel speedup s(p) of the parallel CMA-ES, and of PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3, dimensions. The number of processor cores p is varied. Each point is averaged over R runs. The standard deviations are close to zero (data not shown).
Figure 6: Parallel efficiency e(p) of the parallel CMA-ES, and of PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3, dimensions. The number of processor cores p is varied. Each point is averaged over R runs. The standard deviations are close to zero (data not shown).

In n = dimensions, there are again no significant differences between the three test set-ups. Up to a full single node, the wall-clock time decreases from . s to below . s. The speedup increases throughout, while the efficiency decreases to .7. For n = 3, the wall-clock time of parallel CMA-ES decreases from 7. s on a single core to below . s on the maximum number of cores. The two PS-CMA-ES tests need around 9.5 s on a single core due to the additional construction of the rotation matrix. The PS-CMA-ES with a constant number of MPI communications shows a similar scaling as the parallel CMA-ES, with a constant offset corresponding to the communication overhead. The PS-CMA-ES with a decreasing number of communications approaches the behavior of the standard parallel CMA-ES since, with increasing p, the MPI communication overhead and the rotations become negligible compared to the computational cost of CMA-ES. This is also reflected in the parallel speedup and efficiency. The standard PS-CMA-ES achieves the best efficiency (due to its higher computational cost on a single core), closely followed by the parallel CMA-ES. The same qualitative behaviour is observed in n = 5 (figures not shown), but, due to the higher computational cost, the parallel efficiency increases further. The computational costs of the basic CMA-ES operations and of the matrix rotations now dominate, and the communication overhead becomes less apparent. On a single core, parallel CMA-ES needs s and the two PS-CMA-ES variants around 35 s. While the wall-clock time of the standard PS-CMA-ES rapidly approaches that of CMA-ES with increasing p, the PS-CMA-ES with a constant number of MPI communications shows an offset of around s due to the communication overhead and the matrix rotation. The speedups of the parallel CMA-ES and of the standard PS-CMA-ES at the largest core count are around 7.5, corresponding to parallel efficiencies of about .9. For n = , the parallel scaling further improves. The efficiency of both the standard parallel CMA-ES and PS-CMA-ES is around .97.

4 Comparison of the performance on the Nehalem and the Shanghai node for different software libraries

Due to limited time on my side, I decided to only show the wall-clock times of the conducted test runs. Figure 7 summarizes the three different software combinations on the Nehalem node, Figure 8 the same combinations on the Shanghai node. Due to the automatic generation of the plots, the y-axes representing the wall-clock times are scaled according to the maximum wall-clock time occurring in each test run. This makes a visual comparison between test runs a bit tricky. I am sorry for the inconvenience, but this is all I have for the moment.

4.1 The Nehalem node

The main observation for the Nehalem node is that the combination of Intel compiler/MKL/Intel MPI indeed gives the overall best performance in all dimensions and for all settings of parallel (PS-)CMA-ES on a single core, compared to the other software combinations. The speed-up over them is between % and 3%, depending on the number of processors used, the algorithmic set-up, and the dimension. For n = , for example, the wall-clock time is below .5 s, while the ACML swap and the Pathscale/ACML/Open MPI combinations are around 3.5 s and 3 s, respectively. For higher dimensions, the Intel compiler/ACML/Intel MPI combination performs slightly better than Pathscale/ACML/Open MPI, indicating that the compiler/MPI combinations are less influential than the use of the optimized LAPACK routines in ACML/MKL.
4.2 The Shanghai node

The picture changes when evaluating the software combinations on the Shanghai node. Again, the Intel compiler/MKL/Intel MPI combination outperforms the other two, especially for n = , 3. For n = 5, , however, the Intel compiler/MKL/Intel MPI and Intel compiler/ACML/Intel MPI combinations give almost identical results. This indicates that the superiority of the MKL LAPACK routines vanishes on the AMD cores. The Pathscale/ACML/Open MPI combination is inferior to the first two combinations in all dimensions.

4.3 Nehalem and Shanghai in comparison

The Intel libraries in combination with the Intel cores give the overall best performance. However, the AMD cores give a better average performance independent of the libraries used. All three AMD set-ups are, for example, faster than software combinations 2 and 3 on the Nehalem node for n = 5, . However, as Enda O'Brien already pointed out in our correspondence, the LAPACK libraries should be considered a real part of the system, and as such, the Intel combinations are undoubtedly the best ones for pcmalib.
Figure 7: Overall run time t(p) in seconds for the parallel CMA-ES, and for PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3,5, dimensions on the Nehalem node. The number of processor cores p is varied. Each point is averaged over R runs. The standard deviations are close to zero (data not shown). The first column depicts the results for the combination Intel compiler/MKL/Intel MPI, the second for Intel compiler/ACML/Intel MPI, and the third for Pathscale/ACML/Open MPI.
Figure 8: Overall run time t(p) in seconds for the parallel CMA-ES, and for PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3,5, dimensions on the Shanghai node. The number of processor cores p is varied. Each point is averaged over R runs. The standard deviations are close to zero (data not shown). The first column depicts the results for the combination Intel compiler/MKL/Intel MPI, the second for Intel compiler/ACML/Intel MPI, and the third for Pathscale/ACML/Open MPI.
5 Concluding remarks

In this document I have tried to summarize the results that Enda O'Brien kindly generated with our pcmalib test bed. My general conclusion is that, indeed, the Intel cores together with the Intel software libraries achieve the best single-core performance. As pcmalib heavily relies on LAPACK routines for eigendecomposition, the results demonstrate the excellent performance of the Intel MKL in combination with Intel cores. Compared to our previous results on the MOSAIC cluster, the Nehalem node almost halves the wall-clock time for the n = set-up. Compared to our published results, a significant speed-up is also gained with the Shanghai node. When running parallel (PS-)CMA-ES on several cores, the differences between the hardware/software combinations become less apparent, except in the case n = . Although we believe that the given test bed provides a very good set-up to evaluate the speed-up and scalability of pcmalib on different processor cores, it is not clear how well the results generalize to real-world optimization tasks, e.g., when the fitness/objective function involves more complex numerical operations than generating pseudo-random numbers.
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture The Computer Revolution Progress in computer technology Underpinned by Moore s Law Makes novel applications
More informationImproving Virtual Machine Scheduling in NUMA Multicore Systems
Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore
More informationSparse Training Data Tutorial of Parameter Server
Carnegie Mellon University Sparse Training Data Tutorial of Parameter Server Mu Li! CSD@CMU & IDL@Baidu! muli@cs.cmu.edu High-dimensional data are sparse Why high dimension?! make the classifier s job
More informationSolving Large Complex Problems. Efficient and Smart Solutions for Large Models
Solving Large Complex Problems Efficient and Smart Solutions for Large Models 1 ANSYS Structural Mechanics Solutions offers several techniques 2 Current trends in simulation show an increased need for
More informationAerodynamics of a hi-performance vehicle: a parallel computing application inside the Hi-ZEV project
Workshop HPC enabling of OpenFOAM for CFD applications Aerodynamics of a hi-performance vehicle: a parallel computing application inside the Hi-ZEV project A. De Maio (1), V. Krastev (2), P. Lanucara (3),
More informationYasuo Okabe. Hitoshi Murai. 1. Introduction. 2. Evaluation. Elapsed Time (sec) Number of Processors
Performance Evaluation of Large-scale Parallel Simulation Codes and Designing New Language Features on the (High Performance Fortran) Data-Parallel Programming Environment Project Representative Yasuo
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationUsing Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System
Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System Overview This guide is intended to help developers use the latest version of Intel Math Kernel Library (Intel
More informationNUMA-aware Multicore Matrix Multiplication
Parallel Processing Letters c World Scientific Publishing Company NUMA-aware Multicore Matrix Multiplication WAIL Y. ALKOWAILEET Department of Computer Science (Systems), University of California, Irvine,
More informationCray events. ! Cray User Group (CUG): ! Cray Technical Workshop Europe:
Cray events! Cray User Group (CUG):! When: May 16-19, 2005! Where: Albuquerque, New Mexico - USA! Registration: reserved to CUG members! Web site: http://www.cug.org! Cray Technical Workshop Europe:! When:
More informationParallel calculation of LS factor for regional scale soil erosion assessment
Parallel calculation of LS factor for regional scale soil erosion assessment Kai Liu 1, Guoan Tang 2 1 Key Laboratory of Virtual Geographic Environment (Nanjing Normal University), Ministry of Education,
More informationDeep learning prevalence. first neuroscience department. Spiking Neuron Operant conditioning First 1 Billion transistor processor
WELCOME TO Operant conditioning 1938 Spiking Neuron 1952 first neuroscience department 1964 Deep learning prevalence mid 2000s The Turing Machine 1936 Transistor 1947 First computer science department
More informationSlurm Configuration Impact on Benchmarking
Slurm Configuration Impact on Benchmarking José A. Moríñigo, Manuel Rodríguez-Pascual, Rafael Mayo-García CIEMAT - Dept. Technology Avda. Complutense 40, Madrid 28040, SPAIN Slurm User Group Meeting 16
More informationPerformance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture. Alexander Berreth. Markus Bühler, Benedikt Anlauf
PADC Anual Workshop 20 Performance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture Alexander Berreth RECOM Services GmbH, Stuttgart Markus Bühler, Benedikt Anlauf IBM Deutschland
More informationDetection and Analysis of Iterative Behavior in Parallel Applications
Detection and Analysis of Iterative Behavior in Parallel Applications Karl Fürlinger and Shirley Moore Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University
More informationInfoBrief. Dell 2-Node Cluster Achieves Unprecedented Result with Three-tier SAP SD Parallel Standard Application Benchmark on Linux
InfoBrief Dell 2-Node Cluster Achieves Unprecedented Result with Three-tier SAP SD Parallel Standard Application Benchmark on Linux Leveraging Oracle 9i Real Application Clusters (RAC) Technology and Red
More informationDesigning Shared Address Space MPI libraries in the Many-core Era
Designing Shared Address Space MPI libraries in the Many-core Era Jahanzeb Hashmi hashmi.29@osu.edu (NBCL) The Ohio State University Outline Introduction and Motivation Background Shared-memory Communication
More informationLoad Balancing for Parallel Multi-core Machines with Non-Uniform Communication Costs
Load Balancing for Parallel Multi-core Machines with Non-Uniform Communication Costs Laércio Lima Pilla llpilla@inf.ufrgs.br LIG Laboratory INRIA Grenoble University Grenoble, France Institute of Informatics
More informationCIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )
Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL
More informationAdaptive Matrix Transpose Algorithms for Distributed Multicore Processors
Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures
More informationSupplementary Material for The Generalized PatchMatch Correspondence Algorithm
Supplementary Material for The Generalized PatchMatch Correspondence Algorithm Connelly Barnes 1, Eli Shechtman 2, Dan B Goldman 2, Adam Finkelstein 1 1 Princeton University, 2 Adobe Systems 1 Overview
More informationLS-DYNA Productivity and Power-aware Simulations in Cluster Environments
LS-DYNA Productivity and Power-aware Simulations in Cluster Environments Gilad Shainer 1, Tong Liu 1, Jacob Liberman 2, Jeff Layton 2 Onur Celebioglu 2, Scot A. Schultz 3, Joshua Mora 3, David Cownie 3,
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationA Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering
A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering HPRCTA 2010 Stefan Craciun Dr. Alan D. George Dr. Herman Lam Dr. Jose C. Principe November 14, 2010 NSF CHREC Center ECE Department,
More informationAdvanced School in High Performance and GRID Computing November Mathematical Libraries. Part I
1967-10 Advanced School in High Performance and GRID Computing 3-14 November 2008 Mathematical Libraries. Part I KOHLMEYER Axel University of Pennsylvania Department of Chemistry 231 South 34th Street
More informationExperiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor
Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationABySS Performance Benchmark and Profiling. May 2010
ABySS Performance Benchmark and Profiling May 2010 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource - HPC
More informationParallel Poisson Solver in Fortran
Parallel Poisson Solver in Fortran Nilas Mandrup Hansen, Ask Hjorth Larsen January 19, 1 1 Introduction In this assignment the D Poisson problem (Eq.1) is to be solved in either C/C++ or FORTRAN, first
More informationChapter 7. Multicores, Multiprocessors, and Clusters. Goal: connecting multiple computers to get higher performance
Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)
More informationParallel Implementation of the NIST Statistical Test Suite
Parallel Implementation of the NIST Statistical Test Suite Alin Suciu, Iszabela Nagy, Kinga Marton, Ioana Pinca Computer Science Department Technical University of Cluj-Napoca Cluj-Napoca, Romania Alin.Suciu@cs.utcluj.ro,
More informationMATLAB Parallel Computing Toolbox Benchmark for an Embarrassingly Parallel Application
MATLAB Parallel Computing Toolbox Benchmark for an Embarrassingly Parallel Application By Nils Oberg, Benjamin Ruddell, Marcelo H. García, and Praveen Kumar Department of Civil and Environmental Engineering
More informationEfficient Clustering and Scheduling for Task-Graph based Parallelization
Center for Information Services and High Performance Computing TU Dresden Efficient Clustering and Scheduling for Task-Graph based Parallelization Marc Hartung 02. February 2015 E-Mail: marc.hartung@tu-dresden.de
More informationPARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures
PARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures Solovev S. A, Pudov S.G sergey.a.solovev@intel.com, sergey.g.pudov@intel.com Intel Xeon, Intel Core 2 Duo are trademarks of
More informationHPCC Random Access Benchmark Excels on Data Vortex
HPCC Random Access Benchmark Excels on Data Vortex Version 1.1 * June 7 2016 Abstract The Random Access 1 benchmark, as defined by the High Performance Computing Challenge (HPCC), tests how frequently
More informationScientific Programming in C XIV. Parallel programming
Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence
More informationImproved Event Generation at NLO and NNLO. or Extending MCFM to include NNLO processes
Improved Event Generation at NLO and NNLO or Extending MCFM to include NNLO processes W. Giele, RadCor 2015 NNLO in MCFM: Jettiness approach: Using already well tested NLO MCFM as the double real and virtual-real
More informationDELIVERABLE D5.5 Report on ICARUS visualization cluster installation. John BIDDISCOMBE (CSCS) Jerome SOUMAGNE (CSCS)
DELIVERABLE D5.5 Report on ICARUS visualization cluster installation John BIDDISCOMBE (CSCS) Jerome SOUMAGNE (CSCS) 02 May 2011 NextMuSE 2 Next generation Multi-mechanics Simulation Environment Cluster
More informationOptimizing the operations with sparse matrices on Intel architecture
Optimizing the operations with sparse matrices on Intel architecture Gladkikh V. S. victor.s.gladkikh@intel.com Intel Xeon, Intel Itanium are trademarks of Intel Corporation in the U.S. and other countries.
More informationInvestigation of Intel MIC for implementation of Fast Fourier Transform
Investigation of Intel MIC for implementation of Fast Fourier Transform Soren Goyal Department of Physics IIT Kanpur e-mail address: soren@iitk.ac.in The objective of the project was to run the code for
More informationIntel Cluster Toolkit Compiler Edition 3.2 for Linux* or Windows HPC Server 2008*
Intel Cluster Toolkit Compiler Edition. for Linux* or Windows HPC Server 8* Product Overview High-performance scaling to thousands of processors. Performance leadership Intel software development products
More informationIBM InfoSphere Streams v4.0 Performance Best Practices
Henry May IBM InfoSphere Streams v4.0 Performance Best Practices Abstract Streams v4.0 introduces powerful high availability features. Leveraging these requires careful consideration of performance related
More informationCSCE 626 Experimental Evaluation.
CSCE 626 Experimental Evaluation http://parasol.tamu.edu Introduction This lecture discusses how to properly design an experimental setup, measure and analyze the performance of parallel algorithms you
More informationCommunication Models for Resource Constrained Hierarchical Ethernet Networks
Communication Models for Resource Constrained Hierarchical Ethernet Networks Speaker: Konstantinos Katrinis # Jun Zhu +, Alexey Lastovetsky *, Shoukat Ali #, Rolf Riesen # + Technical University of Eindhoven,
More information