Benchmark runs of pcmalib on Nehalem and Shanghai nodes


MOSAIC group, Institute of Theoretical Computer Science, Department of Computer Science, ETH Zurich

Christian Lorenz Müller, April 9

Address: Institute for Theoretical Computer Science and Swiss Institute of Bioinformatics, ETH Zurich, CH 9 Zurich, Switzerland. E-mail: christian.mueller@inf.ethz.ch

1 Introduction

This document summarizes parallel benchmark results of pcmalib on different processors. pcmalib implements the Evolution Strategy with Covariance Matrix Adaptation (CMA-ES) and its Particle Swarm extension (PS-CMA-ES). In the next section we describe the computer-experimental set-up and its technical realization on the different platforms. Section 3 describes in detail the results obtained on the MOSAIC cluster and with the best hardware/software combination of the newly conducted tests. Section 4 presents the wall-clock times for all conducted tests, and the last section summarizes and discusses the results.

2 Experimental Set-up

We assess the parallel efficiency of our implementations of CMA-ES and PS-CMA-ES on a distributed-memory computer cluster on the constrained random fitness landscape F_rand(x) = Y, where x is defined in the bounded subset [ , ]^n ⊂ R^n. For any x, Y is drawn from the uniform distribution U( , ). Each algorithm evaluates the fitness function 5 times (corresponding to drawing 5 uniformly distributed random numbers) on p = ,..., processor cores (for the newly tested processors we only go up to p = ). The number of CMA-ES instances, or the swarm size S in PS-CMA-ES, is always chosen equal to the number of processor cores p in order to avoid cache and memory congestion effects. Distributing a problem of fixed size onto an increasing number of processors measures the strong scaling of the algorithms: the workload per processor decreases while the communication overhead increases.

The random landscape F_rand ensures several properties that are indispensable for an unbiased assessment of the parallel scaling. First, the computational cost of evaluating the objective function is independent of the search dimension and of the specific optimization path. Second, the random landscape guarantees that all CMA-ES instances experience the same search space.

We perform three benchmarks with varying values of the communication interval I_c in order to disentangle the influence of the covariance matrix eigendecomposition and of the MPI communication in PS-CMA-ES. The first set-up considers the standard parallel CMA-ES without swarm communication, i.e., I_c = . The second benchmark evaluates the performance of the standard PS-CMA-ES with I_c = . Since I_c is given in units of generations, and increasing S (= p) also increases the number of function evaluations per generation, the total number of MPI communications performed during the fixed 5 function evaluations decreases. Therefore, the third set-up considers PS-CMA-ES with a constant number of MPI communication steps, independent of the swarm size S. This is achieved by setting I_c = /S. All three benchmarks are conducted in n = , 3, 5, dimensions. Each benchmark is repeated r = 1,...,R times. For each repetition r, we measure the elapsed wall-clock time

t_{i,r} on each processor core i = 1,...,p. The overall run time t(p) of the algorithm on p processors is given by the maximum time over all processes, averaged over the R independent runs:

    t(p) = mean_r [ max_{i=1,...,p} t_{i,r} ].    (1)

From this, the parallel speedup s and efficiency e are defined as

    s(p) = t(1) / t(p),    e(p) = s(p) / p.    (2)

2.1 Reference set-up of pcmalib on the MOSAIC cluster

The Fortran library is compiled with the Intel Fortran compiler version . at optimization level O3, and linked against OpenMPI version .. . The tests are performed on a Gentoo ..5 Linux cluster consisting of compute nodes. Each node contains Intel Xeon . GHz quad-core processors ( cores per node) with GB of RAM per core. The nodes are connected by a dedicated Gigabit Ethernet network, entirely reserved for MPI communication (there is a second, identical network for system communication). TORQUE and Maui are used as resource manager and queuing system, respectively. In order to assess the influence of intra- vs. inter-node MPI communication, the scheduler is instructed to assign MPI processes per node.

2.2 Set-up on the Nehalem node

Enda O'Brien from DALCO used an -core Nehalem node (Xeon 557, .93 GHz) with GB memory, running SUSE Linux ...-.-smp. Three different software combinations were tested in order to disentangle the influence of the processor and of the numerical libraries on the speed-up:

1. The first and fastest software combination consists of the Intel ..3 compilers, the Intel MPI V.3...9, and the LAPACK included with the MKL libraries.
2. The second software combination consists of the Intel ..3 compilers, the Intel MPI V.3...9, and the LAPACK included with the AMD Core Math Library (ACML).
3. The third combination uses the Pathscale V.3. compilers, ACML .., and OpenMPI V...5 (also built with Pathscale).

2.3 Set-up on the Shanghai node

Enda O'Brien also tested an -core AMD 37 ("Shanghai") node with . GHz ( x quad-core processors) and GB memory. The same three software combinations were tested on this platform:

1. The first and again fastest software combination consists of the Intel ..3 compilers, the Intel MPI V.3...9, and the LAPACK included with the MKL libraries.
2. The second software combination consists of the Intel ..3 compilers, the Intel MPI V.3...9, and the LAPACK included with the AMD Core Math Library (ACML).
3. The third (and for AMD most natural) combination uses the Pathscale V.3. compilers, ACML .., and OpenMPI V...5 (also built with Pathscale).

3 Results on the MOSAIC cluster and the Nehalem node with Intel software

We first present and analyze the results obtained on the MOSAIC cluster (identical to the ones in Mueller9b.pdf). The second subsection contains the detailed results of the Intel compiler/MPI/MKL test on the Nehalem node.
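
For illustration only, the following Python/mpi4py sketch mirrors the measurement procedure defined above and in Eq. (1); pcmalib itself is a Fortran library, so this is not its actual code. Every MPI rank plays the role of one optimizer instance evaluating the random landscape, the per-rank wall-clock time t_{i,r} is taken with MPI.Wtime(), and t(p) is obtained as the maximum over all ranks. The evaluation budget, dimension, and box bounds are placeholder values.

```python
# Hypothetical strong-scaling probe modelled on the set-up above; not part of pcmalib.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

TOTAL_EVALS = 100_000      # placeholder for the fixed overall evaluation budget
DIM = 30                   # placeholder search dimension n

def f_rand(x, rng):
    """Constrained random landscape: the value is a uniform draw, independent of x."""
    return rng.uniform()

rng = np.random.default_rng(seed=rank)
evals_per_rank = TOTAL_EVALS // nprocs   # fixed total work split over p ranks (strong scaling)

comm.Barrier()             # start all ranks together
t0 = MPI.Wtime()
for _ in range(evals_per_rank):
    x = rng.uniform(-1.0, 1.0, size=DIM)   # placeholder bounds for the box constraint
    f_rand(x, rng)
t_local = MPI.Wtime() - t0                 # t_{i,r} for this rank

# t(p) of Eq. (1): maximum elapsed time over all ranks (averaging over R runs happens outside).
t_run = comm.reduce(t_local, op=MPI.MAX, root=0)
if rank == 0:
    print(f"p = {nprocs}: t(p) = {t_run:.3f} s")
```

Such a probe would be launched with, e.g., mpirun -np 8 python scaling_probe.py and repeated R times to obtain the averaged t(p).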

3.1 Results on the MOSAIC cluster

The measured maximum wall-clock times for all three benchmarks on the MOSAIC cluster are reported in Fig. 1, the speedups in Fig. 2, and the parallel efficiencies in Fig. 3.

Figure 1: Overall run time t(p) in seconds for the parallel CMA-ES, and for PS-CMA-ES with constant and decreasing number of MPI communications, on the random landscape test problem in n = , 3, dimensions. The number of processor cores is varied from to . Each point is averaged over R = 5 runs. The standard deviations are close to zero (data not shown).
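
The speedups in Fig. 2 and the efficiencies in Fig. 3 follow directly from the run times in Fig. 1 via Eq. (2). A minimal post-processing sketch is shown below; the arrays hold placeholder values, not the measured data.

```python
import numpy as np

# Placeholder core counts p and averaged run times t(p) in seconds (illustrative only).
p = np.array([1, 2, 4, 8, 16])
t = np.array([40.0, 21.0, 11.0, 6.0, 3.5])

speedup = t[0] / t           # s(p) = t(1) / t(p)
efficiency = speedup / p     # e(p) = s(p) / p

for pi, si, ei in zip(p, speedup, efficiency):
    print(f"p = {pi:3d}   s = {si:5.2f}   e = {ei:4.2f}")
```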

Figure 2: Parallel speedup s of the parallel CMA-ES, and of PS-CMA-ES with constant and decreasing number of MPI communications, on the random landscape test problem in n = , 3, dimensions. The number of processor cores is varied from to . Each point is averaged over R = 5 runs. The standard deviations are close to zero (data not shown).

Figure 3: Parallel efficiency e of the parallel CMA-ES, and of PS-CMA-ES with constant and decreasing number of MPI communications, on the random landscape test problem in n = , 3, dimensions. The number of processor cores is varied from to . Each point is averaged over R = 5 runs. The standard deviations are close to zero (data not shown).

In n = dimensions, there are no noticeable differences between the three test set-ups. Up to p = , i.e. on a single node, the wall-clock time decreases from .5 s to below s. The speedup increases up to , and the efficiency decreases to .7. When using two nodes ( processes), the wall-clock time increases again, and speedup and efficiency drop considerably. This is expected, as the network latency becomes the limiting factor for such a small test problem.

The situation changes in higher dimensions. For n = 3, the wall-clock time of parallel CMA-ES decreases from s on a single core to below s on cores. The two PS-CMA-ES tests need around 7 s on a single core due to the additional construction of the rotation matrix. The PS-CMA-ES with a constant number of MPI communications shows a scaling similar to that of the parallel CMA-ES, with an offset of about 5 seconds corresponding to the constant communication overhead. The PS-CMA-ES with a decreasing number of communications approaches the behavior of the standard parallel CMA-ES since, with increasing p, the MPI communication overhead and the 3-dimensional rotations become negligible compared to the computational cost of CMA-ES. This is also reflected in the parallel speedup and efficiency. The standard PS-CMA-ES with I_c = achieves the best efficiency (due to a higher computational cost on a single core), closely followed by the parallel CMA-ES. The same qualitative behaviour is observed in n = 5

(figures not shown), but, due to the higher computational cost, the parallel efficiency increases further. The computational costs of the basic CMA-ES operations and of the matrix rotations now dominate, and the communication overhead becomes less apparent. On a single core, parallel CMA-ES needs s and the two PS-CMA-ES variants around s. While the wall-clock time of the standard PS-CMA-ES rapidly approaches that of CMA-ES with increasing p, the PS-CMA-ES with a constant number of MPI communications shows an offset of around 5 s due to the communication overhead and the 5-dimensional matrix rotation. The speedups of the parallel CMA-ES and of the standard PS-CMA-ES at p = are and 5, respectively, corresponding to parallel efficiencies of 5 and .75. For n = , the parallel scaling improves further. The efficiency of the standard CMA-ES is .7 on cores, while the standard PS-CMA-ES achieves a super-linear efficiency of .7 (due to the decreasing number of MPI communications).

3.2 Results of the Intel compiler / Intel MKL / Intel MPI test on the Nehalem node

The measured maximum wall-clock times for all three benchmarks on the Nehalem node are reported in Fig. 4, the speedups in Fig. 5, and the parallel efficiencies in Fig. 6.

Figure 4: Overall run time t(p) in seconds for the parallel CMA-ES, and for PS-CMA-ES with constant and decreasing number of MPI communications, on the random landscape test problem in n = , 3, dimensions. The number of processor cores is varied from to . Each point is averaged over R = runs. The standard deviations are close to zero (data not shown).

Figure 5: Parallel speedup s of the parallel CMA-ES, and of PS-CMA-ES with constant and decreasing number of MPI communications, on the random landscape test problem in n = , 3, dimensions. The number of processor cores is varied from to . Each point is averaged over R = runs. The standard deviations are close to zero (data not shown).

Figure 6: Parallel efficiency e of the parallel CMA-ES, and of PS-CMA-ES with constant and decreasing number of MPI communications, on the random landscape test problem in n = , 3, dimensions. The number of processor cores is varied from to . Each point is averaged over R = runs. The standard deviations are close to zero (data not shown).

In n = dimensions, there are again no significant differences between the three test set-ups. Up to p = , i.e. on a single node, the wall-clock time decreases from . s to below . s. The speedup increases up to , and the efficiency decreases to .7. For n = 3, the wall-clock time of parallel CMA-ES decreases from 7. s on a single core to below . s on cores. The two PS-CMA-ES tests need around 9.5 s on a single core due to the additional construction of the rotation matrix. The PS-CMA-ES with a constant number of MPI communications shows a scaling similar to that of the parallel CMA-ES, with an offset of about seconds corresponding to the constant communication overhead. The PS-CMA-ES with a decreasing number of communications approaches the behavior of the standard parallel CMA-ES since, with increasing p, the MPI communication overhead and the 3-dimensional rotations become negligible compared to the computational cost of CMA-ES. This is also reflected in the parallel speedup and efficiency. The standard PS-CMA-ES with I_c = achieves the best efficiency (due to a higher computational cost on a single core), closely followed by the parallel CMA-ES. The same qualitative behaviour is observed in n = 5 (figures not shown), but, due to the higher computational cost, the parallel efficiency increases further. The computational costs of the basic CMA-ES operations and of the matrix rotations now dominate, and the communication overhead becomes less apparent. On a single core, parallel CMA-ES needs s and the two PS-CMA-ES variants around 35 s. While the wall-

clock time of the standard PS-CMA-ES rapidly approaches that of CMA-ES with increasing p, the PS-CMA-ES with a constant number of MPI communications shows an offset of around s due to the communication overhead and the 5-dimensional matrix rotation. The speedups of the parallel CMA-ES and of the standard PS-CMA-ES at p = are 7.5 and , respectively, corresponding to parallel efficiencies of .9 and . For n = , the parallel scaling improves further. The efficiency of both the standard parallel CMA-ES and PS-CMA-ES is around .97 on cores.

4 Comparison of the performance on the Nehalem and the Shanghai node for different software libraries

Due to limited time on my side, I decided to show only the wall-clock times of the conducted test runs. Figure 7 summarizes the three different software combinations on the Nehalem node, Figure 8 the same combinations on the Shanghai node. Due to the automatic generation of the plots, the y-axes representing the wall-clock times are scaled according to the maximum wall-clock time occurring in each test run, which makes a visual comparison between test runs somewhat tricky. I am sorry for the inconvenience, but this is all I have for the moment.

4.1 The Nehalem node

The main observation for the Nehalem node is that the combination of Intel compiler/MKL/MPI indeed gives the overall best single-core performance in all dimensions and for all settings of parallel (PS-)CMA-ES, compared to the other software combinations. The speed-up is between % and 3%, depending on the number of processors used, the algorithmic set-up, and the dimension. For example, for n = , the wall-clock time is below .5 s, while the ACML swap and the Pathscale/ACML/Open MPI combinations are around 3.5 s and 3 s, respectively. For higher dimensions, the Intel compiler/ACML/Intel MPI combination performs slightly better than Pathscale/ACML/Open MPI, indicating that the compiler/MPI combinations are less influential than the use of the optimized LAPACK routines in ACML/MKL.

4.2 The Shanghai node

The picture changes when evaluating the software combinations on the Shanghai node. Again, the Intel compiler/MKL/MPI combination outperforms the other two, especially for n = , 3. For n = 5, , however, the Intel compiler/MKL/MPI and Intel compiler/ACML/MPI combinations give almost identical results. This indicates that the superiority of the MKL LAPACK routines vanishes on the AMD cores. The Pathscale/ACML/Open MPI combination is inferior to the first two combinations in all dimensions.

4.3 Nehalem and Shanghai in comparison

The Intel libraries in combination with the Intel cores give the overall best performance. However, the AMD cores give a better average performance independent of the libraries used: all three AMD set-ups are, e.g., faster than software combinations 2 and 3 on the Nehalem node for n = 5, . However, as Enda O'Brien already pointed out in the correspondence, the LAPACK libraries should be considered a real part of the system, and as such the Intel combination is undoubtedly the best one for pcmalib.
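
The observation that the LAPACK back-end matters more than the compiler/MPI pairing is consistent with the covariance matrix eigendecomposition being the linear-algebra kernel that pcmalib delegates to LAPACK (MKL or ACML). As a rough, purely illustrative probe of this kernel, and not a substitute for the actual benchmark, the symmetric eigensolver can be timed from Python; numpy.linalg.eigh dispatches to whatever LAPACK/BLAS NumPy is linked against, and the dimensions below are placeholders.

```python
import time
import numpy as np

def time_eigh(n, repeats=50):
    """Average time of one symmetric eigendecomposition of an n x n covariance-like matrix."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    cov = a @ a.T / n                 # symmetric positive semi-definite test matrix
    t0 = time.perf_counter()
    for _ in range(repeats):
        np.linalg.eigh(cov)           # LAPACK *syevd via whichever backend NumPy uses
    return (time.perf_counter() - t0) / repeats

for n in (10, 30, 50):                # placeholder dimensions
    print(f"n = {n:3d}: {time_eigh(n) * 1e3:.3f} ms per decomposition")
```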

Figure 7: Overall run time t(p) in seconds for the parallel CMA-ES, and for PS-CMA-ES with constant and decreasing number of MPI communications, on the random landscape test problem in n = , 3, 5, dimensions on the Nehalem node. The number of processor cores is varied from to . Each point is averaged over R = runs. The standard deviations are close to zero (data not shown). The first column depicts the results for the combination Intel Compiler/MKL/Intel MPI, the second one Intel Compiler/ACML/Intel MPI, and the third one Pathscale/ACML/Open MPI.

Figure 8: Overall run time t(p) in seconds for the parallel CMA-ES, and for PS-CMA-ES with constant and decreasing number of MPI communications, on the random landscape test problem in n = , 3, 5, dimensions on the Shanghai node. The number of processor cores is varied from to . Each point is averaged over R = runs. The standard deviations are close to zero (data not shown). The first column depicts the results for the combination Intel Compiler/MKL/Intel MPI, the second one Intel Compiler/ACML/Intel MPI, and the third one Pathscale/ACML/Open MPI.

5 Concluding remarks

In this document I have tried to summarize the results Enda O'Brien kindly generated with our pcmalib test bed. My general conclusion is that the Intel cores together with the Intel software libraries indeed achieve the best single-core performance. As pcmalib heavily relies on LAPACK routines for eigenvector decomposition, the results show the excellent performance of the Intel MKL in combination with Intel cores. Compared to our previous results on the MOSAIC cluster, the Nehalem node almost halves the wall-clock time for the n = set-up. Compared to our published results, a significant speed-up is also gained with the Shanghai node. When running parallel (PS-)CMA-ES on several cores, the differences between the hardware/software combinations become less apparent, except in the case n = . Although we believe that the given test bed provides a very good set-up to evaluate the speed-up and scalability of pcmalib on different processor cores, it is not clear how well the results generalize to real-world optimization tasks, e.g., when the fitness/objective function involves more complex numerical operations than generating pseudo-random numbers.
