Improving the Performance and Extending the Scalability in the Cluster of SMP based Petaflops Computing


1 Improving the Performance and Extending the Scalability in the Cluster of SMP based Petaflops Computing Nagarajan Kathiresan, Ph.D., IBM India, Bangalore.

2 Agenda
Different types of cluster/SMP architectures
Trade-off between price and performance
Memory-bound and CPU-bound applications
Programming models
Hybrid scenario
Processor binding and memory affinity
Advantages of hybrid parallelization
Case study: molecular dynamics application
Conclusion
Acknowledgement

3 Different types of Cluster/SMP Architectures
Shared Memory MIMD
  Easy to build; processors communicate through shared memory.
  Limitations (reliability and expandability): a failure of a memory component or of any processor affects the whole system, and adding processors increases memory contention.
Distributed Memory MIMD
  Communication: Inter Process Communication (IPC).
  The network can be configured as a tree, mesh, cube, etc.
  Unlike shared memory MIMD, it is easily expandable and highly reliable (a CPU failure does not affect the whole system).
[Figure: shared memory MIMD, with Processors A and B on a memory bus to a global memory system; distributed memory MIMD, with Processors A, B and C, each on its own memory bus with its own memory system (A, B, C), connected by an IPC channel.]

4 Cluster of SMP Architectures
Multichip nodes
  Each node has one, two, four, eight or sixteen chips.
Multicore chips
  Each chip has 2, 4, 6, 8 cores, etc.
Memory is associated with chips
  NUMA architecture: memory is more accessible from cores on the same chip.

5 Trade-off between Price and Performance
Uniprocessor architecture: performance is limited because the application may not be able to run in parallel (e.g. an architecture limitation).
SMP architecture: scalability is limited by shared memory contention.
Cluster of SMP (pure MPI applications): performance saturates at some point because of imbalanced communication across MPI processes, the inability to increase memory consumption per MPI process, etc.
Source:

6 Memory bound and CPU bound applications
Year by year, CPU performance (speed) has increased exponentially, while memory speed has remained limited.
Most HPC applications are both CPU-intensive and memory-intensive workloads.
The scalability of these applications is limited because STREAM performance saturates beyond a certain number of cores.
An application configuration that uses a hybrid mechanism helps to bridge the gap between memory-bound and CPU-bound workloads.
Source:

7 Parallel Programming Models
Shared memory: OpenMP
Distributed memory: C + MPI
Shared + distributed memory: C + MPI + OpenMP

8 Hybrid Scenario
(i) MPI handles communication across the SMP nodes, and OpenMP handles parallelism within each node.
(ii) At node level, OpenMP threads avoid the extra explicit communication that MPI would require.
(iii) The shared memory/thread copy mechanism is used to improve dynamic load balancing.
(iv) MPI handles the parallelism across the SMP nodes for efficient communication.

9 Different Hybrid Configurations - views
16 Application threads
4 MPI x 4 Threads per task = 16 Application threads
1 MPI x 16 Threads per task = 16 Application threads
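The configurations on this slide are all factorizations of the same thread count into MPI tasks and threads per task. A minimal sketch of that arithmetic follows; the function name `hybrid_layouts` is hypothetical and not part of any MPI or OpenMP tooling.

```python
def hybrid_layouts(total_threads):
    """Return every (mpi_tasks, threads_per_task) pair whose product is total_threads."""
    layouts = []
    for mpi_tasks in range(1, total_threads + 1):
        # Only even splits are valid: every MPI task gets the same thread count.
        if total_threads % mpi_tasks == 0:
            layouts.append((mpi_tasks, total_threads // mpi_tasks))
    return layouts


if __name__ == "__main__":
    for mpi_tasks, threads in hybrid_layouts(16):
        print(f"{mpi_tasks} MPI x {threads} threads per task = "
              f"{mpi_tasks * threads} application threads")
```

Which of these layouts performs best depends on the application and the node topology; the sketch only enumerates the candidates.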

10 Processor binding and Memory affinity
CPU or processor binding
  Binds one or more processes to one or more processors (CPUs). Example: /bin/taskset
NUMA architecture
  In a NUMA architecture, memory is physically distributed throughout the machine even though it is presented as a single virtual address space.
  Memory affinity plays a crucial role in the performance of memory-bound applications.
  Local memory (close to the chip on which the MPI process runs) is faster to access than remote memory. Example: MEMORY_AFFINITY=MCM
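To make the binding idea concrete, here is a small sketch that works out which logical CPUs one MPI task and its threads would occupy, and then pins the current process to them. It assumes logical CPUs are numbered contiguously so that consecutive IDs sit on the same chip; real numbering varies by machine and should be checked first. The helper names are hypothetical. `os.sched_getaffinity` and `os.sched_setaffinity` are Python standard-library calls available on Linux only.

```python
import os


def cpu_set_for_task(local_rank, threads_per_task):
    """CPUs for one MPI task, assuming logical CPUs are numbered contiguously."""
    first = local_rank * threads_per_task
    return set(range(first, first + threads_per_task))


def bind_current_process(cpus):
    """Pin the calling process to `cpus` (Linux only); return the CPUs applied."""
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently use
    usable = cpus & allowed             # never request a CPU we are not allowed
    if usable:
        os.sched_setaffinity(0, usable)
    return usable


if __name__ == "__main__":
    cpus = cpu_set_for_task(local_rank=1, threads_per_task=4)
    print("requested CPUs:", sorted(cpus))
    if hasattr(os, "sched_setaffinity"):
        print("bound to CPUs:", sorted(bind_current_process(cpus)))
```

In practice the MPI launcher or a tool such as taskset usually performs the pinning; the sketch only shows what a per-task CPU set looks like.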

11 Case Study #1
CPMD is a parallelized plane wave / pseudopotential implementation of density functional theory, designed in particular for ab-initio molecular dynamics.
Dependent libraries to build CPMD: a math library (e.g. ESSL/MKL/ACML/ATLAS)
Configurations: MPI, OpenMP, MPI + OpenMP (hybrid)
Configuration flags:
  CC = mpicc
  FC = mpif77 -c
  LD = mpif77 -nofor_main
  FFLAGS = -L/opt/intel/mkl/ /lib/em64t -lmkl_em64t -lmkl_core \
           -lmkl_lapack -lmkl_em64t -lm -openmp -lpthread
Ref:

12 Literature Review & Reference
Test cases: 32 water molecules, Si512, etc. (standard example test cases)
Each CPMD run consists of two parts:
(i) an optimization step (wave function optimization), and
(ii) a molecular dynamics simulation (the actual simulation).
Ref:

13 Our Case Study Exercise
Using pure MPI:
1. The elapsed time and TCPU times are worse than with the hybrid model.
2. Scalability is poor at higher core counts.
Using the hybrid model (MPI + OpenMP):
1. The elapsed time and TCPU times are improved.
2. We were able to improve both scalability and performance.
Hardware configuration: HS22 Intel Blade Server (2-way quad-core), Nehalem processor
[Figure: Pure MPI Performance and Hybrid Performance; elapsed time (in sec) and scalability (in %) versus number of nodes.]

14 Summary: Pure MPI vs Hybrid
Optimization at runtime:
  export I_MPI_ADJUST_ALLTOALL=4
  export I_MPI_ADJUST_ALLREDUCE=5
  export I_MPI_DEVICE=rdssm
  export I_MPI_PIN_DOMAIN=socket
  export OMP_NUM_THREADS=8
i.e., 2 MPI processes x 8 threads per node
[Figure: total elapsed time (in sec) and performance improvement (in %) versus number of nodes, for the distributed memory configuration and the distributed shared memory configuration; AlltoAll and AllReduce.]
Ref: Intel MPI Library for Linux reference manual

15 Case Study #2
What is CP2K?
CP2K is a freely available (GPL) program, written in Fortran 95, to perform atomistic and molecular simulations of solid state, liquid, molecular and biological systems.
Dependent libraries:
  Fortran 95 compiler
  BLAS and LAPACK math libraries (ESSL, MKL or ACML may be used to improve performance)
  ScaLAPACK library for the parallel version of CP2K
  FFT library (optional, depending on the requirements of the input test case)
  Libint library (required only when the input file uses Hartree-Fock exchange (HFX) calculations)
  MPI library for the parallel version
Configurations: MPI version, OpenMP version, MPI/OpenMP hybrid version
Ref:

16 Challenges in Fortran 95 Compilers
The PGI, Pathscale and GNU Fortran compilers work fine with both the pure MPI and the hybrid (MPI + OpenMP) configurations.
Intel Fortran ( ) may not work for CP2K in the hybrid configuration (only for HFX-based calculations). The GNU Fortran compiler is therefore an alternative on Intel architecture.
With Gfortran or a later version, array operations are parallelized automatically by the workshare directives. A hybrid configuration using GNU Fortran version or later is therefore mandatory.
Function pointer support (?) is absent in Intel Fortran, so the pure MPI version needs to be compiled with ISO_C_BINDING support.
Due to some bugs (?) in the Intel Math Kernel Library (MKL), we were unable to link with Intel MKL for the hybrid model. Hence, we used open source math libraries (ScaLAPACK, BLAS, FFTW, LAPACK, etc.).

17 Example Hybrid Configuration
Notes:
  -D__GRID_CORE=X (with X=1..6): specific optimized core routines can be selected.
  -D__MAX_CONTR defines the maximum angular momentum up to which specialized code will be used.
  Both -D__MAX_CONTR=3 and -D__GRID_CORE=2 are optional.
  libint_cpp_wrapper.o is built from tools/hfx_tools/libint_tools/libint_cpp_wrapper.cpp.
Configuration:
  CC = gcc
  FC = mpif90
  LD = mpif90
  AR = ar r
  DFLAGS = -D__GFORTRAN -D__FFTSG -D__LIBINT -D__parallel -D__SCALAPACK \
           -D__BLACS -D__FFTW3 -D__MAX_CONTR=3 \
           -D__GRID_CORE=2
  FCFLAGS = -I <fftw-3.2.2/include> -O3 -fopenmp -ffast-math -funroll-loops -ftree-vectorize -march=native -ffree-form $(DFLAGS)
  LDFLAGS = $(FCFLAGS)
  LIBS = libint_cpp_wrapper.o \
         libderiv.a \
         libint.a \
         libscalapack.a \
         blacs_mpi-linux-0.a \
         blacscinit_mpi-linux-0.a \
         blacsf77init_mpi-linux-0.a \
         blacs_mpi-linux-0.a \
         lapack_linux.a \
         blas_linux.a \
         $(FFTW_LIB)/libfftw3.a \
         -lstdc++ \
         -lpthread
  OBJECTS_ARCHITECTURE = machine_gfortran.o

18 Performance and Scalability
[Figure: ratio between compute and communication using pure MPI; computation time and communication time (0% to 100%) versus number of cores and number of nodes.]
Even though the computation time falls as the number of cores increases, overall performance was poor because of a tremendous increase in communication time. (Input: regtest-hfx based input files.)
[Figure: ratio between compute and communication using hybrid (MPI + OpenMP); computation time and communication time (0% to 100%) versus number of cores and number of nodes.]
Using the hybrid parallelization paradigm, the communication time, which otherwise keeps growing with the number of cores, is reduced by 49%, 530%, 814% and 579% for 264, 516, 1032 and 2052 cores respectively, compared with the pure MPI pattern.

19 Pure MPI - Communication and Compute statistics
[Figure: ratio between compute and communication using pure MPI; computation time and communication time as a share of wall time (0% to 100%) versus number of cores and number of nodes.]
[Figure: maximum and minimum compute and communication using pure MPI; maximum compute, minimum compute, maximum communication and minimum communication versus number of cores and number of nodes.]
Most of the communication time is spent in MPI_Barrier() and MPI_Wait() calls because the communication is highly imbalanced.
The minimum and maximum communication ratios are (1.08%, 44.28%), (5.94%, 63.42%), (6.97%, 66.87%) and (26.38%, 69.84%) for 256, 512, 1024 and 2048 cores respectively, compared with the communication time.
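The minimum/maximum pairs on this slide summarize how unevenly communication is spread across MPI processes. The sketch below shows one way such a pair could be derived from per-process timings. It assumes the ratio is each process's communication time as a share of its own compute plus communication time, which the slide does not state explicitly, and the sample timings are invented for illustration rather than taken from the measurements.

```python
def communication_share(compute_s, comm_s):
    """Communication time as a percentage of compute + communication time."""
    return 100.0 * comm_s / (compute_s + comm_s)


def min_max_share(per_rank_times):
    """Return (min %, max %) communication share over all MPI processes.

    per_rank_times is a list of (compute_seconds, communication_seconds),
    one entry per MPI process.
    """
    shares = [communication_share(c, m) for c, m in per_rank_times]
    return min(shares), max(shares)


if __name__ == "__main__":
    # Hypothetical timings: one process is busy, another mostly waits.
    timings = [(9.0, 1.0), (5.0, 5.0), (7.5, 2.5)]
    low, high = min_max_share(timings)
    print(f"min communication share: {low:.2f}%")
    print(f"max communication share: {high:.2f}%")
```

A wide gap between the two values indicates that some processes spend most of their time waiting, which is consistent with time accumulating in barrier and wait calls.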

20 Hybrid parallelization
[Figure: maximum and minimum compute and communication using hybrid, as a share of wall time (0% to 100%); maximum compute, maximum communication, minimum compute and minimum communication versus number of cores and number of nodes.]
The maximum time spent on communication is not more than 36%, and the maximum time spent on computation is more than 98%.
The communication time (minimum and maximum) is reduced tremendously, which helped us achieve better load balancing.

21 Performance impact in Hybrid parallelization
[Figure: scalability and performance difference between pure MPI and the hybrid model (in %) versus number of cores and number of nodes; performance improvement using the hybrid model.]
Runtime settings:
  export I_MPI_DEVICE=rdssm
  export OMP_NUM_THREADS=1
  export I_MPI_PERHOST=allcores
  export I_MPI_PIN=1
  export I_MPI_PIN_MODE=mpd
  export I_MPI_PIN_PROCS=allcores
Hyper-threading is enabled, which doubles the number of cores. Physical cores are mostly used to run the MPI processes, and logical cores are used only for running OpenMP threads.
Instructions from more than one thread can be executing in any given pipeline stage at a time. A single thread rarely uses all the execution units, and when a thread stalls because the data it needs is not in the cache, the CPU ends up waiting on memory for hundreds of cycles. A second thread that is ready to run can fill these empty slots and so prevent resources from being wasted.
Therefore, the compute time for 1032 and 2052 cores is increased tremendously, and superlinear scalability of 17.46% and 42.94% (117% and 142% total improvement compared with pure MPI) is obtained for 1032 and 2052 cores respectively.
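The two figures quoted for each core count (for example 17.46% alongside 117%) look like the same comparison expressed two ways: a ratio against pure MPI, and the part of that ratio above parity. The sketch below shows that relationship under the assumption that the comparison is a ratio of elapsed times; the slide does not give the exact definition, and the timings used here are hypothetical.

```python
def performance_ratio_percent(pure_mpi_s, hybrid_s):
    """Hybrid performance relative to pure MPI, in percent (100% means parity)."""
    return 100.0 * pure_mpi_s / hybrid_s


def gain_over_parity_percent(pure_mpi_s, hybrid_s):
    """The portion of the ratio above 100%, i.e. the improvement itself."""
    return performance_ratio_percent(pure_mpi_s, hybrid_s) - 100.0


if __name__ == "__main__":
    # Hypothetical elapsed times in seconds, not measurements from the study.
    pure_mpi_s, hybrid_s = 150.0, 100.0
    print(f"ratio: {performance_ratio_percent(pure_mpi_s, hybrid_s):.2f}%")
    print(f"gain:  {gain_over_parity_percent(pure_mpi_s, hybrid_s):.2f}%")
```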

22 Additionally, increase the memory consumption by using MPI/OpenMP hybrid parallelism
The performance improvement from hybrid parallelism is obtained in the following ways:
o With the hybrid option, we can reduce the number of MPI tasks and increase the memory per MPI task significantly. This allows many of the 2-electron integrals to be stored in memory, reducing the time spent computing them repeatedly.
o With the hybrid option, we can exploit HT technology, which is not possible in the pure MPI case because of memory constraints.
o The hybrid option improves communication performance, since fewer MPI tasks contend for bandwidth, and it improves latency.
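The first point is simple division: the fewer MPI tasks share a node, the more memory each one can use. A minimal sketch follows, assuming a hypothetical node with 24 GB of memory and 16 logical cores; neither figure comes from the study, and the function name is illustrative.

```python
def memory_per_task_gb(node_memory_gb, mpi_tasks_per_node):
    """Memory available to each MPI task if the node's memory is split evenly."""
    return node_memory_gb / mpi_tasks_per_node


if __name__ == "__main__":
    node_memory_gb = 24.0  # hypothetical node size
    # (MPI tasks per node, threads per task): all layouts fill 16 logical cores.
    for tasks, threads in [(16, 1), (4, 4), (2, 8)]:
        share = memory_per_task_gb(node_memory_gb, tasks)
        print(f"{tasks} MPI x {threads} threads: {share:.1f} GB per MPI task")
```

The threads of one task share that task's memory, so moving from 16 tasks to 2 tasks per node leaves room for much larger per-task data such as stored integrals.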

23 Advantages of Hybrid Parallelization
Improves performance
Overlaps communication and computation
Increases the memory consumption per MPI process
Supports nested parallelism
Improves load balancing
Uses system resources efficiently (e.g. Simultaneous Multi Threading (SMT) on Power and Hyper Threading (HT) on Intel are used to run the worker threads of an MPI process)
Extends scalability towards linear or super-linear

24 Conclusion
Running the parallel program in hybrid mode helps to improve performance and makes it possible to achieve super-linear scalability.
Appropriate process binding is used to pin the MPI processes at the core/socket level, which increases the efficiency of inter-node communication.

25 Acknowledgement & Thanks!
Luigi Brochard, IBM, Distinguished Engineer
Swamy N. Kandadai, IBM, Senior Executive IT Specialist
Rajendra D. (Raj) Panda, IBM, Software Performance Analyst
Mathias Puetz, IBM, Application Performance Specialist

26 Thank you!


More information

High performance Computing and O&G Challenges

High performance Computing and O&G Challenges High performance Computing and O&G Challenges 2 Seismic exploration challenges High Performance Computing and O&G challenges Worldwide Context Seismic,sub-surface imaging Computing Power needs Accelerating

More information
