Advanced School in High Performance and GRID Computing, 3-14 November 2008: Mathematical Libraries, Part I


1967-10: Advanced School in High Performance and GRID Computing, 3-14 November 2008. Mathematical Libraries, Part I. KOHLMEYER Axel, University of Pennsylvania, Department of Chemistry, 231 South 34th Street, Philadelphia, PA 19104, U.S.A.

Mathematical Libraries (Part 1). ICTP Advanced School in High Performance and GRID Computing. Axel Kohlmeyer, Center for Molecular Modeling, University of Pennsylvania. ICTP, Trieste, Italy, 05 November 2008.

Overview: Opportunities to improve application performance on modern CPUs. Why use performance libraries? Using / linking libraries. Example: Using DGEMM from BLAS.

Using Cache Efficiently. Cache is a fast but small memory area, located on the CPU or close to it. It compensates for the discrepancy between CPU speed (fast) and memory speed (slow). It typically works transparently, and is kept coherent in multi-core, multi-CPU environments. The key to good performance is to write code that maximizes the effect of the cache memory. The optimal code structure (blocking) depends on CPU speed, model, and architecture; a sketch of blocking follows below.
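A minimal sketch of loop blocking, with a hypothetical block size nb that would have to be tuned to the cache of the target CPU:

      ! blocked matrix-matrix multiply C = C + A*B (illustrative sketch only;
      ! in practice one would call an optimized DGEMM instead)
      subroutine blocked_matmul(n, a, b, c)
        implicit none
        integer, intent(in)    :: n
        real*8, intent(in)     :: a(n,n), b(n,n)
        real*8, intent(inout)  :: c(n,n)
        integer, parameter :: nb = 64   ! block size: an assumption, tune per CPU
        integer :: i, j, k, ii, jj, kk
        do jj = 1, n, nb
          do kk = 1, n, nb
            do ii = 1, n, nb
              ! work on one cache-sized block at a time, so the data
              ! for the inner three loops stays resident in cache
              do j = jj, min(jj+nb-1, n)
                do k = kk, min(kk+nb-1, n)
                  do i = ii, min(ii+nb-1, n)
                    c(i,j) = c(i,j) + a(i,k) * b(k,j)
                  end do
                end do
              end do
            end do
          end do
        end do
      end subroutine blocked_matmul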

Using Special Instructions. Modern CPUs contain many special-purpose instructions: MMX, SSE, 3DNow!, AltiVec. Many allow operating on multiple data elements in parallel: SIMD, vector instructions. Programming in assembly is required to use them with full control, although compilers can often generate them automatically for simple loops (see the sketch below). In general they are not portable between different CPU architectures and models. They give significant speedups for linear algebra operations and signal processing.
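As a hedged illustration, a simple regular loop like the following is the kind of code most compilers can auto-vectorize (e.g. at -O2/-O3) so that SIMD instructions are used without writing any assembly:

      ! daxpy-style update y = y + alpha*x; the independent iterations
      ! let the compiler map this loop onto SIMD/vector instructions
      subroutine my_axpy(n, alpha, x, y)
        implicit none
        integer, intent(in)   :: n
        real*8, intent(in)    :: alpha, x(n)
        real*8, intent(inout) :: y(n)
        integer :: i
        do i = 1, n
          y(i) = y(i) + alpha * x(i)
        end do
      end subroutine my_axpy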

Using Multi-core / Multi-CPU. Modern CPUs contain multiple CPU cores. A parallel program (MPI, OpenMP, ...) is needed to exploit the additional capability. Parallelism in code needs rewrites/restructuring (MPI) or instrumentation (OpenMP); a minimal sketch of the instrumentation route follows below.
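A minimal sketch of OpenMP instrumentation, assuming a compiler flag such as gfortran's -fopenmp; the directive lines are ignored when compiled without OpenMP support:

      ! dot product with the work shared across the available cores
      ! build with, e.g.: gfortran -fopenmp example.f90
      function pdot(n, x, y) result(s)
        implicit none
        integer, intent(in) :: n
        real*8, intent(in)  :: x(n), y(n)
        real*8 :: s
        integer :: i
        s = 0.d0
!$omp parallel do reduction(+:s)
        do i = 1, n
          s = s + x(i) * y(i)
        end do
!$omp end parallel do
      end function pdot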

What are performance libraries? Routines for common (math) functions, such as vector and matrix operations, fast Fourier transforms, etc., written in a specific way to take advantage of the capabilities of the CPU. Each CPU type normally has its own version of the library, specifically written or compiled to maximally exploit that architecture. They make coding easier: complicated math operations can be used from existing routines. They increase the portability of code, as standard (and well optimized) libraries exist for ALL computing platforms.

Why use performance libraries? Compilers can optimize code only up to a point (they are dumb); effective programming needs deep knowledge of the platform. Performance libraries are designed to use the CPU in the most efficient way, which is not necessarily the most straightforward way. It is normally best to use the libraries supplied or recommended by the CPU vendor. On modern hardware they are hugely important, as they most efficiently exploit caches, special instructions, and parallelism.

Standardization. Subroutines have a standardized layout. BLAS is documented in the source code; man pages exist; vendor-supplied docs. Different BLAS implementations have the same calling sequence:

      SUBROUTINE DGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA,
     $                   B, LDB, BETA, C, LDC )
*     .. Scalar Arguments ..
      CHARACTER*1        TRANSA, TRANSB
      INTEGER            M, N, K, LDA, LDB, LDC
      DOUBLE PRECISION   ALPHA, BETA
*     .. Array Arguments ..
      DOUBLE PRECISION   A( LDA, * ), B( LDB, * ), C( LDC, * )
*
*  Purpose
*  =======
*  DGEMM performs one of the matrix-matrix operations
*
*     C := ALPHA*op( A )*op( B ) + BETA*C,
*
*  where op( X ) is one of
*
*     op( X ) = X   or   op( X ) = X',
*
*  ALPHA and BETA are scalars, and A, B and C are matrices, with
*  op( A ) an M by K matrix, op( B ) a K by N matrix and C an
*  M by N matrix.
*
*  Parameters
*  ==========
*  TRANSA - CHARACTER*1. On entry, TRANSA specifies the form of
*           op( A ) to be used in the matrix multiplication as follows:
*              TRANSA = 'N' or 'n',  op( A ) = A.
*              TRANSA = 'T' or 't',  op( A ) = A'.
*              TRANSA = 'C' or 'c',  op( A ) = A'.
*           Unchanged on exit.
*  TRANSB - CHARACTER*1. On entry, TRANSB specifies the form of
*           op( B ) to be used in the matrix multiplication as follows:
*              TRANSB = 'N' or 'n',  op( B ) = B.
*              TRANSB = 'T' or 't',  op( B ) = B'.

How to use libraries in code:

[roussea@samson qmc_code]$ more alldet.f90
      subroutine alldet(iopt,indt,nelorb,nelup,neldo,ainv,winv
     1,ainvupb,derl)
      implicit none
      integer iopt,nelorb,nelup,neldo,nelt,i,j,nelorb5
     1,indt
      real*8 ainv(nelup,*),ainvupb(nelup,*)
     1,derl(nelorb,*),winv(nelorb,0:indt+4,*)
      nelorb5=(indt+5)*nelorb
      call dgemm('n','t',nelup,nelorb,nelup,1.d0,ainv,nelup
     1,winv(1,0,1),nelorb5,0.d0,ainvupb,nelup)
      call dgemm('n','n',nelorb,nelorb,neldo,1.d0,winv(1,0,nelup+1)
     1,nelorb5,ainvupb,nelup,0.d0,derl,nelorb)

Within your code you simply call the BLAS/LAPACK routines as if they were subroutines you would normally write. Note: check the BLAS/LAPACK manuals for the name of the routine, which variables need to be passed to it, and in what order. NB: DGEMM = Double precision GEneral Matrix-Matrix multiplication.
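To make the calling convention concrete, here is a hedged, self-contained sketch (file name and matrix sizes are made up for illustration) that multiplies two small matrices with DGEMM:

      ! mm.f90: minimal DGEMM usage example (illustrative sketch)
      ! build with, e.g.: gfortran mm.f90 -lblas
      program mm
        implicit none
        integer, parameter :: n = 3
        real*8 :: a(n,n), b(n,n), c(n,n)
        a = 1.d0           ! fill A with ones
        b = 2.d0           ! fill B with twos
        c = 0.d0
        ! C := 1.0 * A * B + 0.0 * C
        call dgemm('n','n',n,n,n,1.d0,a,n,b,n,0.d0,c,n)
        print *, 'C(1,1) =', c(1,1)   ! expect 6.0 for these inputs
      end program mm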

BLAS/LAPACK Implementations. BLAS and LAPACK reference implementation (portable Fortran77). Intel Math Kernel Library (MKL): BLAS/LAPACK (+FFT) routines modified for best performance on IA32/x86, x86_64/amd64/em64t, and IA64 based machines. AMD Core Math Library (ACML): BLAS/LAPACK (+FFT) routines modified for best performance on AMD x86 and x86_64 based machines. Automatically Tuned Linear Algebra Software (ATLAS): BLAS and LAPACK routines that use empirical tests to tune machine-specific parameters. GotoBLAS: a BLAS implementation with special optimizations for modern (x86/x86_64) CPU architectures.

Linking Your Code to Libraries. Linker flags: -L/some/other/dir -lm -> search for libm.so/libm.a also in /some/other/dir. Order matters. Example: LAPACK uses BLAS => -L/usr/local/lib -llapack -lblas. ATLAS is written in C with Fortran77 wrappers => -L/opt/atlas/lib -lf77blas -latlas. MKL uses "collections"; using -lmkl links libmkl_intel_lp64.so, libmkl_intel_thread.so, libmkl_core.so => check with "ldd". Check LD_LIBRARY_PATH with shared libs. A sketch of a complete build line follows below.
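A hedged example of a complete compile-and-link sequence, assuming gfortran and a LAPACK/BLAS installed under /usr/local (paths, compiler, and library names vary per system):

[user@host ~]$ gfortran -O2 mm.f90 -o mm -L/usr/local/lib -llapack -lblas
[user@host ~]$ ldd ./mm    # verify which shared libraries were picked up
[user@host ~]$ export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
[user@host ~]$ ./mm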

Library Performance Example. [Figure: DGEMM performance vs. matrix size N (up to ~2000), comparing a 2.6 GHz Opteron with ACML 2.5, ACML 2.6, GotoBLAS, ATLAS, and reference LAPACK against a 3.0 GHz EM64T with 2 MB cache running MKL 7.2, ATLAS, threaded ATLAS, and reference LAPACK, plus a 1 MB cache EM64T with MKL 7.2.] The absolute and relative performance of a library is very platform dependent. Vendor optimized libraries are the fastest most of the time.

Comments on LAPACK and BLAS. BLAS levels: 1) vector, 2) matrix-vector, 3) matrix-matrix operations. BLAS Level 1 performs O(N) memory operations for O(N) CPU operations, whereas Level 3 performs only O(N^2) memory operations for O(N^3) CPU operations => more efficient (a short worked count is given below). Thus combining operations over vectors into matrix-matrix operations is useful. Reference LAPACK/BLAS are better than "coding by hand", BUT are less efficient than the vendor-supplied versions of the same library or ATLAS. LAPACK relies heavily on BLAS, so the key to good performance is a fast BLAS.
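Worked count for the square case: DAXPY (Level 1, y = y + alpha*x) performs 2N floating-point operations while touching 3N array elements (read x, read y, write y), a ratio of about 2/3 flops per memory access that never improves with N. DGEMM (Level 3, C = alpha*A*B + beta*C) performs about 2N^3 floating-point operations on only about 4N^2 array elements, a ratio that grows like N/2, so a good implementation can keep cache-sized blocks of the matrices resident and keep the floating-point units busy.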

Beware! Multi-threaded Libraries. FFTW: thread parallelization through an additional library, needs special management. ATLAS: fixed number of threads compiled in. MKL: threading with OpenMP - by default uses all available cores => conflicts with MPI, NUMA-SMP workstations - modular, can be tuned to compiler, linker, threads - default linking sequence for ifort uses Intel threads => add -openmp to the linker/compiler command - serial: -lmkl_intel_lp64 -lmkl_sequential -lmkl_core - GNU: -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core - see documentation for details (always RTFM!). A sketch of the serial link line in use follows below.
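A hedged example of the serial MKL link line named above, assuming ifort with MKL already in the compiler's library search path (exact library sets differ between MKL versions; check the MKL documentation):

[user@host ~]$ ifort -O2 mm.f90 -o mm -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread
[user@host ~]$ ldd ./mm    # confirm the sequential, not threaded, MKL layer is linked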