Some notes on efficient computing and high performance computing environments


Some notes on efficient computing and high performance computing environments
Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3
July 31, 2017
1 Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland.
2 Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles.
3 Departments of Forestry and Geography, Michigan State University, East Lansing, Michigan.

Code implementation

Very useful libraries for efficient matrix computation:

1. Fortran BLAS (Basic Linear Algebra Subprograms; see Blackford et al. 2001). Started in the late 1970s at NASA JPL by Charles L. Lawson.
2. Fortran LAPACK (Linear Algebra Package; see Anderson et al. 1999). Started in the mid 1980s at Argonne and Oak Ridge National Laboratories.

Modern math software relies heavily on these libraries, e.g., MATLAB and R. Routines are also accessible via C, C++, Python, etc.

There are many improved implementations of the standard BLAS and LAPACK functions, see, e.g.,

- Intel Math Kernel Library (MKL)
- AMD Core Math Library (ACML)
- Automatically Tuned Linear Algebra Software (ATLAS)
- Matrix Algebra on GPU and Multicore Architectures (MAGMA)
- OpenBLAS, http://www.openblas.net
- vecLib (for Mac users only)

Key BLAS and LAPACK functions used in our setting:

dpotrf: LAPACK routine to compute the Cholesky factorization of a real symmetric positive definite matrix.
dtrsv:  Level 2 BLAS routine to solve the system of equations Ax = b, where x and b are vectors and A is a triangular matrix.
dtrsm:  Level 3 BLAS routine to solve the matrix equation AX = B, where X and B are matrices and A is a triangular matrix.
dgemv:  Level 2 BLAS matrix-vector multiplication.
dgemm:  Level 3 BLAS matrix-matrix multiplication.
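To make the mapping concrete, here is a minimal sketch (an illustration added here, not from the original slides) of calling two of these routines from C++ through the standard CBLAS/LAPACKE interfaces shipped with MKL, OpenBLAS, etc.; the small matrix, compile line, and file name are arbitrary choices.

// Sketch: Cholesky factorization (dpotrf) and triangular solve (dtrsv)
// via the CBLAS/LAPACKE C interfaces. With MKL, include <mkl.h> instead.
// Compile, e.g.: g++ chol_solve.cpp -llapacke -lopenblas  (libraries depend on your install).
#include <cstdio>
#include <cmath>
#include <cblas.h>
#include <lapacke.h>

int main() {
    // A is a 3x3 symmetric positive definite matrix, stored column-major.
    const int n = 3;
    double A[] = {4.0, 2.0, 1.0,
                  2.0, 5.0, 3.0,
                  1.0, 3.0, 6.0};
    double b[] = {1.0, 2.0, 3.0};

    // Cholesky factorization A = L L^T (lower triangle overwritten with L).
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, n);
    if (info != 0) { std::printf("dpotrf failed: %d\n", info); return 1; }

    // Log-determinant from the Cholesky factor: log|A| = 2 * sum_i log(L_ii).
    double logdet = 0.0;
    for (int i = 0; i < n; ++i) logdet += 2.0 * std::log(A[i + i * n]);

    // Solve L y = b in place (dtrsv); a second solve with L^T would give A^{-1} b.
    cblas_dtrsv(CblasColMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                n, A, n, b, 1);

    std::printf("log|A| = %f, y[0] = %f\n", logdet, b[0]);
    return 0;
}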

Computing environments

Consider different environments:

1. A distributed system consists of multiple autonomous computers (nodes) that communicate through a network. A computer program that runs in a distributed system is called a distributed program. The Message Passing Interface (MPI) is a specification for an Application Programming Interface (API) that allows many computers to communicate.

2. A shared memory multiprocessing system consists of a single computer with memory that may be simultaneously accessed by one or more programs running on multiple Central Processing Units (CPUs). OpenMP (Open Multi-Processing) is an API that supports shared memory multiprocessing programming (see the sketch after this list).

3. A heterogeneous system uses more than one kind of processor, e.g., CPU & Graphics Processing Unit (GPU), or CPU & Intel's Xeon Phi Many Integrated Core (MIC).
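As an illustration (not from the original slides), a minimal OpenMP sketch in C++: one parallel loop whose iterations are split across the threads of a single shared-memory machine, with the thread count taken from OMP_NUM_THREADS.

// Minimal OpenMP sketch: sum a vector with the work split across threads.
// Compile, e.g.: g++ -fopenmp openmp_sum.cpp   (Intel icpc: -qopenmp).
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int n = 1000000;
    std::vector<double> x(n, 1.0);

    double total = 0.0;
    // Each thread sums a chunk of the loop; the reduction combines the partial sums.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) total += x[i];

    std::printf("max threads: %d, total = %f\n", omp_get_max_threads(), total);
    return 0;
}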

Which environments are right for large n settings?

MCMC necessitates iterative evaluation of the likelihood, which requires operations on large matrices. A specific hurdle is the factorization needed to compute the determinant and inverse of large dense covariance matrices; e.g., given the Cholesky factor Sigma = L L^T from dpotrf, log|Sigma| = 2 * sum_i log(L_ii), and quadratic forms follow from triangular solves.

We try to model our way out and use computing tools to overcome the complexity (e.g., covariance tapering, Kaufman et al. 2008; low-rank methods, Cressie and Johannesson 2008; Banerjee et al. 2008).

Due to slow network communication and the transport of submatrices among nodes, distributed systems are not ideal for these types of iterative large matrix operations.

My lab currently favors shared memory multiprocessing and heterogeneous systems. The newest unit is a Dell PowerEdge with 384 GB of RAM, 2 threaded 10-core Xeon CPUs, and 2 Intel Xeon Phi coprocessors with 61 cores (244 threads) each, running a Linux operating system.

Software includes OpenMP coupled with Intel MKL. MKL is a library of highly optimized, extensively threaded math routines designed for Xeon CPUs and Phi coprocessors (e.g., BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector RNGs).

So what kind of speed up should we expect from threaded BLAS and LAPACK libraries?

[Figure: execution time (seconds) versus number of Xeon cores for the Cholesky factorization (dpotrf), with n = 20000 and n = 40000.]
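As a hedged illustration (not part of the original slides), a sketch of how timings like these could be produced: time dpotrf on a random SPD matrix while varying the MKL thread count. mkl_set_num_threads, dsecnd, and the mkl.h header are standard MKL; the matrix size and thread counts below are arbitrary choices.

// Sketch: time dpotrf for several MKL thread counts (assumes Intel MKL is installed).
// Compile, e.g.: icpc bench_dpotrf.cpp -mkl=parallel   (or link MKL explicitly with g++).
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <mkl.h>

int main() {
    const int n = 5000;  // smaller than the slides' n = 20000/40000, for a quick run
    // Build an SPD matrix A = B^T B + n I.
    std::vector<double> B(n * n), A(n * n);
    for (auto &v : B) v = std::rand() / (double)RAND_MAX;
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans, n, n, n,
                1.0, B.data(), n, B.data(), n, 0.0, A.data(), n);
    for (int i = 0; i < n; ++i) A[i + i * n] += n;

    for (int nthreads : {1, 5, 10, 20}) {
        mkl_set_num_threads(nthreads);   // threaded MKL BLAS/LAPACK
        std::vector<double> C = A;       // fresh copy; dpotrf overwrites its input
        double t0 = dsecnd();
        int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, C.data(), n);
        double t1 = dsecnd();
        std::printf("threads = %2d, dpotrf time = %.2f s (info = %d)\n",
                    nthreads, t1 - t0, info);
    }
    return 0;
}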

R and threaded BLAS and LAPACK

Many core and contributed packages (including spBayes) call BLAS and LAPACK Fortran libraries. Compiling R against threaded BLAS and LAPACK provides substantial computing gains:
- processor-specific threaded BLAS/LAPACK implementations (e.g., MKL or ACML)
- processor-specific compilers (e.g., Intel's icc/ifort)

For Linux/Unix: compile R to call MKL's BLAS and LAPACK libraries (rather than the stock serial versions). Change your config.site file and configure call in the R source code directory.

My config.site file:

CC=icc
CFLAGS="-g -O3 -wd188 -ip -mp"
F77=ifort
FFLAGS="-g -O3 -mp -openmp"
CXX=icpc
CXXFLAGS="-g -O3 -mp -openmp"
FC=ifort
FCFLAGS="-g -O3 -mp -openmp"
ICC_LIBS=/opt/intel/composerxe/lib/intel64
IFC_LIBS=/opt/intel/composerxe/lib/intel64
LDFLAGS="-L$ICC_LIBS -L$IFC_LIBS -L/usr/local/lib"
SHLIB_CXXLD=icpc
SHLIB_CXXLDFLAGS=-shared

Compiling R to call MKL's BLAS and LAPACK libraries (rather than the stock serial versions): change your config.site file and configure call in the R source code directory.

My configure script:

MKL_LIB_PATH="/opt/intel/composerxe/mkl/lib/intel64"
export LD_LIBRARY_PATH=$MKL_LIB_PATH
MKL="-L${MKL_LIB_PATH} -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm"
./configure --with-blas="$MKL" --with-lapack --prefix=/usr/local/R-mkl

For many BLAS and LAPACK function calls from R, expect near linear speed up...

Mac users: see some vecLib linking hints at http://blue.for.msu.edu/gweda17/blashints.txt

Linux/Unix users: compile R with MKL as described above, or compile OpenBLAS and simply redirect R's default libRblas.so to libopenblas.so, e.g.,

/usr/local/lib/R/lib/libRblas.so -> /opt/openblas/lib/libopenblas.so

See http://blue.for.msu.edu/comp-notes for some simple examples of C++ with the MKL and Rmath libraries along with associated Makefile files; a small sketch in that spirit follows.
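For illustration only (these are not the examples from the comp-notes page), a minimal sketch of mixing the standalone Rmath library with a BLAS call from C++: draw random normals with Rmath and multiply with dgemm. It assumes Rmath has been built as a standalone library (libRmath) and that a CBLAS interface (MKL or OpenBLAS) is available; the file name and compile line are placeholders.

// Sketch: standalone Rmath RNG + Level 3 BLAS dgemm from C++ (illustrative only).
// Compile, e.g.: g++ rmath_blas.cpp -lRmath -lopenblas   (paths/libraries depend on your install).
#include <cstdio>
#include <vector>
#include <cblas.h>
#define MATHLIB_STANDALONE   // required to use Rmath outside of R
#include <Rmath.h>

int main() {
    set_seed(123, 456);                 // seed the standalone Rmath RNG

    const int n = 3;
    std::vector<double> A(n * n), B(n * n), C(n * n, 0.0);
    for (auto &a : A) a = norm_rand();  // standard normal draws from Rmath
    for (auto &b : B) b = rnorm(0.0, 2.0);

    // C = A * B via Level 3 BLAS dgemm (column-major storage).
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);

    std::printf("C[0,0] = %f\n", C[0]);
    return 0;
}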