Enabling the ARM high performance computing (HPC) software ecosystem

1 Enabling the ARM high performance computing (HPC) software ecosystem. Ashok Bhat, Product Manager, HPC and Server Tools. ARM Tech Symposia India, December 7th 2016.

2 Are these supercomputers? For example, the Samsung S6. No doubt it is pretty amazing: four Cortex-A53 cores (1.5GHz), four Cortex-A57 cores (2.1GHz) and a Mali GPU (772MHz). Random Googling* gives its performance as 34.6 GFLOPs. That means it can do about 34.6 billion floating point calculations every second. Note that Intel Haswells are now up to about 44 GFLOPs per (3.2GHz) core. Actually, the S6 would have been the world's most powerful computer back in 1992.

3 The Road to Exascale. [Figure: performance of the No. 1 system, axis in GFLOPS, with bands labelled GIGASCALE, TERASCALE, PETASCALE and EXASCALE; marked points include 34.6 GFLOPs and 93 PFLOPs. Courtesy of Oak Ridge National Laboratory, U.S. Dept. of Energy.]

4 So what do people really solve on HPC systems? Weather and climate modelling: the new Met Office machine has over … cores, 2PB RAM and weighs 140 tonnes. Computational fluid dynamics: modelling cars, planes, beaches, blood, … Computational chemistry: molecular dynamics, quantum interactions. Atomic weapons simulations: don't mess with the nuclear stockpile. Earth's mantle, galaxy formation, biological processes: you name it, someone's modelling it at scale. (Image: Kitware/R.N. Elias)

5 Parallel programming in HPC. Languages: mainly Fortran (77, 90, 95, 2003), some C and C++, no Java, Python as glue. Multiple pipelines, FMA: relies on the architecture. Vectorization: relies mainly on the compiler. OpenMP: source code instrumentation specifying how loops may be parallelized. MPI: message passing, explicitly stating how many bytes to send where.
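
The OpenMP point above can be sketched in a few lines of C. This is a minimal illustration, not code from the talk: the function name `daxpy_omp` is hypothetical, and whether the multiply-add becomes an FMA instruction or gets vectorized depends on the compiler and target.

```c
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i] for i in [0, n).
 * The pragma asks an OpenMP compiler to split the iterations across threads;
 * a compiler built without OpenMP support ignores it and runs serially.
 * daxpy_omp is a hypothetical name used only for this sketch. */
void daxpy_omp(size_t n, double alpha, const double *x, double *y)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        y[i] = alpha * x[i] + y[i];   /* multiply-add: a candidate for FMA */
}
```

With GCC or Clang the pragma takes effect when compiling with `-fopenmp`; the iterations are independent, so no locking is needed.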

6 Scalable Vector Extension (SVE)

7 Introducing the Scalable Vector Extension (SVE). General purpose 64-bit ARMv8-A. Scalable wide vectors. Extending processing capability.

8 Introducing the Scalable Vector Extension (SVE). Extends ARMv8-A with an AArch64 extension that expands vector length up to a maximum of 2048 bits. Expands fine-grain data parallelism for HPC scientific workloads. A better compiler target, which reduces software deployment effort. Beginning engagement with the open-source community and the wider ARM ecosystem.
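
One way to picture why a scalable vector length reduces deployment effort is a loop that never hard-codes the vector width. The sketch below emulates that idea in plain scalar C; it does not use SVE instructions or intrinsics. The function name `daxpy_vla_sketch` and the `vl_bits` parameter are assumptions made for illustration (real SVE code obtains the vector length from the hardware).

```c
#include <stddef.h>

/* Scalar emulation of a vector-length-agnostic loop.
 * vl_bits stands in for the hardware vector length (hypothetical parameter).
 * The same loop gives the same result for any vl_bits, because lanes that
 * fall past the end of the array are switched off by a per-lane check,
 * which plays the role of a predicate. */
void daxpy_vla_sketch(size_t vl_bits, size_t n, double alpha,
                      const double *x, double *y)
{
    size_t lanes = vl_bits / 64;              /* doubles per "vector" */
    if (lanes == 0)
        return;                               /* nothing sensible to do */
    for (size_t base = 0; base < n; base += lanes) {
        for (size_t lane = 0; lane < lanes; lane++) {
            if (base + lane < n)              /* inactive lanes are skipped */
                y[base + lane] += alpha * x[base + lane];
        }
    }
}
```

Calling it with `vl_bits` of 128 or 2048 produces identical output, which is the property that lets one binary run on implementations with different vector lengths.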

9 Post-K Japanese supercomputer: 100x capacity, 50x capability, 15x efficiency. ARMv8-A with SVE.

10 ARM HPC Ecosystem

11 ARM HPC ecosystem roadmap (each item is marked on the slide as Released, Planned or Concept)
Hardware: AppliedMicro X-Gene 1 & 2, AMD Seattle, Cavium ThunderX, AppliedMicro X-Gene 3, Phytium Mars, Cavium ThunderX2, Fujitsu Post K (SVE)
Open-source software: OpenHPC 1.2, ARM Optimized Routines, ARM Optimized Routines vector versions, Altair PBS Pro, GCC (gcc/g++/gfortran), LLVM clang, LLVM Flang
ARM HPC tools: ARM C/C++ Compiler (ahead of LLVM trunk), ARM Fortran Compiler, ARM Performance Libraries, ARM Code Advisor (Beta), ARM Code Advisor (full release), ARM Instruction Emulator
ISV software: Allinea DDT and MAP, NAG Library & Compiler, PathScale ENZO, Rogue Wave TotalView, future ISV software

12 OpenHPC now on ARM. OpenHPC is a community effort to provide a common, verified set of open source packages for HPC deployments.
ARM's participation: Silver member of OpenHPC. ARM is on the OpenHPC Technical Steering Committee in order to drive ARM architecture build support.
Status (November 2016): release out now. All packages are built on ARMv8 for both CentOS and SUSE. ARM-based machines are being used for building and also in the OpenHPC build infrastructure.
Functional areas and the components they include:
Base OS: RHEL/CentOS 7.1, SLES 12
Administrative Tools: Conman, Ganglia, Lmod, LosF, ORCM, Nagios, pdsh, prun
Provisioning: Warewulf
Resource Mgmt.: SLURM, Munge, Altair PBS Pro*
I/O Services: Lustre client (community version)
Numerical/Scientific Libraries: Boost, GSL, FFTW, Metis, PETSc, Trilinos, Hypre, SuperLU, Mumps
I/O Libraries: HDF5 (phdf5), NetCDF (including C++ and Fortran interfaces), Adios
Compiler Families: GNU (gcc, g++, gfortran)
MPI Families: OpenMPI, MVAPICH2
Development Tools: Autotools (autoconf, automake, libtool), Valgrind, R, SciPy/NumPy
Performance Tools: PAPI, Intel IMB, mpiP, pdtoolkit, TAU

13 ARM HPC tools portfolio. ARM C/C++ Compiler: commercially supported for HPC applications. ARM Performance Libraries: BLAS, LAPACK and FFT, micro-architecturally tuned. ARM Code Advisor: actionable advice to optimize your code. ARM SVE C/C++ Compiler: compiler support for the ARM Scalable Vector Extension. ARM Instruction Emulator: develop software for tomorrow's hardware today.

14 ARM Code Advisor (Beta). Combines static and dynamic information to produce actionable insights. Performance advice: compiler vectorization hints, compilation flags advice, Fortran subarray warnings, OpenMP instrumentation. Insights from compilation and runtime: compiler insights are embedded into the application binary by the ARM compilers, and the OMPT interface is used to instrument the OpenMP runtime. Extensible architecture: users can write plugins to add their own analysis information, and data is accessible via web browser, command line and REST API to support new user interfaces.

15 ARM Code Advisor (Beta): typical workflow. Source Code -> Compile -> Compiled Binary + Insight -> Profile -> Runtime Profile -> Analyse -> Web View (HTTP).

16 ARM Performance Libraries: optimized BLAS, LAPACK and FFT. Commercial 64-bit ARMv8 math libraries: commonly used low-level math routines (BLAS, LAPACK and FFT), validated with NAG's test suite, a de-facto standard. Best-in-class performance with commercial support: tuned by ARM for Cortex-A72, Cortex-A57 and Cortex-A53; maintained and supported by ARM for a wide range of ARM-based SoCs; regularly benchmarked against open source alternatives. Performance is on par with best-in-class math libraries. Silicon partners can provide tuned micro-kernels for their SoCs: partners can collaborate directly, working with our source code and test suite, or they can contribute through the open source route.

17 Deep dive into optimizing DGEMM

18 DGEMM: the maths. Double precision GEneral Matrix-Matrix multiplication: C = αA × B + βC. Normally assume α=1, β=0; however, a BLAS implementation must cater for all values. The matrices are also not necessarily square: A is m x k, B is k x n, C is m x n. In addition, the allocated storage may hold each matrix as a small part of a wider space, which needs extra parameters to handle.
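
The full operation described above, including non-square shapes and matrices embedded in wider storage, can be written as a short reference routine. This is an illustrative sketch under stated assumptions rather than the BLAS interface itself: it uses row-major storage to match the C loops in the following slides (reference BLAS is column-major and also takes transpose options), and the name `dgemm_ref` and the parameters `lda`, `ldb`, `ldc` (allocated row lengths) are chosen here for the example.

```c
#include <stddef.h>

/* C = alpha * A * B + beta * C, row-major storage (an assumption of this
 * sketch). A is m x k, B is k x n, C is m x n. lda, ldb and ldc are the
 * allocated row lengths, so each matrix may be a window in a wider array. */
void dgemm_ref(size_t m, size_t n, size_t k,
               double alpha, const double *a, size_t lda,
               const double *b, size_t ldb,
               double beta, double *c, size_t ldc)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t p = 0; p < k; p++)
                sum += a[i * lda + p] * b[p * ldb + j];
            /* With beta == 0 the old contents of C are not read at all. */
            if (beta == 0.0)
                c[i * ldc + j] = alpha * sum;
            else
                c[i * ldc + j] = alpha * sum + beta * c[i * ldc + j];
        }
    }
}
```

The leading-dimension parameters are the "extra parameters" the slide refers to: passing `lda` larger than `k` skips the padding at the end of each stored row of A.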

19 Coding DGEMM: naïve
    for (j=0; j<n; j++)
      for (i=0; i<n; i++)
        for (k=0; k<n; k++)
          c[i][j] += a[i][k]*b[k][j];
[Figure: element (i, j) of C is computed from row i of A and column j of B, both traversed along k.]
In C, memory access is stride 1 in the second array index. Good access to A. Very bad access to B.

20 DGEMM: loop reordering. Want better use of data from loaded cache lines. Make a[i][k] loop invariant for the inner loop:
    for (k=0; k<n; k++)
      for (i=0; i<n; i++)
        for (j=0; j<n; j++)
          c[i][j] += a[i][k]*b[k][j];
[Figure: row i of C is updated from element (i, k) of A and row k of B.]
Memory access for B and C is now good. However, these cache lines will need reloading for the next element of A. Enables automatic vectorization as a by-product of the optimization.
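
A compilable version of the reordered loop nest might look as follows. It is a sketch, assuming square n x n matrices stored row-major in flat arrays; the function name `matmul_kij` is hypothetical.

```c
#include <stddef.h>

/* c += a * b for square n x n row-major matrices, loops ordered k, i, j.
 * The inner loop walks b and c with stride 1, and a[i][k] is hoisted out
 * because it does not change while j varies. */
void matmul_kij(size_t n, const double *a, const double *b, double *c)
{
    for (size_t k = 0; k < n; k++) {
        for (size_t i = 0; i < n; i++) {
            const double aik = a[i * n + k];   /* invariant in the j loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

Because the inner loop is a simple stride-1 multiply-add over two arrays, it is the kind of loop compilers can usually vectorize without further hints.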

21 DGEMM: loop unrolling. Want better reuse of data from loaded cache lines. Unroll the outer loop to enable multiple A values to be used without further loads:
    for (k=0; k<n; k+=4)
      for (i=0; i<n; i++)
        for (j=0; j<n; j++)
          c[i][j] += a[i][k]*b[k][j]
                   + a[i][k+1]*b[k+1][j]
                   + a[i][k+2]*b[k+2][j]
                   + a[i][k+3]*b[k+3][j];
[Figure: row i of C is updated from four consecutive elements of row i of A and four rows of B.]
Memory access for A now uses more data from the loaded cache line. The cache is big enough for multiple lines of B to be loaded to update a single element of C. A clean-up loop is needed when n is not a multiple of the unrolling factor.
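
The clean-up loop mentioned above is not shown on the slide. One possible shape for it, assuming square row-major matrices in flat arrays and a hypothetical function name `matmul_unroll4`, is:

```c
#include <stddef.h>

/* c += a * b with the k loop unrolled by 4, plus a clean-up loop for the
 * last n % 4 values of k. Square n x n row-major matrices are assumed. */
void matmul_unroll4(size_t n, const double *a, const double *b, double *c)
{
    size_t k = 0;

    /* Main loop: four values of A are reused across the whole j loop. */
    for (; k + 4 <= n; k += 4) {
        for (size_t i = 0; i < n; i++) {
            const double a0 = a[i * n + k];
            const double a1 = a[i * n + k + 1];
            const double a2 = a[i * n + k + 2];
            const double a3 = a[i * n + k + 3];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += a0 * b[k * n + j]
                              + a1 * b[(k + 1) * n + j]
                              + a2 * b[(k + 2) * n + j]
                              + a3 * b[(k + 3) * n + j];
        }
    }

    /* Clean-up loop: remaining k values, one at a time. */
    for (; k < n; k++) {
        for (size_t i = 0; i < n; i++) {
            const double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

Note that the unrolled form adds the four products together before adding them to C, so floating-point rounding can differ slightly from the non-unrolled loop.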

22 DGEMM: cache blocking. Small matrices that fit better in cache solve faster, so split the matrix up into blocks in each direction. Note that I, J, K have unlisted assignments based on i and ii, j and jj, and k and kk (that is, I = ii+i, J = jj+j, K = kk+k):
    for (ii=0; ii<n; ii+=blk)
      for (kk=0; kk<n; kk+=blk)
        for (jj=0; jj<n; jj+=blk)
          for (k=0; k<blk; k+=4)
            for (i=0; i<blk; i++)
              for (j=0; j<blk; j++)
                c[I][J] += a[I][K]*b[K][J]
                         + a[I][K+1]*b[K+1][J]
                         + a[I][K+2]*b[K+2][J]
                         + a[I][K+3]*b[K+3][J];
[Figure: one block of C is updated from one block of A and one block of B.]
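
The blocking scheme can be sketched as a compilable routine. To keep it short this version leaves out the unrolling by 4 and instead handles matrix sizes that are not a multiple of the block size; the names `matmul_blocked` and `min_sz` are hypothetical, and square row-major matrices in flat arrays are assumed.

```c
#include <stddef.h>

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* c += a * b, processed in blk x blk tiles so that the working set of the
 * inner loops stays small. blk must be greater than zero. */
void matmul_blocked(size_t n, size_t blk,
                    const double *a, const double *b, double *c)
{
    if (blk == 0)
        return;                       /* guard against an endless loop */

    for (size_t ii = 0; ii < n; ii += blk) {
        for (size_t kk = 0; kk < n; kk += blk) {
            for (size_t jj = 0; jj < n; jj += blk) {
                /* Clip the tile at the matrix edge. */
                const size_t i_end = min_sz(ii + blk, n);
                const size_t k_end = min_sz(kk + blk, n);
                const size_t j_end = min_sz(jj + blk, n);

                for (size_t k = kk; k < k_end; k++) {
                    for (size_t i = ii; i < i_end; i++) {
                        const double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

A suitable `blk` depends on the cache sizes of the target core, so it is normally found by measurement rather than fixed in the source.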

23 DGEMM: adding OpenMP. Parallelism is key to getting the best performance. Ideally, ensure that each thread works on updating its own values: arrange the parallelism to avoid locks on data. Possibly use the cache topology (shared L2 and L3 caches) to extract further performance. Limit the number of threads for small problems.
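
One simple way to give each thread its own values to update is to parallelize over rows of C. The sketch below assumes square row-major matrices; the function name `matmul_omp_rows` and the threshold `OMP_MIN_N` are hypothetical, and the threshold value is a placeholder rather than a tuned figure.

```c
#include <stddef.h>

#define OMP_MIN_N 64   /* hypothetical cut-off below which we stay serial */

/* c += a * b with the row loop shared between threads. Each thread writes
 * only the rows of c it was given, so no locks are needed. The if clause
 * keeps small problems on a single thread. */
void matmul_omp_rows(size_t n, const double *a, const double *b, double *c)
{
    #pragma omp parallel for if(n >= OMP_MIN_N)
    for (long i = 0; i < (long)n; i++) {
        const size_t row = (size_t)i * n;
        for (size_t k = 0; k < n; k++) {
            const double aik = a[row + k];
            for (size_t j = 0; j < n; j++)
                c[row + j] += aik * b[k * n + j];
        }
    }
}
```

The row loop has to be outermost for this to work without synchronization, which is why the loop order here is i, k, j rather than the k, i, j order used in the serial slides.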

24 DGEMM: register blocking. Calculate as many elements of C as there are registers available. Work through the current block with a block of registers. 32 SIMD registers: 8 x 3 for accumulators to C, 4 to load from A, 3 to load from B. Less flow control. [Figure: a register block of C computed from A and B.]
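
The idea of holding a small block of C in registers can be shown in portable C with a 4 x 4 block of scalar accumulators. This is only an illustration: the 4 x 4 shape, the names `micro_kernel_4x4` and `matmul_regblock`, and the restriction to sizes that are multiples of 4 are assumptions of this sketch, not the 8 x 3 SIMD register layout described on the slide.

```c
#include <stddef.h>

#define MR 4   /* rows of C held in accumulators (illustrative choice) */
#define NR 4   /* columns of C held in accumulators (illustrative choice) */

/* Update the MR x NR block of c whose top-left corner is (i0, j0).
 * The accumulators are written back to memory only once, at the end. */
static void micro_kernel_4x4(size_t n, size_t i0, size_t j0,
                             const double *a, const double *b, double *c)
{
    double acc[MR][NR] = {{0.0}};

    for (size_t k = 0; k < n; k++) {
        double av[MR], bv[NR];
        for (size_t r = 0; r < MR; r++)
            av[r] = a[(i0 + r) * n + k];       /* MR loads from A */
        for (size_t s = 0; s < NR; s++)
            bv[s] = b[k * n + j0 + s];         /* NR loads from B */
        for (size_t r = 0; r < MR; r++)
            for (size_t s = 0; s < NR; s++)
                acc[r][s] += av[r] * bv[s];    /* MR * NR multiply-adds */
    }

    for (size_t r = 0; r < MR; r++)
        for (size_t s = 0; s < NR; s++)
            c[(i0 + r) * n + j0 + s] += acc[r][s];
}

/* c += a * b for square row-major matrices. Returns 0 on success and -1
 * if n is not a multiple of the register block size. */
int matmul_regblock(size_t n, const double *a, const double *b, double *c)
{
    if (n % MR != 0 || n % NR != 0)
        return -1;
    for (size_t i0 = 0; i0 < n; i0 += MR)
        for (size_t j0 = 0; j0 < n; j0 += NR)
            micro_kernel_4x4(n, i0, j0, a, b, c);
    return 0;
}
```

Each step of the k loop performs 8 loads for 16 multiply-adds, compared with 2 loads per multiply-add in the naive loop, which is the benefit register blocking is after.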

25 DGEMM: memory reordering. GEMMs are O(N^3); memory reordering is O(N^2), so it is worthwhile if we can improve performance in the kernel. Transposing B means we can read A and B sequentially. Interleaving rows: multiple rows can be loaded from a single memory stream, and vector registers can be populated from sequential reads. Fewer memory streams means better prefetching and fewer cache misses at the ends of rows. [Figure: B, its transpose B^T, and interleaved B.]
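
The transposition step can be sketched as follows. The copy costs O(N^2) operations and is paid once, after which the O(N^3) inner loop reads both operands with stride 1. The function name `matmul_bt` is hypothetical, square row-major matrices are assumed, and row interleaving is not shown.

```c
#include <stddef.h>
#include <stdlib.h>

/* c += a * b, computed via a transposed copy of b so that the inner loop
 * reads a and the copy sequentially. Returns 0 on success, -1 if the
 * temporary buffer cannot be allocated. */
int matmul_bt(size_t n, const double *a, const double *b, double *c)
{
    if (n == 0)
        return 0;

    double *bt = malloc(n * n * sizeof *bt);
    if (bt == NULL)
        return -1;

    /* O(N^2) reordering: bt[j][k] = b[k][j]. */
    for (size_t k = 0; k < n; k++)
        for (size_t j = 0; j < n; j++)
            bt[j * n + k] = b[k * n + j];

    /* O(N^3) kernel: both inner-loop streams are stride 1. */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] += sum;
        }
    }

    free(bt);
    return 0;
}
```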

26 Performance gains by moving to assembly. Guaranteed instruction ordering: no extra bits, and needed for good performance on in-order micro-architectures. Explicit vectorization: we need to manage alignment ourselves, and FMA instructions are explicitly included. Optimal register utilization and instruction selection. Explicit memory prefetching: getting data into cache before we need to use it reduces the cost of memory stalls.
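
Of the points above, explicit prefetching is the one that can be approximated without writing assembly. The sketch below uses the `__builtin_prefetch` builtin provided by GCC and Clang on a simple dot product; it is a C stand-in for what a hand-written kernel would do with prefetch instructions. The function name `dot_prefetch` and the distance `PREFETCH_DISTANCE` are hypothetical, and the distance is a placeholder, not a tuned value.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 64   /* elements ahead; hypothetical, untuned */

/* Dot product that requests data a fixed distance ahead of its use.
 * A prefetch is only a hint, so the result is the same with or without it.
 * A real kernel would issue one prefetch per cache line, not per element. */
double dot_prefetch(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, 0 = read access, 3 = keep in all caches. */
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 3);
            __builtin_prefetch(&y[i + PREFETCH_DISTANCE], 0, 3);
        }
#endif
        sum += x[i] * y[i];
    }
    return sum;
}
```

Unlike assembly, this gives no control over instruction ordering or register allocation; the compiler is still free to schedule the loop as it sees fit.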

27 Summary. Machines today are all multicore. Parallel computing is the way to use parallel machines effectively, both for server and scientific applications. Efficient code needs careful design to scale up to many cores: just writing serial code and expecting it to parallelize will not work. As machines get larger, their power consumption is reaching megawatts, so lower power technologies are more important than ever. ARM HPC is well placed to be a major rival to current legacy architectures. Work is happening today to ensure the ARM HPC software ecosystem is ready to support our partners' deployments.

28 The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright 2016 ARM Limited


More information

Japan s post K Computer Yutaka Ishikawa Project Leader RIKEN AICS

Japan s post K Computer Yutaka Ishikawa Project Leader RIKEN AICS Japan s post K Computer Yutaka Ishikawa Project Leader RIKEN AICS HPC User Forum, 7 th September, 2016 Outline of Talk Introduction of FLAGSHIP2020 project An Overview of post K system Concluding Remarks

More information

Linux HPC Software Stack

Linux HPC Software Stack Linux HPC Software Stack Makia Minich Clustre Monkey, HPC Software Stack Lustre Group April 2008 1 1 Project Goals Develop integrated software stack for Linux-based HPC solutions based on Sun HPC hardware

More information

Arm's role in co-design for the next generation of HPC platforms

Arm's role in co-design for the next generation of HPC platforms Arm's role in co-design for the next generation of HPC platforms Arm HPC workshop at SJTU July 2018 Filippo Spiga Software and Large Scale Systems Outline Arm and High Performance Computing Introducing

More information

Overcoming the Barriers to Sustained Petaflop Performance. William D. Gropp Mathematics and Computer Science

Overcoming the Barriers to Sustained Petaflop Performance. William D. Gropp Mathematics and Computer Science Overcoming the Barriers to Sustained Petaflop Performance William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp But First Are we too CPU-centric? What about I/O? What do applications

More information

Portable Power/Performance Benchmarking and Analysis with WattProf

Portable Power/Performance Benchmarking and Analysis with WattProf Portable Power/Performance Benchmarking and Analysis with WattProf Amir Farzad, Boyana Norris University of Oregon Mohammad Rashti RNET Technologies, Inc. Motivation Energy efficiency is becoming increasingly

More information

Intel + Parallelism Everywhere. James Reinders Intel Corporation

Intel + Parallelism Everywhere. James Reinders Intel Corporation Intel + Parallelism Everywhere James Reinders Intel Corporation How to win at parallel programming 2 My Talk Hardware Parallelism and some insights INNOVATION: vectorization INNOVATION: tasking 3 Helping

More information

ARM instruction sets and CPUs for wide-ranging applications

ARM instruction sets and CPUs for wide-ranging applications ARM instruction sets and CPUs for wide-ranging applications Chris Turner Director, CPU technology marketing ARM Tech Forum Taipei July 4 th 2017 ARM computing is everywhere #1 shipping GPU in the world

More information

IT4Innovations national supercomputing center. Branislav Jansík

IT4Innovations national supercomputing center. Branislav Jansík IT4Innovations national supercomputing center Branislav Jansík branislav.jansik@vsb.cz Anselm Salomon Data center infrastructure Anselm and Salomon Anselm Intel Sandy Bridge E5-2665 2x8 cores 64GB RAM

More information

Toward Building up ARM HPC Ecosystem

Toward Building up ARM HPC Ecosystem Toward Building up ARM HPC Ecosystem Shinji Sumimoto, Ph.D. Next Generation Technical Computing Unit FUJITSU LIMITED Sept. 12 th, 2017 0 Outline Fujitsu s Super computer development history and Post-K

More information

Developing, Debugging, and Optimizing GPU Codes for High Performance Computing with Allinea Forge

Developing, Debugging, and Optimizing GPU Codes for High Performance Computing with Allinea Forge Developing, Debugging, and Optimizing GPU Codes for High Performance Computing with Allinea Forge Ryan Hulguin Applications Engineer ryan.hulguin@arm.com Agenda Introduction Overview of Allinea Products

More information

SUSE Linux Entreprise Server for ARM

SUSE Linux Entreprise Server for ARM FUT89013 SUSE Linux Entreprise Server for ARM Trends and Roadmap Jay Kruemcke Product Manager jayk@suse.com @mr_sles ARM Overview ARM is a Reduced Instruction Set (RISC) processor family British company,

More information

Exploring System Coherency and Maximizing Performance of Mobile Memory Systems

Exploring System Coherency and Maximizing Performance of Mobile Memory Systems Exploring System Coherency and Maximizing Performance of Mobile Memory Systems Shanghai: William Orme, Strategic Marketing Manager of SSG Beijing & Shenzhen: Mayank Sharma, Product Manager of SSG ARM Tech

More information

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning

More information

Overview Implicit Vectorisation Explicit Vectorisation Data Alignment Summary. Vectorisation. James Briggs. 1 COSMOS DiRAC.

Overview Implicit Vectorisation Explicit Vectorisation Data Alignment Summary. Vectorisation. James Briggs. 1 COSMOS DiRAC. Vectorisation James Briggs 1 COSMOS DiRAC April 28, 2015 Session Plan 1 Overview 2 Implicit Vectorisation 3 Explicit Vectorisation 4 Data Alignment 5 Summary Section 1 Overview What is SIMD? Scalar Processing:

More information

ARM TrustZone for ARMv8-M for software engineers

ARM TrustZone for ARMv8-M for software engineers ARM TrustZone for ARMv8-M for software engineers Ashok Bhat Product Manager, HPC and Server tools ARM Tech Symposia India December 7th 2016 The need for security Communication protection Cryptography,

More information

Meeting of the Technical Steering Committee (TSC) Board

Meeting of the Technical Steering Committee (TSC) Board http://openhpc.community Meeting of the Technical Steering Committee (TSC) Board Tuesday, July 31 th 2018 11:00am ET Meeting Logistics https://zoom.us/j/556149142 United States : +1 (646) 558-8656 -Meeting

More information

User Training Cray XC40 IITM, Pune

User Training Cray XC40 IITM, Pune User Training Cray XC40 IITM, Pune Sudhakar Yerneni, Raviteja K, Nachiket Manapragada, etc. 1 Cray XC40 Architecture & Packaging 3 Cray XC Series Building Blocks XC40 System Compute Blade 4 Compute Nodes

More information

Advanced School in High Performance and GRID Computing November Mathematical Libraries. Part I

Advanced School in High Performance and GRID Computing November Mathematical Libraries. Part I 1967-10 Advanced School in High Performance and GRID Computing 3-14 November 2008 Mathematical Libraries. Part I KOHLMEYER Axel University of Pennsylvania Department of Chemistry 231 South 34th Street

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Scientific Programming in C XIV. Parallel programming

Scientific Programming in C XIV. Parallel programming Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence

More information

Advanced optimizations of cache performance ( 2.2)

Advanced optimizations of cache performance ( 2.2) Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped

More information

High-Performance Scientific Computing

High-Performance Scientific Computing High-Performance Scientific Computing Instructor: Randy LeVeque TA: Grady Lemoine Applied Mathematics 483/583, Spring 2011 http://www.amath.washington.edu/~rjl/am583 World s fastest computers http://top500.org

More information

Achieving Peak Performance on Intel Hardware. Intel Software Developer Conference London, 2017

Achieving Peak Performance on Intel Hardware. Intel Software Developer Conference London, 2017 Achieving Peak Performance on Intel Hardware Intel Software Developer Conference London, 2017 Welcome Aims for the day You understand some of the critical features of Intel processors and other hardware

More information

Linear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre

Linear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre Linear Algebra libraries in Debian Who I am? Core developer of Scilab (daily job) Debian Developer Involved in Debian mainly in Science and Java aspects sylvestre.ledru@scilab.org / sylvestre@debian.org

More information

Code-Agnostic Performance Characterisation and Enhancement

Code-Agnostic Performance Characterisation and Enhancement Code-Agnostic Performance Characterisation and Enhancement Ben Menadue Academic Consultant @NCInews Who Uses the NCI? NCI has a large user base 1000s of users across 100s of projects These projects encompass

More information

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA

More information

Meeting of the Technical Steering Committee (TSC) Board

Meeting of the Technical Steering Committee (TSC) Board http://openhpc.community Meeting of the Technical Steering Committee (TSC) Board Tuesday, October 17th 11:00am ET Meeting Logistics https://www.uberconference.com/jeff_ef United States : +1 (510) 224-9559

More information

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016 AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING BILL.BRANTLEY@AMD.COM, FELLOW 3 OCTOBER 2016 AMD S VISION FOR EXASCALE COMPUTING EMBRACING HETEROGENEITY CHAMPIONING OPEN SOLUTIONS ENABLING LEADERSHIP

More information

Post-K Supercomputer Overview. Copyright 2016 FUJITSU LIMITED

Post-K Supercomputer Overview. Copyright 2016 FUJITSU LIMITED Post-K Supercomputer Overview 1 Post-K supercomputer overview Developing Post-K as the successor to the K computer with RIKEN Developing HPC-optimized high performance CPU and system software Selected

More information

Faster Code for Free: Linear Algebra Libraries. Advanced Research Compu;ng 22 Feb 2017

Faster Code for Free: Linear Algebra Libraries. Advanced Research Compu;ng 22 Feb 2017 Faster Code for Free: Linear Algebra Libraries Advanced Research Compu;ng 22 Feb 2017 Outline Introduc;on Implementa;ons Using them Use on ARC systems Hands on session Conclusions Introduc;on 3 BLAS Level

More information

Introduction to PICO Parallel & Production Enviroment

Introduction to PICO Parallel & Production Enviroment Introduction to PICO Parallel & Production Enviroment Mirko Cestari m.cestari@cineca.it Alessandro Marani a.marani@cineca.it Domenico Guida d.guida@cineca.it Nicola Spallanzani n.spallanzani@cineca.it

More information