TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
Eric Kelmelis, 28 March 2018

OVERVIEW
» BACKGROUND: Evolution of processing hardware
» CROSS-PLATFORM KERNEL DEVELOPMENT: Write once, target multiple hardware platforms
» SCHEDULED TASK-GRAPH COMPUTING: Building scalable heterogeneous software

BUILDING SCALABLE SOFTWARE
Software Design
» Initial codes are demonstrated as single-core implementations
» This does not provide enough performance for many real-world problems
» In the past, parallelism was added through MPI to solve larger jobs
(Figure: Job -> CPU -> Results)

MPI-BASED PARALLELIZATION
» By using MPI, parallelism between nodes becomes possible
» Larger problems can be solved
» Processing time is reduced
MPI is designed for inter-node communication
(Figure: a job is split across nodes, each with its own CPU; the nodes exchange data over MPI and combine their results)
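Below is a minimal sketch of the MPI pattern this slide describes, assuming a job that can be split evenly by rank; the work() function and the problem size are placeholders:

    // mpi_split.cpp: split a job across MPI ranks and combine the results on rank 0
    #include <mpi.h>
    #include <cstdio>

    // Placeholder for the per-node computation over a sub-range of the job.
    static double work(int first, int last) {
        double acc = 0.0;
        for (int i = first; i < last; ++i) acc += 0.5 * i;
        return acc;
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1000000;            // total problem size (assumed to divide evenly)
        const int chunk = n / size;
        double local = work(rank * chunk, (rank + 1) * chunk);

        double total = 0.0;               // combine partial results on rank 0
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) std::printf("result = %f\n", total);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicxx and launched with mpirun, each node computes its own chunk and only the final reduction crosses node boundaries.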

MODERN COMPUTING HARDWARE IS DIVERSE
In 2001, there were few processing options offered. Since then, many new architectures have been introduced.
Timeline: 2001 single-core CPUs, FPGAs, DSPs; 2005 floating-point GPUs; 2006 Cell Processor; 2007 CUDA-compatible GPUs; 2009 Tegra, CULA Dense; 2010 Exynos; 2011 Zynq, CULA Sparse; 2012 Xeon Phi, TILE-Gx; 2013 ORNL Titan; 2016 CULA reaches 16k users
Platform experience includes: multicore microprocessors, computer clusters, graphics processing units (GPUs), Xeon Phis, field-programmable gate arrays (FPGAs), digital signal processors (DSPs), hybrid SoCs, the Cell processor, etc.

DIVERSE PROCESSING LANDSCAPE
Developers have numerous processing architectures to choose from:
» Multicore (CPU: x86, ARM, Power, TILE)
» Vector processor / massive multicore (GPU: NVIDIA, Xeon Phi, ATI)
» Programmable logic (FPGA: Xilinx, Altera)
These architectures are also being combined into hybrid systems (NVIDIA Tegra, Xilinx Zynq, Intel, etc.)

MODERN COMPUTING NODE
Computing nodes, from large HPC systems to small embedded/mobile devices, are becoming hybrid systems
(Figure: nodes combining one or more CPUs with one or more GPUs)

BURDEN IS ON SOFTWARE DEVELOPERS
The days of writing software for only single-core Intel chips are past. Software must be parallel, scalable, cross-platform, hybrid, and future compatible.
» Mobile/Embedded: ARM, Zynq, Tegra, Exynos, Atom
» Desktop/Server: Intel multicore, Intel Xeon Phi, IBM Power, ARM, NVIDIA Quadro/Tesla
» High-Performance Computer: massive numbers of commodity components connected by fast networks

PROGRAMMING MODERN COMPUTING DEVICES
Varying computing platforms and integrated hybrid systems allow for faster, more power-efficient computation, but this comes at the price of programming complexity.
Two approaches:
1. A team of experts in using each of these new devices
2. Tools that allow programmers to use these devices effectively without the need to learn and re-learn specialized skills

APPROACHES TO USING HYBRID SYSTEMS
Libraries: experts build optimized software; users build applications on top of these tools
Operation scheduling (kernels): tools analyze software at compile time and efficiently order operations
Directives: developers annotate their code with hints for the compiler, for example:

    #pragma less_power
    capture_frame();
    detect_objects_lightweight(&object_detected);
    if (object_detected) {
        #pragma more_performance
        detect_objects_precise(&object_detected);
    }

Task scheduling (applications): tasks are scheduled at runtime based on system characteristics

EFFICIENT SOFTWARE REQUIRED AT TWO LEVELS
Computational Kernels
» Must be able to write efficient execution units
» Often require multiple versions to work with multiple devices
Scalable Applications
» Application must be able to scale to utilize available hardware
» Often multiple deployment platforms or scenarios to consider

WRITING CROSS-PLATFORM KERNELS

DATAFLOW TOOLCHAIN VISION
EMP Flow Graph Toolchain (figure): existing C++ code and an EDSL version of the bottleneck code go through source pre-processing and standard compilation; the bottleneck passes through flow-graph generation, platform-targeted code generation, and platform-targeted optimization; a standard linker then combines the object file and the optimized portion into the final binary
» Leverage industry-standard compilers and linking tools
» Avoid duplication of effort and simplify long-term maintenance
» Allow users to rewrite only the code portions that are problematic

BUILDING PORTABLE COMPUTATIONAL KERNELS
(Figure: user-developed code passes through the DFG builder and DFG compiler, the EMP tools, which sit alongside the standard toolchains used for the rest of the build)

USING AN INTERMEDIATE DFG APPROACH
Front-End
» Generate a dataflow graph (DFG) representation from a C++ program
» Function to compute, e.g. an FFT
Back-End
» Schedule the dataflow graph representation on parallel hardware
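As an illustration only, a dataflow graph can be represented as a list of operation nodes plus the indices of the nodes each one consumes; the structures below are assumptions for this sketch, not the actual EMP representation:

    // dfg_sketch.cpp: an illustrative dataflow-graph representation
    #include <string>
    #include <utility>
    #include <vector>

    // One node is one operation (e.g. a stage of an FFT); edges are implied
    // by the indices of the producer nodes it reads from.
    struct DfgNode {
        std::string op;
        std::vector<int> inputs;
    };

    struct DfgGraph {
        std::vector<DfgNode> nodes;
        int add(std::string op, std::vector<int> inputs) {
            nodes.push_back({std::move(op), std::move(inputs)});
            return static_cast<int>(nodes.size()) - 1;
        }
    };

    int main() {
        DfgGraph g;
        int a = g.add("load", {});
        int b = g.add("load", {});
        int c = g.add("complex_mul", {a, b});   // depends on both loads
        g.add("store", {c});
        return 0;                               // a back-end would schedule g.nodes on hardware
    }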

EMP EDSL: FFT EXAMPLE
» Performing code analysis/pre-processing before compilation will allow us to translate user directives and placeholder types into code that is compilable by industry-standard C++ compilers
» Leverage existing standard toolchains to shorten development time
Proposed syntax shown in non-templated and templated forms (code images not transcribed)
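The proposed syntax itself appears only as images on the slide. Purely as a hypothetical illustration of the placeholder-type idea, every name below (emp::Vector, emp::fft, transform) is invented and is not the actual EMP EDSL:

    // edsl_sketch.cpp: hypothetical placeholder-type syntax; all names are invented
    #include <complex>
    #include <cstdio>
    #include <utility>
    #include <vector>

    namespace emp {
        // Placeholder type: in a real EDSL, operations on it would record
        // dataflow-graph nodes rather than execute immediately.
        template <typename T>
        struct Vector { std::vector<T> data; };

        // Stand-in "fft": a real front-end would emit a graph node here.
        template <typename T>
        Vector<std::complex<T>> fft(Vector<std::complex<T>> in) { return in; }
    }

    // Templated user code: the same source can be instantiated for float or double,
    // matching the non-templated vs. templated variants mentioned on the slide.
    template <typename T>
    emp::Vector<std::complex<T>> transform(emp::Vector<std::complex<T>> signal) {
        return emp::fft(std::move(signal));
    }

    int main() {
        emp::Vector<std::complex<float>> sig{std::vector<std::complex<float>>(8, {1.0f, 0.0f})};
        auto out = transform(sig);
        std::printf("output size: %zu\n", out.data.size());
        return 0;
    }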

COMPARISON: STANDARD C++ VS. EMP EDSL
(Figure: standard C++ code shown side by side with the EM Photonics EDSL version)
» The EM Photonics EDSL looks similar to standard C++ code, making it intuitive for the user
» In the future, automated tools can be created to take properly formatted C++ code directly into our toolchain

TARGETING DIFFERENT BACK-ENDS
(Figure: running a function on the CPU vs. running the same function on the GPU)
» Changing to the GPU backend requires the user to make minimal changes
» Memory management and copying are handled transparently
» No need to use the CUDA driver or runtime API, or to write any kernels
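A minimal sketch of what "minimal changes" can look like: only a backend tag changes at the call site. The Backend enum and the template-parameter dispatch are assumptions for this sketch; in the real toolchain the GPU path would generate device code and move data transparently, whereas here both paths run on the host:

    // backend_sketch.cpp: illustrative backend selection; the API shown is an assumption
    #include <cstdio>
    #include <numeric>
    #include <vector>

    enum class Backend { Cpu, Gpu };

    // Hypothetical dispatcher: a real toolchain would lower the Gpu instantiation
    // to device code; this sketch just computes on the host either way.
    template <Backend B>
    double sum(const std::vector<double>& v) {
        return std::accumulate(v.begin(), v.end(), 0.0);
    }

    int main() {
        std::vector<double> v(1024, 1.0);
        std::printf("cpu: %f\n", sum<Backend::Cpu>(v));
        std::printf("gpu: %f\n", sum<Backend::Gpu>(v));  // only the template argument changes
        return 0;
    }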

FFT OPERATION GRAPH
» We ran our analysis on our base FFT example
» 370 nodes were divided into 23 work sets
» Work units within a set can be scheduled independently
» What if we unrolled the recursion?

RECURSIVE UNROLLING AND EXPANSION OF PARALLEL WORK
» Goal: schedule several recursive stack frames' worth of work at once
» For problems that recurse deeply, this can greatly increase the amount of parallel work available to schedule
» Test: we unrolled our FFT example one additional time
» Easily accomplished by setting a depth parameter for compile-time unrolling (see the sketch below)
» Can also be done via dataflow-graph mutation for reduced compile times
» Result: 3610 nodes divided into 33 work sets
» Significantly more operations per work set
» Observed trend for additional unrolling depths: roughly 9-10x more nodes and about 10 more work sets per unroll
» Unroll more for highly parallel target platforms
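A minimal sketch of compile-time recursion unrolling controlled by a depth parameter, using a simple divide-and-conquer sum rather than the actual FFT code:

    // unroll_sketch.cpp: compile-time recursion unrolling via a depth parameter
    #include <cstdio>

    // Recursive sum over [lo, hi). While Depth > 0 the recursion is expanded at
    // compile time, exposing 2^Depth independent sub-ranges that a scheduler
    // could run in parallel.
    template <int Depth>
    long long range_sum(long long lo, long long hi) {
        long long mid = lo + (hi - lo) / 2;
        return range_sum<Depth - 1>(lo, mid) + range_sum<Depth - 1>(mid, hi);
    }

    // Depth 0: fall back to a plain loop (one work unit).
    template <>
    long long range_sum<0>(long long lo, long long hi) {
        long long acc = 0;
        for (long long i = lo; i < hi; ++i) acc += i;
        return acc;
    }

    int main() {
        std::printf("%lld\n", range_sum<3>(0, 1 << 20));  // unrolled 3 levels -> 8 leaf work units
        return 0;
    }

With Depth = 3 the recursion is expanded into eight independent leaf ranges at compile time, which is the effect the slide relies on to expose more schedulable parallel work.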

BASE SINGLE-THREADED PERFORMANCE: FFT EXAMPLE
» We benchmarked our Embedded Domain-Specific Language (EDSL) FFT example code against a reference implementation written in C++
» Compiled with Clang in release mode with the best optimization flags for the C++ reference
» Care was taken with the C++ reference to avoid introducing easily avoided, unnecessary copies
» Achieved performance parity
» Optimizations can still be performed
» Our EDSL version also contains full dependency information that can be exploited for automated parallelism
(Figure: FFT example timing results, C++ reference vs. C++ EDSL, for 1D FFT sizes from 32 to 16384; execution time in microseconds on a logarithmic scale from 1 to 10000)

SCALABLE SOFTWARE FOR HYBRID COMPUTERS

WORK QUEUE APPROACH TO SCALABILITY
Traditional work partitioning: the job is broken up into pieces and distributed to the processors in the system
Modern problem partitioning: create a queue of work that can be dynamically distributed to processors (a minimal sketch follows)
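A minimal sketch of the dynamic-queue idea using standard C++ threads; the work items and the per-item processing are placeholders:

    // work_queue_sketch.cpp: dynamic distribution of work items from a shared queue
    #include <algorithm>
    #include <atomic>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    int main() {
        std::queue<int> work;                       // queue of work-item ids
        for (int i = 0; i < 100; ++i) work.push(i);

        std::mutex m;
        std::atomic<long> done{0};

        auto worker = [&]() {
            for (;;) {
                int item;
                {                                   // pull the next item, if any
                    std::lock_guard<std::mutex> lk(m);
                    if (work.empty()) return;
                    item = work.front();
                    work.pop();
                }
                done += item;                       // stand-in for real processing
            }
        };

        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);
        for (auto& t : pool) t.join();
        std::printf("processed sum = %ld\n", done.load());
        return 0;
    }

Fast workers simply pull more items, so the load balances itself without an upfront static partition.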

PROPOSED SOLUTION: TASK-BASED PROGRAMMING
» A task is a pure function which receives inputs and delivers outputs according to a task graph
» Details of task invocation can be abstracted from the programmer
» Tasks can be developed, tested, and characterized as independent units
» Synchronization between tasks is defined purely by the connections in the task graph
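A small sketch of tasks as pure functions whose only synchronization comes from their input/output connections; std::async stands in here for the task framework described in the rest of the deck:

    // task_graph_sketch.cpp: pure-function tasks synchronized only by their inputs
    #include <cstdio>
    #include <future>

    // Pure-function tasks: outputs depend only on inputs.
    static int taskA() { return 2; }
    static int taskB() { return 3; }
    static int taskC(int a, int b) { return a * b; }   // C depends on A and B

    int main() {
        // The "graph" is expressed through futures: taskC cannot start until both
        // of its inputs are ready, and no other synchronization is needed.
        auto fa = std::async(std::launch::async, taskA);
        auto fb = std::async(std::launch::async, taskB);
        int c = taskC(fa.get(), fb.get());
        std::printf("C = %d\n", c);
        return 0;
    }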

COMPONENTS OF A TASK-BASED APPLICATION
» Task definitions (name, inputs, outputs)
» Graph: how the tasks connect
» Schedule: the order in which the tasks run
(Figure: example tasks A-F, their graph, and a resulting schedule)

IMPLEMENTING TASKS: BUILDING MODULES
(Figure: the module builder, a new tool, combines programmer-defined task descriptions, kernels (e.g. kernel K1 for task T1), and constants, using existing compilers to produce modules containing the task/kernel variants T1:K1, T1:K2, T1:K3, T2:K1, T2:K2, T3:K1, T4:K1, T4:K2)

IMPLEMENTING TASKS: MODULE
Develop computational functionality for each targeted processing device. A module such as UpdateVariables() collects a CPU kernel, a CUDA kernel, an OpenCL kernel, an Intel MIC kernel, and kernels for future architectures behind one interface, for example:

    // Call CUDA kernel
    dim3 dimgrid( NX/TILE_WIDTH, NZ/TILE_WIDTH );
    dim3 dimblock( TILE_WIDTH, TILE_WIDTH );
    primitive_variables_ker<<<dimgrid, dimblock>>>(p_cons, p_rho, p_vx, p_vy, p_vz, p_p, p_bx, p_by, p_bz, p_psic);

    // Call CPU code
    primitive_variables_cpu(p_cons, p_rho, p_vx, p_vy, p_vz, p_p, p_bx, p_by, p_bz, p_psic);

    // Call Intel Xeon Phi code
    primitive_variables_mic(p_cons, p_rho, p_vx, p_vy, p_vz, p_p, p_bx, p_by, p_bz, p_psic);

SCHEDULER
» Analyze dependencies between tasks
» Dynamically determine the best hardware for each task
» Manage data movement between devices
» Use live profiling results to better schedule future tasks
(Figure: the scheduler takes the task graph A-H and produces an execution schedule)

TASK QUEUES & WORKER POOLS
» The task queue holds tasks to be executed
» The worker pool manages the hardware devices available for computation
» The scheduler assigns tasks to free devices in the worker pool, choosing the best device based on worker load, data locality, and performance characteristics (a simplified sketch follows)
(Figure: the scheduler moves tasks from the queue to workers, handling data uploads and downloads)
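A simplified sketch of the assignment step, choosing the least-loaded worker for each task; a real scheduler would also weigh data locality and profiled per-device performance, and the structures here are assumptions for illustration:

    // scheduler_sketch.cpp: assigning queued tasks to the least-loaded worker
    #include <cstdio>
    #include <queue>
    #include <string>
    #include <vector>

    struct Worker {
        std::string device;   // e.g. "cpu0", "gpu0"
        int queued = 0;       // crude load metric; locality and profiling omitted
    };

    int main() {
        std::vector<Worker> pool = {{"cpu0"}, {"cpu1"}, {"gpu0"}};
        std::queue<std::string> tasks;
        for (int i = 0; i < 6; ++i) tasks.push("task" + std::to_string(i));

        while (!tasks.empty()) {
            // Choose the worker with the smallest queue.
            Worker* best = &pool[0];
            for (auto& w : pool)
                if (w.queued < best->queued) best = &w;
            std::printf("%s -> %s\n", tasks.front().c_str(), best->device.c_str());
            best->queued++;   // in a real system this dispatches the task and tracks completion
            tasks.pop();
        }
        return 0;
    }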

EXAMPLE: SORAD TASK GRAPH (figure)

SAMPLE TASK INTERFACE: GETCHEMISTRY (figure)

CODE TO BUILD SORAD TASK GRAPH (SUBSET) (code shown as an image; a hypothetical construction sketch follows)
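Since the slide's code is only an image, here is a hypothetical illustration of what programmatically connecting tasks into a graph can look like; the TaskGraph type and all task names except GetChemistry (mentioned elsewhere in the deck) are invented:

    // sorad_graph_sketch.cpp: hypothetical task-graph construction; API names are invented
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct TaskGraph {
        // adjacency: producer task -> consumer tasks
        std::map<std::string, std::vector<std::string>> edges;
        void connect(const std::string& from, const std::string& to) {
            edges[from].push_back(to);
        }
    };

    int main() {
        TaskGraph g;
        // A subset of a solver-style pipeline; task names are placeholders.
        g.connect("LoadState",       "GetChemistry");
        g.connect("GetChemistry",    "ComputeFluxes");
        g.connect("ComputeFluxes",   "UpdateVariables");
        g.connect("UpdateVariables", "StoreResults");

        for (const auto& e : g.edges)
            for (const auto& to : e.second)
                std::printf("%s -> %s\n", e.first.c_str(), to.c_str());
        return 0;
    }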

EXAMPLE TASK GRAPH (figure)

PERFORMANCE COMPARISON
(Chart: execution times from 0 to 7 seconds for the Original, Static CPU-only, Dynamic CPU-only, Static Hybrid, and Dynamic Hybrid versions, split into execution time and allocation time)
Dynamically scheduled versions took about 1 second to handle data allocation

TASK-BASED SORAD VERSION
Task-Based SORAD
» SORAD is broken down into component tasks connected in a graph
» Advantages: easier to maintain, test, and extend; can be scheduled
Statically Scheduled
» An upfront static schedule can be built to execute the SORAD task graph
» Advantages: better performance than regular SORAD; better portability
Dynamically Scheduled
» Could be scheduled with any framework (e.g. Intel's TBB)
» Not enough work to overcome the dynamic scheduling overhead

POINT_SOLVE_5 SELECTION
point_solve_x
» Iterative method for solving a sparse matrix product
» point_solve_5: operates on 5 x 5 square matrices; this version was the focus of this effort
» point_solve_n: operates on N x N square matrices
Breaking into Tasks
» First the monolithic code needed to be broken into tasks
» Each task has a single responsibility within the code
» Tasks were then wrapped and connected into a task graph which could be executed

POINT_SOLVE_5 TASK GRAPH (figure)

MULTIPLE TASK VERSIONS
Each point_solve_5 subtask exists in several versions:
» Serial version: used as the reference baseline for testing
» C++ TBB: utilizing modern parallelization technology (see the sketch below)
» C/FORTRAN + OpenMP: able to call existing/developed kernels
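A minimal sketch of what a TBB variant of one subtask can look like, using tbb::parallel_for over blocks; the 5x5 update in the loop body is a placeholder, not the actual point_solve_5 kernel:

    // tbb_variant_sketch.cpp: a TBB data-parallel variant of a subtask (placeholder body)
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <vector>

    int main() {
        const std::size_t nblocks = 100000;
        std::vector<double> x(nblocks * 5, 1.0);

        // Each iteration updates one 5-entry block row; the arithmetic is a
        // stand-in for the real point_solve_5 kernel.
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, nblocks),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t b = r.begin(); b != r.end(); ++b)
                    for (int i = 0; i < 5; ++i)
                        x[b * 5 + i] *= 0.5;
            });
        return 0;
    }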

POINT_SOLVE_5 TASK GRAPH
» Each PointSolve task calls an underlying subgraph of tasks which perform the full operation
» Handles merging of output
(Figure: input data and input data tags flow through the task and a bookkeeper to the output)

POINT_SOLVE_5 TASK GRAPH
» Each PointSolve task calls an underlying subgraph of tasks which perform the full operation
» Handles merging of output
» Each task node is marked with data chunk bounds
(Figure: input data and input data tags flow through the task and a bookkeeper to the output)

POINT_SOLVE_5 SUBGRAPH
» This handles connecting the defined tasks for a given chunk of data
(Figure: input data and input data tags flow through the task and a bookkeeper to the output)

CPU USAGE: POINT SOLVE 5 WITH ALL DATA
» This trace illustrates the application's task parallelism separate from the data parallelism
» Where possible, the task-based scheduling framework overlapped task execution
» The non-colored areas represent idle time where additional tasks could not be scheduled due to dependencies between tasks

CPU USAGE: POINT SOLVE 5 WITH PARTITIONED DATA
» This trace illustrates the application's task parallelism combined with the data parallelism achieved through chunking of the data
» Where possible, the task-based scheduling framework (TBSF) overlaps task execution
» Much of the idle time seen in the previous trace has been filled in by tasks operating on non-dependent chunks of data
» There is still idle time at the end of the iteration
» Iteration support has now been added and is being tested

EXECUTION COMPARISON

                                            Coarse Dataset (1 million data points)   Fine Dataset (5 million data points)
    Baseline tasks using OpenMP             7.53661 seconds                          41.23417 seconds
    Framework-scheduled tasks (no OpenMP)   5.115873 seconds                         32.3346 seconds
    Speed increase                          32.12%                                   21.58%

» A comparison was done between the tasks scheduled through the framework and the same tasks utilizing OpenMP for scheduling on the CPU
» A speed increase in average execution time was seen in the framework-scheduled version for both sets of reference data

SCALABLE HYBRID COMPUTING
(Figure: a compute task such as UpdateVariables(), with CPU and GPU code, runs on HPC hardware under the scheduling framework)
» Diagnostics: real-time hardware monitoring and fault tolerance (e.g. Node 1: OK, Node 2: OK, Node 3: Offline)
» Performance: improve code by adding new high-performance kernels
» HPC hardware: scale performance with hardware upgrades

UNIFIED STRATEGY

EM PHOTONICS TOOL STRATEGY
» Tools for cross-platform computational kernel development
» Accelerated math libraries (CULA: NVIDIA only; Affinimath: cross-platform)
» Dynamic task-based scheduling for heterogeneous computing systems
» Tools for power-efficient programming of multicore and heterogeneous embedded processors

COMBINING TOOLS
Cross-platform kernels + dynamic scheduling + off-the-shelf libraries
(Figure: a task graph in which some tasks are cross-platform kernels and others are library calls)

CONCLUSION
Building Modern Software
» The computing landscape has changed; platform options are diverse
» Tools must account for mixed-device deployments
Cross-Platform Kernels
» Developing a tool that allows for cross-platform software development
» Write a single computational kernel and build it for different processors
Task-Based Scheduler
» Break the application down into a series of interconnected tasks
» Use automated tools to partition those tasks over heterogeneous hardware