XPDL: Extensible Platform Description Language to Support Energy Modeling and Optimization

1 / 25 XPDL: Extensible Platform Description Language to Support Energy Modeling and Optimization Christoph Kessler, Lu Li, Aras Atalar and Alin Dobre christoph.kessler@liu.se, lu.li@liu.se

2 / 25 Agenda: Motivation; A Review of PDL; XPDL Features; XPDL Toolchain and Runtime Query API; Related Work and Comparison; Conclusion and Future Work.

Acknowledgment 3 / 25

4 / 25 EXCESS Project (2013-2016): an EU FP7 project on holistic energy optimization for embedded and HPC systems. More info: http://excess-project.eu/

5 / 25 Motivation. Optimization can be platform-independent (algorithmic improvement) or platform-dependent (SIMD, GPU, etc.). Platform-dependent optimizations such as SIMD can yield significant performance and energy improvements, but they are usually tuned manually or only partly automated, and they require understanding of machine-specific features. Automating systematic platform-dependent optimization is both interesting and challenging, and is key to adaptivity and retargetability. Prerequisite: a formal platform description language modeling the optimization-relevant platform properties.

6 / 25 Motivation. Platform = hardware + system software. Previous work: PDL (Platform Description Language), which is limited in flexibility and scalability. XPDL is designed to overcome those limitations; furthermore, it adds new features to better support energy optimization, such as microbenchmark generation.

PDL: the Predecessor of XPDL. PDL, the PEPPHER Platform Description Language (Sandrieser et al. 2012) [1], was developed in the EU FP7 project PEPPHER (2010-2012). It is XML-based and was the first description language for heterogeneous systems. It models the control structure of master and slave PUs and enables conditional composition (Dastgeer et al. 2014) [2]. 7 / 25

A Review of PDL. The main structuring is by control relation (a software aspect!) rather than by hardware organization, which makes descriptions hard to change; there are also modularity issues. Figure 1: A PDL example (Sandrieser et al.) [1]. 8 / 25

9 / 25 XPDL Language Features. Modular and extensible. Control relation decoupled from hardware. Syntax choice: XML, for its mature tool support; the syntactic flavor does not restrict XPDL's applicability. System software modeling. Power modeling. Microbenchmark generation for statically unknown parameter values. A standardized XPDL runtime query API, available to applications and the tool chain (e.g., libraries, compilers, runtime systems).

10 / 25 A Hello World XPDL Example. The examples only illustrate XPDL language constructs and are not meant to be complete.

(a) A typical CPU structure: two core groups of two cores each, with per-core L1 caches, a per-group L2 cache, and a shared L3 cache.

(b) The corresponding XPDL code:

<cpu name="Intel_Xeon_E5_2630L">
  <group prefix="core_group" quantity="2">
    <group prefix="core" quantity="2">
      <!-- Embedded definition -->
      <core frequency="2" frequency_unit="ghz">
        <cache name="L1" size="32" unit="KiB" />
      </core>
    </group>
    <cache name="L2" size="256" unit="KiB" />
  </group>
  <cache name="L3" size="15" unit="MiB" />
  <power_model type="power_model_e5_2630l" />
</cpu>

Modular and extensible. name: defines a meta-model and stores type information; id: defines a model and stores object information. Figure 2: Type reference and inheritance. 11 / 25
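To make the name/id distinction concrete, here is a minimal sketch (not taken from the slides) that combines constructs from the listings in this talk: a meta-model element is introduced with name, and a model element instantiates it by referring to that name through its type attribute. Any detail beyond the elements shown in the talk's own listings is an assumption.

<!-- Meta-model: introduces a CPU type by name (type information) -->
<cpu name="Intel_Xeon_E5_2630L">
  <cache name="L3" size="15" unit="MiB" />
</cpu>

<!-- Model: a concrete object with an id, referencing the meta-model via type -->
<socket>
  <cpu id="gpu_host" type="Intel_Xeon_E5_2630L" />
</socket>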

12 / 25 Control Relation Decoupled From Hardware

<Master id="0" quantity="1">
  <peppher:PEDescriptor>
    <peppher:Property fixed="true">
      <name>architecture</name>
      <value>x86</value> ...
  <peppher:Worker quantity="1" id="1">
    <peppher:PEDescriptor>
      <peppher:Property fixed="true">
        <name>architecture</name>
        <value>gpu</value> ...
  </peppher:Worker>
</Master>

Listing 1: PDL example description for an x86 core (Master) and a GPU (Worker)

<system id="liu gpu server">
  <socket>
    <cpu id="gpu host" type="Intel Xeon E5 2630L" />
  </socket>
  <device id="gpu1" type="Nvidia K20c" />
  <interconnects>
    <interconnect id="connection1" type="pcie3" head="gpu host" tail="gpu1" />
  </interconnects>
</system>

Listing 2: XPDL example description for such a GPU server

System Software Modeling

<system id="XScluster">
  <cluster>
    <group prefix="n" quantity="4">
      <node>
        <group id="cpu1">
          <socket> <cpu id="PE0" type="Intel Xeon ..." /> </socket>
        </group>
        <group prefix="main mem" quantity="4"> <memory type="DDR3 4G" /> </group>
        <device id="gpu1" type="Nvidia K20c" />
        <interconnects>
          <interconnect id="conn1" type="pcie3" head="cpu1" tail="gpu1" />
        </interconnects>
      </node>
    </group>
    <interconnects>
      <interconnect id="conn3" type="infiniband1" head="n1" tail="n2" />
    </interconnects>
  </cluster>
  <software>
    <hostos id="linux1" type="Linux ..." />
    <installed type="CUDA 6.0" path="/ext/local/cuda6.0/" />
    <installed type="CUBLAS ..." path="..." />
    <installed type="StarPU 1.0" path="..." />
  </software>
  <properties>
    <property name="ExternalPowerMeter" type="..." command="myscript.sh" />
  </properties>
</system>

Listing 3: Example of a concrete cluster machine. 13 / 25

14 / 25 Power Modeling. A power model in XPDL consists of power domains and their power state machines, and microbenchmarks with deployment information.

15 / 25 Modeling Power Domains. Power domains are sets of hardware components that must change power state together. E.g., in Movidius Myriad2, each SHAVE core forms a separate power island.

16 / 25 Modeling Power Domains

<power_domains name="Myriad1 power domains">
  <!-- this island is the main island -->
  <!-- and cannot be turned off -->
  <power_domain name="main pd" enableswitchoff="false">
    <core type="Leon" />
  </power_domain>
  <group name="Shave pds" quantity="8">
    <power_domain name="Shave pd">
      <core type="Myriad1 Shave" />
    </power_domain>
  </group>
  <!-- this island can only be turned off -->
  <!-- if all the Shave cores are switched off -->
  <power_domain name="CMX pd"
      switchoffCondition="Shave pds off">
    <memory type="CMX" />
  </power_domain>
</power_domains>

Listing 4: Example meta-model for the power domains of Movidius Myriad1

Modeling Power State Machines. Figure 3: Intel XScale processor (2000) with voltage scaling and shutdown. Source: Alexandru Andrei, PhD thesis, 2007. 17 / 25

Modeling Power State Machines. Figure 4: Intel XScale processor (2000)

<power_state_machine name="power state machine1"
    power_domain="Intel Xscale pd">
  <power_states>
    <power_state name="P1" frequency="150" frequency_unit="mhz"
        power="60" power_unit="mw" voltage="0.75" voltage_unit="v" />
    <power_state name="P2" frequency="600" frequency_unit="mhz"
        power="450" power_unit="mw" voltage="1.3" voltage_unit="v" />
    ...
  </power_states>
  <transitions>
    <transition head="P1" tail="P2" time="160" time_unit="us" />
    <transition head="P2" tail="P1" time="160" time_unit="us" />
    ...
  </transitions>
</power_state_machine>

Listing 5: Example meta-model for a power state machine of the Intel XScale processor (2000). 18 / 25

Modeling Microbenchmarks With Deployment Information

<instructions name="x86 base isa"
    mb="mb x86 base 1">
  <inst name="fmul"
      energy="?" energy_unit="pj" mb="fm1" />
  <inst name="fadd"
      energy="?" energy_unit="pj" mb="fa1" />
  <inst name="divsd">
    <data frequency="2.8"
        energy="18.625" energy_unit="nj" />
    <data frequency="2.9"
        energy="19.573" energy_unit="nj" />
    <data frequency="3.4"
        energy="21.023" energy_unit="nj" />
  </inst>
</instructions>

Listing 6: Example meta-model for instruction energy cost

<microbenchmarks id="mb x86 base 1"
    instruction_set="x86 base isa"
    path="/usr/local/micr/src"
    command="mbscript.sh">
  <microbenchmark id="fa1" type="fadd" file="fadd.c"
      cflags="-O0" lflags="..." />
  <microbenchmark id="mo1" type="mov" file="mov.c"
      cflags="-O0" lflags="..." />
</microbenchmarks>

Listing 7: An example model for instruction energy cost. 19 / 25

20 / 25 XPDL Toolchain and Microbenchmark Generation

Figure 5: XPDL tool chain diagram, showing the XPDL schema, the XPDL XML files, the XPDL parser, the intermediate representation, the microbenchmark generator and microbenchmark execution, the code generator, the generated C++ library, and the XPDL query API (C++ API) used by applications and the tool chain.

21 / 25 XPDL Runtime Query API: initialization of the XPDL run-time query library; functions for browsing the model tree; functions for looking up attributes of model elements; and model analysis functions for derived attributes, obtained by static inference (e.g., PCIe bandwidth) or aggregation (e.g., the total static power or the total number of GPUs).
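The talk lists the API categories but no concrete signatures, so the following C++ fragment is only a minimal, self-contained mock (not the actual XPDL query API): it hard-codes a tiny model tree resembling Listing 2 instead of parsing XPDL XML, and the static_power_W attribute is invented for illustration. Only the usage categories above are meant to carry over.

// Mock of the XPDL runtime query API usage pattern: initialization,
// browsing the model tree, attribute lookup, and aggregate (derived) queries.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct ModelElement {
    std::string id;                               // e.g. "gpu1"
    std::map<std::string, std::string> attrs;     // e.g. {"type", "Nvidia K20c"}
    std::vector<ModelElement> children;           // the model tree
};

// Initialization: the real library would load and parse the XPDL XML model;
// here we hard-code a small tree for illustration.
ModelElement initModel() {
    return ModelElement{"liu gpu server", {}, {
        {"gpu host", {{"type", "Intel Xeon E5 2630L"}, {"static_power_W", "10"}}, {}},
        {"gpu1",     {{"type", "Nvidia K20c"},         {"static_power_W", "25"}}, {}}
    }};
}

// Aggregate query over the model tree: total static power of all elements.
double totalStaticPower(const ModelElement& e) {
    double sum = 0.0;
    auto it = e.attrs.find("static_power_W");
    if (it != e.attrs.end()) sum += std::stod(it->second);
    for (const ModelElement& c : e.children) sum += totalStaticPower(c);
    return sum;
}

int main() {
    ModelElement sys = initModel();                                   // initialization
    for (const ModelElement& c : sys.children)                        // browse the model tree
        std::cout << c.id << " type=" << c.attrs.at("type") << "\n";  // attribute lookup
    std::cout << "total static power: " << totalStaticPower(sys) << " W\n"; // aggregate query
    return 0;
}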

Different Views on XPDL 22 / 25

Related Work and Comparisons. PDL: XML-based, models the control structure; cf. XPDL: control relation decoupled from hardware, modular, etc. hwloc: detects and represents the hardware resources visible to the machine's operating system; cf. XPDL: not limited to what the OS exposes. HPP-DL: developed in the EU FP7 REPARA project to support static and dynamic scheduling of software kernels onto heterogeneous platforms, with JSON syntax; cf. XPDL: adds modeling of power states, dynamic energy costs, system software, distributed specifications, runtime model access, and automatic microbenchmarking. 23 / 25

24 / 25 Summary of Contributions. We propose XPDL, a portable and extensible platform description language: modular and extensible; software roles decoupled from hardware structure; system software modeled; microbenchmark generation. An open-source tool chain is on the way, soon at http://www.ida.liu.se/labs/pelab/xpdl/

25 / 25 Future work. Finish the XPDL toolchain. Demonstrate applicability for adaptive energy optimization. Generate a runtime library that exposes an API for dynamic hardware resource management. Demonstrate applicability for retargeting static optimizers and source-code generators. Write automatic converters from vendor-specific ADLs (e.g., hwloc) to XPDL.

25 / 25 Bibliography
[1] M. Sandrieser, S. Benkner, and S. Pllana, "Using explicit platform descriptions to support programming of heterogeneous many-core systems," Parallel Computing, vol. 38, no. 1-2, pp. 52-65, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167819111001396
[2] U. Dastgeer and C. W. Kessler, "Conditional component composition for GPU-based systems," in Proc. MULTIPROG-2014 Workshop at HiPEAC-2014 Conference, Vienna, Austria, Jan. 2014.