CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing. Advanced Seminar "Computer Engineering", Winter Term 2015/16. Steffen Lammel

Content
- Introduction: Motivation; Characteristics of CPUs and GPUs
- Heterogeneous Computing Systems and Techniques: Workload division; Frameworks and tools; Fused HCS
- Energy Saving with HCT: Intelligent workload division; Dynamic Voltage/Frequency Scaling (DVFS)
- Conclusion: Programming aspects; Energy aspects

Introduction

Introduction
Grand goal in HPC: exascale systems by the year ~2020 (source: [2]).
Problems:
- Computational power per watt: now up to ~7 GF/W; exascale requires >= 50 GF/W
- Power budget: ~20 MW
- Heat dissipation
For comparison, the current #1 on the TOP500 delivers ~33 PF at ~1.9 GF/W (sources: [1], [2]).
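For reference, the 50 GF/W target follows directly from dividing an exaflop by the assumed power budget:

\[
\frac{10^{18}\ \mathrm{FLOP/s}}{20\ \mathrm{MW}} = \frac{10^{18}\ \mathrm{FLOP/s}}{2\times 10^{7}\ \mathrm{W}} = 5\times 10^{10}\ \mathrm{FLOP/s\ per\ W} = 50\ \mathrm{GF/W}
\]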

Introduction
CPU: few cores (<= 20), high frequency (~3 GHz), large caches, plenty of (comparatively slow) memory (<= 1 TB); latency oriented.
GPU: many cores (> 1000), lower frequency (<= 1 GHz), fast memory that is limited in size (<= 12 GB); throughput oriented.

Introduction
Ways to increase energy efficiency:
- Get the most computational power out of both domains
- Utilize the sophisticated power-saving techniques that modern CPUs/GPUs offer
Terminology:
- HCS: Heterogeneous Computing System (hardware)
- HCT: Heterogeneous Computing Technique (software)
- PU: Processing Unit (either a CPU or a GPU)
- FLOPS: Floating-Point Operations per Second
- DP: Double Precision
- SP: Single Precision
- BLAS: Basic Linear Algebra Subprograms
- SIMD: Single Instruction, Multiple Data

Heterogeneous Computing Techniques (HCT): Runtime Level

HCT - Basics
Worst case: only one PU is active at a time.
Ideal case: all PUs do (useful) work simultaneously.
(Diagram: CPU/GPU activity timelines for the worst and the ideal case.)

HCT - Basics
These examples are idealized; real-world applications consist of several different patterns.
Typical processing units (PUs) in an HCS: tens of CPU cores/threads and several thousand GPU cores/kernels.
Goal of HCT: all PUs have to be utilized, in a useful way.

HCT Workload Division
Basic idea: divide the whole problem into smaller chunks and assign each sub-task to a PU (compare the PCAM methodology: Partition, Communicate, Agglomerate, Map [5]).
(Diagram: a problem decomposed into sub-tasks 0..n and mapped to PUs.)

HCT Workload Division (naive)
Example: a dual-core CPU plus a GPU with naive data distribution.
- CPU core 0: master/arbiter
- CPU core 1: worker
- GPU: worker
Result: huge idle periods for the GPU and for CPU core 0.
(Diagram: utilization timeline for core 0, core 1 and the GPU.)

HCT Workload Division (relative PU performance)
Approach: use the relative performance of each PU as the metric.
Example: a microbenchmark or performance model deems the GPU 3x faster than the CPU, so the work is partitioned in a 3:1 ratio between the PUs, as in the sketch below.
Task granularity and the quality/nature of the microbenchmark are the key factors here.
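A minimal sketch of this idea; the index-range split and the hypothetical gpu_speedup value (coming from a microbenchmark or model) are illustrative assumptions, not from the slides:

#include <cstddef>

// Split an index range [0, n) according to a measured relative speed.
// If the GPU is e.g. 3x faster, it receives 3/(3+1) = 75% of the elements.
void split_work(std::size_t n, double gpu_speedup,
                std::size_t& gpu_items, std::size_t& cpu_items) {
    double gpu_fraction = gpu_speedup / (gpu_speedup + 1.0);
    gpu_items = static_cast<std::size_t>(n * gpu_fraction);
    cpu_items = n - gpu_items;
}

// Usage: with gpu_speedup = 3.0 and n = 1000, the GPU gets 750 items and the
// CPU 250, i.e. the 3:1 ratio from the slide.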

HCT Workload Division (characteristics of sub-tasks)
Idea: use the nature of the sub-tasks to leverage performance. There are CPU-affine tasks, GPU-affine tasks, and tasks which run roughly equally well on all PUs.
(Diagram: a problem decomposed into sub-tasks 0..n with different affinities.)

HCT Workload Division (nature of sub-tasks)
Map each task to the PU it performs best on: latency-bound tasks to the CPU, throughput-bound tasks to the GPU (a mapping sketch is given after this list).
Further scheduling metrics:
- Capability (of the PU)
- Locality (of the data)
- Criticality (of the task)
- Availability (of the PU)
(Diagram: tasks mapped onto CPU0, CPU1 and the GPU.)
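A minimal sketch of such a mapping rule, assuming hypothetical Task and PuState descriptors; the exact criteria and their weighting are application-specific:

enum class Kind   { LatencyBound, ThroughputBound };
enum class Target { Cpu, Gpu };

struct Task    { Kind kind; bool data_on_gpu; };   // hypothetical task descriptor
struct PuState { bool cpu_free; bool gpu_free; };  // hypothetical availability info

Target map_task(const Task& t, const PuState& s) {
    // Locality first: if the data already lives in GPU memory and the GPU is
    // free, moving it back may cost more than any affinity gain.
    if (t.data_on_gpu && s.gpu_free) return Target::Gpu;
    // Otherwise prefer the PU the task is naturally suited for (capability)...
    Target preferred = (t.kind == Kind::ThroughputBound) ? Target::Gpu : Target::Cpu;
    // ...but respect availability and fall back to the other PU if needed.
    if (preferred == Target::Gpu && !s.gpu_free && s.cpu_free) return Target::Cpu;
    if (preferred == Target::Cpu && !s.cpu_free && s.gpu_free) return Target::Gpu;
    return preferred;
}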

HCT Workload Division (pipeline)
If overlap is possible, build a pipeline: call kernels asynchronously to hide latency (see the CUDA-stream sketch below).
- Small penalty to fill and drain the pipeline
- Good utilization of all PUs while the pipeline is full
(Diagram: stages of tasks A, B and C flowing through PU0, PU1, ..., PUn.)
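As one concrete way to realize the asynchronous calls on the GPU side, a minimal sketch using CUDA streams; stage_kernel, the buffer layout and the two-stream double buffering are illustrative assumptions (host memory would need to be pinned for true copy/compute overlap):

#include <cuda_runtime.h>

// Placeholder stage: each element is processed once it has arrived on the GPU.
__global__ void stage_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Process "chunks" equally sized pieces of the array, double-buffered over two
// streams so the transfer of one chunk overlaps the compute of the previous one.
void run_pipeline(float* host, float* dev, int n, int chunks) {
    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    int chunk = n / chunks;                     // assume n is divisible by chunks
    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = streams[c % 2];
        int off = c * chunk;
        cudaMemcpyAsync(dev + off, host + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);              // fill stage
        stage_kernel<<<(chunk + 255) / 256, 256, 0, st>>>(dev + off, chunk);
        cudaMemcpyAsync(host + off, dev + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);              // drain stage
    }
    cudaDeviceSynchronize();                    // wait for the pipeline to drain
    for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
}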

HCT Workload Division (summary)
Summary: metrics for workload division
- Performance of the PUs
- Nature of the sub-tasks
- Order: regular patterns --> GPU (little communication); irregular patterns --> CPU (lots of communication)
- Memory footprint: fits into VRAM? --> GPU; too big? --> CPU
- Historical data: how well did each PU perform in the previous step?
- Availability of the PU: is there a function/kernel for the desired PU? Is the PU able to take a task (scheduling-wise)?
- BLAS level: BLAS-1/2 --> CPU (vector-vector and vector-matrix operations); BLAS-3 --> GPU (matrix-matrix operations)

Heterogeneous Computing Techniques (HCT): Frameworks and tools

HCT Framework Support
Implementing these techniques by hand is tedious and error-prone; better to let a framework do the job.
- Frameworks for load balancing: compile-time level (static scheduling) or runtime level (dynamic scheduling)
- Frameworks for parallel abstraction: write the algorithm as a sequential program and let the tools figure out how to utilize the PUs optimally; source-code annotations give the runtime/compiler hints about which approach is best (compare OpenMP's #pragma omp directives); scheduling may be dynamic or static
The partitioning and work-division principles shown before apply here as well.

HCT Framework Support
- PU-specific tools and frameworks: CUDA + libraries (Nvidia GPU); OpenMP, Pthreads (CPU)
- Generic heterogeneous-aware frameworks: OpenCL, OpenACC, OpenMP (offloading, since v4.0), CUDA (CPU callback); see the offloading example below
- Custom frameworks (interesting examples): compile-time level (static scheduling approach): SnuCL; runtime level (dynamic scheduling approach): PLASMA
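To make the annotation style concrete, a minimal sketch of OpenMP 4.x target offloading; the scale function and its data layout are illustrative assumptions, and with no accelerator present the region falls back to the host:

#include <vector>
#include <cstddef>

// Offload a simple loop to an accelerator via an OpenMP "target" directive.
void scale(std::vector<float>& x, float a) {
    float* p = x.data();
    std::size_t n = x.size();
    // map(tofrom: ...) copies the array section to the device and back.
    #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
    for (std::size_t i = 0; i < n; ++i)
        p[i] *= a;
}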

HCT Framework Support (Example: SnuCL)
SnuCL creates a virtual node containing all the PUs of a cluster and uses a message-passing interface (MPI) to distribute workloads to the remote PUs; inter-node communication is implicit. (Source: [4])

HCT Framework Support (Example: PLASMA)
PLASMA uses an intermediate representation (IR) that is independent of the PU; PU-specific implementations are generated from the IR and utilize each PU's SIMD capabilities. A runtime decides the assignment to a PU dynamically. The achievable speed-up depends on the workload.
(Diagram: the original code is translated to IR, then to CPU and GPU code, which the runtime assigns to the PUs.)

Heterogeneous Computing Techniques (HCT): Fused HCS

HCT - Fused HCS
CPU and GPU share the same die and the same address space, so communication paths are significantly shorter. Examples:
- AMD Fusion APU: x86 + OpenCL
- Intel Sandy Bridge and successors: x86 + OpenCL
- Nvidia Tegra: ARM + CUDA

Energy saving with HCS

Energy saving with HCS
There is a trade-off between performance and energy consumption. Modern PUs ship with extensive power-saving features, e.g. power regions and clock gating.
- HPC uses less aggressive energy saving, to avoid state-transition penalties.
- Mobile/embedded uses aggressive energy saving, because battery life is everything in that domain.

Energy saving with HCS (hardware: DVFS)
Dynamic Voltage/Frequency Scaling (DVFS): dynamic power follows P = C * V² * f, with V roughly proportional to f and C constant. Reducing f by 20% therefore cuts P by about 50%. The question is how far f and V can be lowered while still meeting the timing constraints.
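A short worked form of that claim, under the slide's assumption that voltage scales roughly with frequency:

\[
P = C\,V^{2}\,f,\qquad V \propto f \;\Rightarrow\; P \propto f^{3},\qquad \frac{P_{\mathrm{new}}}{P_{\mathrm{old}}} = 0.8^{3} \approx 0.51
\]

so a 20% lower frequency gives roughly half the dynamic power.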

Energy saving with HCS (software: intelligent work distribution)
Intelligent workload partitioning: build a power model of the tasks and PUs, assign tasks with respect to that model, and take the communication overhead into account.
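A minimal sketch of such power-model-driven placement: estimate the energy of running a task on each PU (compute energy plus the cost of moving non-resident data) and pick the cheaper option. All structs, fields and the simple cost model are illustrative assumptions, not taken from the slides:

struct PuModel {
    double watts_busy;          // average power while computing
    double flops_per_second;    // sustained throughput for this task class
    double joules_per_byte;     // energy cost of moving data to this PU
};

struct TaskModel {
    double flops;               // estimated amount of work
    double bytes_to_move;       // data that is not yet resident on the PU
};

double estimated_energy(const TaskModel& t, const PuModel& pu) {
    double compute_seconds = t.flops / pu.flops_per_second;
    return compute_seconds * pu.watts_busy        // compute energy
         + t.bytes_to_move * pu.joules_per_byte;  // communication overhead
}

// Place the task on the GPU only if that is expected to cost less energy.
bool run_on_gpu(const TaskModel& t, const PuModel& cpu, const PuModel& gpu) {
    return estimated_energy(t, gpu) < estimated_energy(t, cpu);
}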

Conclusion

Conclusion (programming aspects)
Trade-offs: usability vs. performance, portability vs. performance.
- High performance requires high programming effort (raw CUDA/OpenMP).
- Code-abstracting frameworks (OpenCL, OpenACC, custom solutions) can support developers to a certain degree.
- Dynamic scheduling frameworks can accelerate an application in complicated cases with many PUs.
(Diagram: trade-off triangle between portability, ease of programming and performance.)

Conclusion (energy aspects)
HCS contribute to a better performance/watt ratio. Intelligent workload partitioning is equally important for performance and for energy saving. DVFS is a key technique for energy saving. Fused HCS can fill niches where communication is a key factor.

Thank you for your attention! Questions?

References
Papers:
- A Survey of CPU-GPU Heterogeneous Computing Techniques (Link)
- SnuCL: an OpenCL Framework for Heterogeneous CPU/GPU Clusters (Link)
- PLASMA: Portable Programming for SIMD Heterogeneous Accelerators (Link)
Images:
[1] http://www.hpcwire.com/2015/08/04/japan-takes-top-three-spots-on-green500-list/
[2] http://www.top500.org/lists/2015/11/
[3] https://en.wikipedia.org/wiki/performance_per_watt#flops_per_watt
[4] http://snucl.snu.ac.kr/features.html
[5] http://www.mcs.anl.gov/~itf/dbpp/text/node15.html