Heterogeneous Computing with a Fused CPU+GPU Device


MediaTek White Paper, January 2015
© 2015 MediaTek Inc.

1 Introduction

Heterogeneous computing technology, based on OpenCL (Open Computing Language), intelligently fuses GPU and CPU computing capability for high-performance, high-efficiency computation. In a mobile SoC environment, heterogeneous computing exploits the different on-chip computing resources, either by executing a program on both the CPU and GPU in parallel, or by dispatching the program to the most suitable device. Recent findings suggest that heterogeneous computing is more efficient than homogeneous computing because different types of computing resources suit different workloads: the CPU is generally good at control-intensive workloads, while the GPU excels at compute-intensive ones. It is therefore critical to exploit heterogeneous computing resources intelligently to improve efficiency and performance.

However, it is challenging for programmers to write programs that utilize heterogeneous devices effectively. Anticipating the computing capacity of each device and the affinity of a program (i.e. its preference for a device) is not straightforward, and an incorrect estimate degrades performance by dispatching the program's jobs to the wrong device(s).

To address these problems, MediaTek has introduced Device Fusion, a heterogeneous computing platform that efficiently executes OpenCL programs by fusing GPU and CPU computing capability. As shown in Figure 1 below, Device Fusion can flexibly dispatch each kernel (functional part) of an OpenCL program to the most suitable device. Moreover, for throughput-oriented programs, Device Fusion provides an infrastructure that automatically maintains parallel execution on the GPU and CPU.
By using Device Fusion, programmers can focus on algorithm development and obtain performance improvements without being distracted by platform issues. Device Fusion dispatches an OpenCL program to the most suitable device, or to both devices for parallel execution.
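As a rough illustration of these two dispatch options, the toy model below chooses, per kernel, between a whole-kernel dispatch and a parallel split. The decision fields ("data_parallel", "affinity") and the device throughputs are invented for illustration; MediaTek has not published Device Fusion's internal algorithm.

```python
# Toy model of the two dispatch options shown in Figure 1. All names and
# numbers here are hypothetical; Device Fusion's real logic is internal.

def dispatch(kernel, gpu_throughput, cpu_throughput):
    """Return either a parallel split or a whole-kernel dispatch target."""
    if kernel.get("data_parallel", False):
        # Parallel execution: give each device a share of the work-items
        # proportional to its throughput.
        gpu_share = gpu_throughput / (gpu_throughput + cpu_throughput)
        return ("parallel", gpu_share)
    # Otherwise, send the whole kernel to its preferred device.
    return ("single", kernel.get("affinity", "gpu"))

print(dispatch({"data_parallel": True}, 72, 28))             # ('parallel', 0.72)
print(dispatch({"data_parallel": False, "affinity": "cpu"}, 72, 28))
# ('single', 'cpu')
```

The split-by-throughput rule is the simplest proportional policy; the later sections of this paper show why estimating those throughputs is the hard part.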

Figure 1. Dispatch Options in Device Fusion

Two real-world case studies (large-size matrix multiplication and SuperResolution) and one popular OpenCL benchmark, CompuBenchCL, are used here to show the high performance and efficiency of Device Fusion. Under CompuBenchCL, Device Fusion outperforms the GPU-alone and CPU-alone solutions by 20% and 151% in geometric mean, respectively. For large-size matrix multiplication, Device Fusion provides up to 50% and 319% better performance than the GPU and CPU, respectively. Finally, the SuperResolution results improve by 55% and 59% when compared to the GPU and CPU.

Heterogeneous Computing

Enabled by advances in SoC manufacturing capability, various heterogeneous computing resources are now available on a single chip. Heterogeneous computing refers to computation using computing resources with dissimilar instruction set architectures (ISAs). For example, a typical smartphone today may be equipped with a CPU of more than 4 cores and a GPU of more than 2 cores. Furthermore, the number of CPU and GPU cores is expected to continue growing, and more heterogeneous computing resources (e.g. DSPs) will be incorporated into a single chip.

OpenCL

OpenCL, a popular open standardized computing platform for heterogeneous computing, is designed to serve as the common high-level language for exploiting heterogeneous computing resources. An OpenCL program can be executed on every device that supports

the OpenCL standard APIs. For portability, an OpenCL kernel function is Just-In-Time (JIT) compiled into the corresponding ISA for execution after it has been dispatched to an OpenCL-supported device.

2 The Problem

Although OpenCL programs can be executed on any OpenCL-supported device, performance may not meet the programmer's expectations, for the following reasons.

First, the program affinity (i.e. preference for a device) of a specific program is not easy to predict. Affinity can be affected by algorithm design, implementation, or the architecture of the target device, and a program running on its un-preferred device performs worse than on its preferred one. For example, two implementations of the same object-detection algorithm, violajones_naive and violajones_lws32, show different affinities, leading to a large performance difference when executed on the CPU and GPU, as shown in Figure 2. Executing violajones_lws32 (which prefers the GPU) on the CPU results in severe performance degradation compared to the GPU, while violajones_naive shows the opposite result: the CPU performs better. The figure shows the execution time of each of the two OpenCL kernels on the CPU and GPU; different implementations of the same algorithm lead to dramatic performance differences.

Figure 2. Performance Differences on CPU and GPU

Second, the computing capability of the target devices, where the programs will be executed, may not be easy for programmers to anticipate. Even if the

programmers target a specific CPU and GPU, their computing capability still cannot be anticipated in a straightforward way: in a mobile environment, the computing capability of the CPU and GPU is often reduced by the power budget or by thermal constraints. Once a program is dispatched to a device whose computing capability is lower than anticipated, the expected performance cannot be achieved. This condition worsens for throughput-oriented programs executed in parallel on both the CPU and GPU. For example, the performance of parallel execution of the program Juliaset is shown in Figure 3 below. The y-axis indicates the performance of parallel execution; the x-axis shows the dispatch ratio n, where a fraction n of the jobs is dispatched to the GPU and the remaining 1-n to the CPU. As seen in the figure, maximum performance occurs at a ratio of 0.56, i.e. when the GPU and CPU handle 56% and 44% of the jobs, respectively. With an incorrect estimate of CPU/GPU computing capability (e.g. a ratio below 0.5), the performance of parallel execution can be lower than that of dispatching to the GPU alone.

Figure 3. Performance of Parallel Execution of Juliaset with Different Dispatch Ratios

The final reason is the additional overhead of parallel execution. For programs meant to execute in parallel, enabling data sharing and synchronization incurs extra overhead, such as cache synchronization, and these overheads are typically hardware-dependent. When they outweigh the benefits of parallel execution, performance falls below expectations. Unfortunately, it is not easy for programmers to weigh these overheads and judge whether parallel execution is beneficial.
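The dispatch-ratio trade-off in Figure 3 can be captured by a simple linear model: if the GPU and CPU complete work at rates r_g and r_c, a split n finishes in max(n/r_g, (1-n)/r_c), which is minimized when both devices finish together. The sketch below uses hypothetical rates chosen so the optimum lands at n = 0.56, matching the Juliaset observation; real rates vary with power and thermal state, which is exactly why the ratio is hard to predict by hand.

```python
# Simple linear model of the dispatch-ratio trade-off. The rates below are
# hypothetical, picked only to reproduce the 56% / 44% split in Figure 3.

def makespan(n, gpu_rate, cpu_rate):
    """Time to finish one unit of work split n : (1 - n) across GPU and CPU."""
    return max(n / gpu_rate, (1 - n) / cpu_rate)

def optimal_split(gpu_rate, cpu_rate):
    """Both devices finish together when n / gpu_rate == (1 - n) / cpu_rate."""
    return gpu_rate / (gpu_rate + cpu_rate)

n = optimal_split(gpu_rate=56, cpu_rate=44)
print(n)                                              # 0.56
# A badly mispredicted split can be slower than running on the GPU alone:
print(makespan(0.2, 56, 44) > makespan(1.0, 56, 44))  # True
```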

3 Solution

Device Fusion is presented as a virtual OpenCL device. The virtual device, emulated on top of the CPU and GPU devices [1], is compliant with the standard OpenCL API, so existing OpenCL applications can take advantage of Device Fusion without any modification. An overview of Device Fusion is given in Figure 4 below.

Figure 4. Overview of Device Fusion

The virtual device includes three modules: the kernel-analysis module, the parallel-execution-infrastructure module, and the dispatch-strategy module. When an OpenCL program is assigned to the virtual device, each kernel of the program is first analyzed to collect the necessary information, such as whether the kernel is worth executing in parallel. If parallel execution is allowed, the parallel-execution-infrastructure module is invoked to prepare the infrastructure, such as data sharing and synchronization. Finally, the dispatch-strategy module determines, according to its internal load-balancing algorithm, how many jobs (i.e. work-items of the OpenCL program) of the kernel go to the CPU and to the GPU. If parallel execution is not available, the virtual device instead dispatches the entire kernel to the most suitable device, guided by the dispatch-strategy module.

[1] Note that the standalone OpenCL CPU and GPU devices remain available.

Experiment Results

Two real-world case studies and one popular OpenCL benchmark, CompuBenchCL, are used here to show the high performance and efficiency of Device Fusion. The two real-world case studies are large-size matrix multiplication and SuperResolution, an image-enhancement algorithm.

Case Study: Matrix Multiplication

Matrix multiplication is commonly used in implementations of many modern algorithms, such as object recognition and image processing. The execution times of a matrix-multiplication implementation on the GPU, the CPU, and Device Fusion are shown in Figure 5 below. Device Fusion outperforms the GPU and CPU by 50% and 319%, respectively, as a result of parallel execution: 72% and 28% of the work-items, as determined by the internal algorithm of the dispatch-strategy module, are dispatched to the GPU and CPU, respectively.

Figure 5. Performance of Matrix Multiplication under GPU, CPU and Device Fusion

Case Study: SuperResolution

SuperResolution is an image-processing algorithm that enhances image resolution and extracts significant detail from the original image, which requires substantial computation. The SuperResolution process is divided into several stages: find_neighbor, col_upsample, row_upsample, sub and blur2, each implemented as an OpenCL kernel. In this case study, images at three resolutions (1MPixel, 2MPixel and 4MPixel) are enlarged by 2.25x using the GPU, the CPU, and Device Fusion, respectively. The execution time of each kernel is shown in Figure 6 below. Device Fusion outperforms both the GPU and CPU at all three resolutions, by up to 55% and 59%, respectively. The improvement comes from parallel execution of the major kernel, find_neighbor, and from dispatching the other kernels to the most suitable device. This case study highlights a strength of Device Fusion: it can intelligently determine the best dispatch strategy for real-world algorithms, achieving high efficiency and performance.
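As a toy illustration of the work-item partitioning behind these parallel-execution results, the sketch below splits the rows of a matrix multiplication between two "devices" in the 72:28 ratio reported for the matrix-multiplication case study, then merges the partial results. This is plain Python standing in for an OpenCL NDRange split; the function names are invented, and in Device Fusion the ratio is chosen automatically at run time rather than hard-coded.

```python
# Pure-Python stand-in for splitting a matrix-multiplication NDRange across
# two devices. The 72% / 28% GPU/CPU split from Figure 5 is hard-coded here
# purely for illustration.

def matmul_rows(A, B, row_lo, row_hi):
    """Compute rows [row_lo, row_hi) of A @ B - one 'device' worth of work."""
    cols = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(cols)]
            for i in range(row_lo, row_hi)]

def fused_matmul(A, B, gpu_share=0.72):
    split = int(len(A) * gpu_share)
    gpu_part = matmul_rows(A, B, 0, split)       # dispatched to the GPU
    cpu_part = matmul_rows(A, B, split, len(A))  # dispatched to the CPU
    return gpu_part + cpu_part                   # results merged in row order

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0], [0, 1]]
print(fused_matmul(A, B))   # [[1, 2], [3, 4], [5, 6], [7, 8]]
```

Row-wise partitioning works here because each output row depends only on A's corresponding row and all of B, so the two partitions need no synchronization until the final merge.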

Figure 6. Execution Time Breakdown of Each Stage in SuperResolution under GPU, CPU and Device Fusion

CompuBenchCL

CompuBenchCL, a popular OpenCL benchmark, provides multiple comprehensive test items that measure the computing capability of an OpenCL device, covering physics simulation, graphics computing, image processing and computer vision. The CompuBenchCL performance results are shown in Figure 7 below. The results demonstrate that Device Fusion runs each test item efficiently, outperforming the GPU and CPU by 20% and 151% in geometric mean, respectively. As with SuperResolution, the improvement is the result of dispatching work to the correct device(s). This indicates that Device Fusion can be used to improve the performance of a wide range of OpenCL workloads.

Figure 7. CompuBenchCL Performance Results