
Position Paper: OpenMP scheduling on ARM big.LITTLE architecture

Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert

LIRMM (CNRS and University of Montpellier), Montpellier, France
firstname.lastname@lirmm.fr

Abstract. Single-ISA heterogeneous multicore systems are emerging as a promising direction to achieve a more suitable balance between performance and energy consumption. However, proper utilization of these architectures is essential to reach the potential energy benefits. In this paper, we demonstrate the ineffectiveness of popular OpenMP scheduling policies when executing the Rodinia benchmark suite on the Exynos 5 Octa (5422) SoC, which integrates the ARM big.LITTLE architecture.

1 Introduction

Traditional CPUs consume too much power, and new solutions are needed to scale up to the ever-growing demand for computation. Accordingly, major efforts are focusing on achieving a more holistic balance between performance and energy consumption. In this context, heterogeneous multicore architectures are firmly established as the main gateway to higher energy efficiency. Particularly interesting is the concept of single-ISA heterogeneous multicore systems [1], an attempt to include heterogeneity at the microarchitectural level while preserving a common abstraction to the software stack. In single-ISA heterogeneous multicore systems, all cores execute the same machine code and thus any core can execute any part of the code. Such a model makes it possible to execute the same OS kernel binary implemented for symmetric Chip Multi-Processors (CMPs) with only minimal configuration changes.

To take advantage of single-ISA heterogeneous multicore architectures, we need an appropriate strategy to manage the distribution of computation tasks, also known as efficient thread scheduling in multithreaded programming models. OpenMP [2] is a popular programming model that provides a shared-memory parallel programming interface. It features a thread-based fork-join task allocation model and various loop scheduling policies that determine the way in which iterations of a parallel loop are assigned to threads.

This paper measures the impact of different loop scheduling policies on a real, state-of-the-art single-ISA heterogeneous multicore system. We use the Exynos 5 Octa (5422) System-on-Chip (SoC) [3] integrating the ARM big.LITTLE architecture [4], which couples relatively slow, low-power processor cores (LITTLE) with relatively more powerful but power-hungry ones (big). We provide insightful performance and energy consumption results on the Rodinia OpenMP benchmark suite [5] and demonstrate the ineffectiveness of typical loop scheduling policies in the context of single-ISA heterogeneous multicore architectures.
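As a minimal illustration of the fork-join model and loop work-sharing just described, consider the following C sketch; it is not taken from the experimental setup of this paper and assumes only a standard OpenMP toolchain (e.g., gcc -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    #define N 1024

    int main(void)
    {
        static double a[N];

        /* Fork: the master thread spawns a team of worker threads and
         * the loop iterations are divided among them. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        /* Join: the implicit barrier at the end of the work-sharing
         * construct synchronizes the team before execution continues. */
        printf("a[%d] = %.1f, up to %d threads\n",
               N - 1, a[N - 1], omp_get_max_threads());
        return 0;
    }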

2 The Exynos 5 Octa (5422) SoC

2.1 Platform Description

We run our experiments on the Odroid-XU3 board, which contains the Exynos 5 Octa (5422) SoC implementing the ARM big.LITTLE architecture. ARM big.LITTLE technology features two sets of cores: a low-performance, energy-efficient cluster called LITTLE and a power-hungry, high-performance cluster called big. The Exynos 5 Octa (5422) SoC architecture and its main parameters are presented in Figure 1. It contains: (1) a cluster of four out-of-order superscalar Cortex-A15 cores with 32 kB private caches and a 2 MB L2 cache, and (2) a cluster of four in-order Cortex-A7 cores with 32 kB private caches and a 512 kB L2 cache. Each cluster operates at an independent frequency, ranging from 200 MHz up to 1.4 GHz for the LITTLE cluster and up to 2 GHz for the big one. The SoC contains 2 GB of LPDDR3 RAM running at 933 MHz, which achieves a 14.9 GB/s memory bandwidth over its 2x32-bit bus (933 MHz x 2 transfers/cycle x 64 bit / 8 ≈ 14.9 GB/s). The L2 caches are connected to the main memory via the 64-bit Cache Coherent Interconnect (CCI) 400 [6].

Fig. 1: Exynos 5 Octa (5422) SoC.

2.2 Software Support

Execution models. ARM big.LITTLE processors support three main software execution models [4]. The first and simplest model, called cluster migration, keeps a single cluster active at a time, and migration is triggered on a given workload threshold. The second model, named CPU migration, relies on pairing every big core with a LITTLE core. Each pair of cores acts as a virtual core in which only one of the two actual cores is powered up and running at a time, so at most four physical cores are active. The main difference between the cluster migration and CPU migration models is that the four cores running at a time are identical in the former, while they can differ in the latter. The heterogeneous multiprocessing (HMP) mode, also known as Global Task Scheduling (GTS), allows using all of the cores simultaneously. HMP clearly provides the highest flexibility and is consequently the most promising mode for achieving the best performance/energy trade-offs.
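Under HMP, all eight cores are visible to the operating system, so an application can also confine itself to one cluster through CPU affinity. The following C sketch pins every OpenMP thread to the big cluster via sched_setaffinity; the CPU numbering (A7 cores exposed as CPUs 0-3, A15 cores as CPUs 4-7) is an assumption that matches common Odroid-XU3 kernels and should be verified on the target system:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            for (int cpu = 4; cpu <= 7; cpu++)   /* assumed A15 CPU ids */
                CPU_SET(cpu, &set);
            /* pid 0 applies the mask to the calling thread only */
            sched_setaffinity(0, sizeof(set), &set);

            #pragma omp single
            printf("%d threads pinned to CPUs 4-7\n", omp_get_num_threads());
        }
        return 0;
    }

With GNU OpenMP, setting the environment variable GOMP_CPU_AFFINITY="4-7" achieves a similar pinning without code changes.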

Benchmarks. We consider the Rodinia benchmark suite for heterogeneous computing [5]. It is composed of applications and kernels of different nature in terms of workload, from domains such as bioinformatics, image processing, data mining, medical imaging and physics simulation. It also includes classical algorithms like LU decomposition and graph traversal. In our experiments, the OpenMP implementations are configured with 4 or 8 threads, depending on the number of cores that are visible to the thread scheduling algorithm. Due to space constraints, we selected the following subset of benchmarks: backprop, bfs, heartwall, hotspot, kmeans openmp/serial, lud, nn, nw and srad v1/v2.

Thread scheduling algorithms. OpenMP provides three loop scheduling algorithms, which determine the way in which iterations of a parallel loop are assigned to threads (illustrated in the sketch below). The static scheduling is the default loop scheduling algorithm; it divides the loop into equal or almost equal chunks. It incurs the lowest overhead but, as we will show in the results, the potential load imbalance can cause significant synchronization overheads. The dynamic scheduling assigns chunks at runtime as threads complete their previously assigned iterations, using an internal work queue of chunk-sized blocks. By default the chunk size is 1, and it can be explicitly specified by the programmer in the source code. Finally, the guided scheduling is similar to dynamic scheduling, but the chunk size decreases exponentially from #iterations/#threads down to 1 by default, or down to a value explicitly specified by the programmer. In the next section, we consider these three loop scheduling policies with the default chunk size. The experiments are run with the following software configuration: Ubuntu 14.04 with the Linux 3.10 LTS kernel, the GCC 4.8.2 compiler and the OpenMP 3.1 runtime.
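For concreteness, the three policies map onto the schedule clause as in the following toy C sketch; the deliberately imbalanced work() function is a hypothetical stand-in for an irregular kernel and is not taken from Rodinia:

    #include <stdio.h>
    #include <omp.h>

    #define N 64

    /* Iteration cost grows with i, so static equal-sized chunks
     * produce the kind of load imbalance discussed in the text. */
    static double work(int i)
    {
        double x = 0.0;
        for (int k = 0; k < (i + 1) * 1000; k++)
            x += k * 1e-9;
        return x;
    }

    int main(void)
    {
        double s1 = 0.0, s2 = 0.0, s3 = 0.0;

        /* static: iterations split into ~N/#threads chunks up front */
        #pragma omp parallel for schedule(static) reduction(+:s1)
        for (int i = 0; i < N; i++) s1 += work(i);

        /* dynamic: chunks (default size 1) handed out from a work queue */
        #pragma omp parallel for schedule(dynamic) reduction(+:s2)
        for (int i = 0; i < N; i++) s2 += work(i);

        /* guided: chunk size decays from ~N/#threads toward 1 */
        #pragma omp parallel for schedule(guided) reduction(+:s3)
        for (int i = 0; i < N; i++) s3 += work(i);

        printf("checksums: %f %f %f\n", s1, s2, s3);
        return 0;
    }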

3 Experimental Results

In this section, we present a detailed analysis of the OpenMP implementations of the Rodinia benchmark suite running on the ARM big.LITTLE architecture. We consider the following configurations: the LITTLE (A7) cluster alone running at 200 MHz, 800 MHz and 1.4 GHz; the big (A15) cluster alone running at 200 MHz, 800 MHz and 2 GHz; and the A7/A15 clusters together running at 200/200 MHz, 800/800 MHz, 1.4/2 GHz, 200 MHz/2 GHz and 1.4 GHz/200 MHz.

Static Thread Scheduling. Figure 2(a) shows, in logarithmic scale, the measured execution time of the different configurations using the static scheduling algorithm. The results are normalized with respect to the slowest configuration, i.e., the A7 cluster running at 200 MHz. As expected, the highest performance is typically achieved by the A15 cluster running at 2 GHz; for example, a speedup of 21x is observed when running kmeans openmp on the big cluster. When using the HMP mode to run simultaneously on the big and LITTLE clusters (i.e., A7/A15 in the figure), the execution time is usually longer than that of the big cluster alone, despite using four additional active cores. An even higher penalty is observed when operating the LITTLE cluster at a lower frequency, especially for the lud, nn and nw applications.

Fig. 2: Normalized speedup using static scheduling (reference: A7 at 200 MHz). (a) Execution time speedup comparison; (b) EtoS comparison.

Figure 2(b) shows the normalized Energy to Solution (EtoS) measured with the on-board power monitors of the Odroid-XU3 board. Results are again normalized against the reference A7 cluster running at 200 MHz. We observe that the A7 cluster is generally more energy-efficient than the A15 cluster, and that the best energy efficiency is achieved when operating at 800 MHz. We also observe that for a few applications (i.e., bfs, kmeans serial and srad v1) the A15 cluster running at 800 MHz provides a slightly better EtoS than the reference A7 cluster. These applications benefit the most from the A15 out-of-order architecture, achieving the largest speedups. This leads to a higher energy efficiency despite running on cores with a higher power consumption.
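For reference, EtoS figures of this kind can be obtained by periodically sampling the board's power monitors and integrating over the run. The C sketch below shows the idea; the sysfs path is an assumption (the Odroid-XU3 exposes its INA231 sensors through sysfs, but the node name depends on the kernel), and the fixed sampling window is a placeholder for synchronizing with the measured benchmark:

    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical sensor node; check the actual path on the board. */
    #define SENSOR_PATH "/sys/bus/i2c/drivers/INA231/3-0040/sensor_W"
    #define PERIOD_US   100000            /* 10 Hz sampling */
    #define N_SAMPLES   600               /* ~60 s window   */

    static double read_power_watts(void)
    {
        FILE *f = fopen(SENSOR_PATH, "r");
        double w = 0.0;
        if (f) { fscanf(f, "%lf", &w); fclose(f); }
        return w;
    }

    int main(void)
    {
        double energy_j = 0.0, prev = read_power_watts();

        for (int s = 0; s < N_SAMPLES; s++) {
            usleep(PERIOD_US);
            double cur = read_power_watts();
            /* trapezoidal integration of power over time */
            energy_j += 0.5 * (prev + cur) * (PERIOD_US / 1e6);
            prev = cur;
        }
        printf("energy to solution: %.2f J\n", energy_j);
        return 0;
    }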

When using the HMP mode, some applications exhibit a very high EtoS; particularly high are those of the lud and nn applications executed in the A7/A15 configuration running at 200 MHz/2 GHz. Our experiments also show that HMP is less energy-efficient than the big cluster running at maximum frequency (i.e., A15 at 2 GHz). In conclusion, static thread scheduling achieves a highly suboptimal use of our heterogeneous architecture, which turns out to be slower and less energy-efficient than a single big cluster.

Further investigations were carried out with the Scalasca [7] and Vampir [8] software tools, which permit instrumenting the code and visualizing low-level behavior based on collected execution traces. Figure 3 shows a snapshot of the execution trace of the lud application alongside a zoom on two consecutive parallel-for loop constructs. It is clearly visible that the OpenMP runtime spawned eight threads, which were assigned to the eight cores. The four threads assigned to the A15 cores completed the execution of their chunks significantly faster than those assigned to the A7 cores. As a result, the critical path of the execution is determined by the slowest (A7) cores, which slows down overall performance.

Fig. 3: lud on HMP big.LITTLE at 200 MHz/2 GHz (trace legend: master thread, OMP worker threads, OMP barrier idle time).

Dynamic and Guided Thread Scheduling. Figures 4(a-b) illustrate the execution time using dynamic and guided thread scheduling, respectively, normalized to the static scheduling discussed previously. Dynamic scheduling achieves good speedups for some applications (e.g., nn) but also degrades the performance of others (e.g., nw). Something very similar happens with guided scheduling, but with different application/configuration sets: for example, heartwall is now degraded in the 1.4 GHz/200 MHz configuration, while nn achieves a 1.8x speedup. Figures 4(c-d) show the EtoS of dynamic and guided scheduling, respectively, normalized to static scheduling. We observe a very high correlation with the corresponding execution time graphs. Accordingly, we conclude that no existing policy is generally superior: the best policy depends on the application and on the architecture configuration. However, we believe that none of these policies is able to fully leverage the heterogeneity of our architecture and that more intelligent thread scheduling policies are needed to sustain the energy efficiency promised by single-ISA heterogeneous multicore systems.

Fig. 4: Normalized execution time speedup and EtoS.
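As a hint of what a more intelligent policy could look like, the following toy C sketch replaces the uniform static split with a weighted one, giving big cores twice as many iterations as LITTLE ones. Both the 2x weight and the thread-to-cluster mapping are illustrative assumptions; a practical policy would calibrate the weights per kernel and pin threads to the corresponding cores:

    #include <stdio.h>
    #include <omp.h>

    #define N         100000
    #define N_THREADS 8

    int main(void)
    {
        static double a[N];
        /* assumed mapping: threads 0-3 on LITTLE (weight 1),
         * threads 4-7 on big (weight 2) */
        const double weight[N_THREADS] = {1, 1, 1, 1, 2, 2, 2, 2};
        double total = 0.0;
        for (int t = 0; t < N_THREADS; t++) total += weight[t];

        #pragma omp parallel num_threads(N_THREADS)
        {
            int t = omp_get_thread_num();
            /* prefix sum of weights gives this thread's iteration range */
            double before = 0.0;
            for (int k = 0; k < t; k++) before += weight[k];
            int lo = (int)(N * before / total);
            int hi = (int)(N * (before + weight[t]) / total);
            for (int i = lo; i < hi; i++)
                a[i] = 2.0 * i;              /* placeholder work */
        }
        printf("a[%d] = %.1f\n", N - 1, a[N - 1]);
        return 0;
    }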

4 Conclusion

In this paper, we evaluated the performance and energy trade-offs of a single-ISA heterogeneous multicore system. The investigations were conducted on the Odroid-XU3 board, which includes the ARM big.LITTLE Exynos 5 Octa (5422) chip. We provided performance and energy results on the Rodinia OpenMP benchmark suite using the typical loop scheduling policies, i.e., static, dynamic and guided. The results show that these policies make inefficient use of the heterogeneous cores. Therefore, we conclude that further research is required to propose suitable scheduling policies able to leverage the superior energy efficiency of the LITTLE cores while maintaining the faster execution times of the big cores.

5 Acknowledgement

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under the Mont-Blanc 2 project (http://www.montblanc-project.eu), grant agreement no. 610402.

References

1. R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas, "Single-ISA heterogeneous multi-core architectures for multithreaded workload performance," in Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA '04), Washington, DC, USA, p. 64, IEEE Computer Society, 2004.
2. OpenMP Architecture Review Board, "The OpenMP API specification for parallel programming." http://openmp.org/wp/, November 2015.
3. Samsung, "Exynos Octa SoC." http://www.samsung.com/, November 2015.
4. B. Jeff, "big.LITTLE technology moves towards fully heterogeneous Global Task Scheduling." http://www.arm.com/files/pdf/, November 2013.
5. S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in IEEE International Symposium on Workload Characterization (IISWC 2009), pp. 44-54, Oct 2009.
6. ARM, CoreLink CCI-400 Cache Coherent Interconnect Technical Reference Manual, November 16, 2012. Revision r1p1.
7. Scalasca. http://www.scalasca.org/, November 2015.
8. Vampir - performance optimization. https://www.vampir.eu/, November 2015.