Thread Affinity Experiments

Power implications on Exynos

Introduction

The LPGPU2 Profiling Tool and API provide support for CPU thread affinity locking and logging. Although this functionality is not available on every device, it is a very powerful addition where it is. Giving the developer the ability to decide which threads run where matters more than ever, because scheduling systems designed for symmetric multi-processing (SMP), such as the Linux Completely Fair Scheduler (CFS), are no longer entirely fit for purpose in heterogeneous multi-processing (HMP) environments, which are by definition unequal: not every core offers the same performance or power cost. The HMP approach has been adopted by all major System-on-Chip (SoC) vendors, whose systems generally comprise two clusters of CPU cores (depicted in Figure 1): a high-performance (and high-power) big cluster, and a lower-performance (but lower-power) little cluster. The cores within each cluster are identical.

Figure 1: Common octa-core arrangement comprising a big and a little CPU cluster of four identical cores each

This situation raises a number of pertinent questions about what is best to do:

- Is it better to let an app run unrestricted, giving the scheduler free rein?
- Will better performance be achieved if the app is constrained to run on the big cluster?
- Will worse performance be seen if the app is constrained to run on the little cluster?
- If a single-threaded app is constrained to run on one core, perhaps saving on migration, would this be better, worse or no different than confining it to the cluster of that core?
- Is there a general rule, or is best practice strongly dependent on the characteristics of a given app or device?

This series of experiments explores some of these questions using the LPGPU2 Profiling Tool on the LPGPU2 Hypercube test app and a Samsung Galaxy S7 (SM-G930F). The purpose of these experiments is to investigate just how varied behaviour and performance can be under different CPU affinity locking scenarios, in the hope of shedding some light on the questions posed.

Experimental Device

The device chosen for these experiments is a Samsung Galaxy S7 (SM-G930F). It is based on the Exynos 8890 processor, and its basic specification is shown in Table 1.

Device       SM-G930F
Resolution   1440 x 2560
RAM          4 GB
OS           Android 6.0 (Marshmallow)
Chipset      Exynos 8890 Octa
GPU          Mali-T880
CPU count    8
CPUs         4 x 2.3 GHz Mongoose, 4 x 1.6 GHz Cortex-A53

Table 1: Experimental device, Samsung Galaxy S7 (SM-G930F), basic specification

Experimental App

The Hypercube app was chosen because it is very lightweight, offering a high frame rate which may, ironically, cause more work to be done on the CPU. Figure 2 shows some typical frames from the Hypercube app.

Figure 2: Hypercube tumbling in 4D

Analysis

The Hypercube app was extended to make setting the CPU affinity mask as simple as changing the value of an enum. The app was also updated to report the CPU affinity of the main thread exactly once per frame, and to report frames per second (FPS) to User Counter 0 (a sketch of this kind of instrumentation is given below).

The updated Hypercube app was installed on the device, and in this initial configuration the thread affinity was unrestricted. This was the first time we had directly observed the built-in thread migration behaviour of a device in LPGPU2, although we have seen hints of it many times in the live CPU Load counter profiles of almost all previous experiments: one core drops from full load to near zero just as another core ramps up, while performance remains unaffected. This common pattern could be explained by thread migration. In these experiments we would expect to see the behaviour explicitly.

Also in previous experiments we have noted that it can take some minutes for a device to settle down after collection has begun. This can be especially problematic when trying to diagnose the asymptotic power usage characteristics of a particular app/device pairing. Because of this, Timer Mode was used for collection. In this mode, the user still starts a collection explicitly, but the collection then runs for a pre-set period and terminates automatically at the end of it. In the example shown in Figure 4, collection is set for five minutes, the period used for collecting in these experiments.
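
The sketch below illustrates the kind of switch described above, assuming a plain Linux/Android NDK build: an enum selects a cpuset, the standard sched_setaffinity call applies it to the calling thread, and sched_getcpu supplies the once-per-frame affinity report. The enum values and the apply_affinity/report_frame names are illustrative only, not the actual Hypercube or LPGPU2 API symbols, and the User Counter 0 call is left as a placeholder comment rather than guessing its real signature.

```c
/*
 * A minimal sketch, assuming a plain Linux/Android NDK build. The enum
 * values and function names are illustrative; they are not the actual
 * Hypercube or LPGPU2 API symbols.
 */
#define _GNU_SOURCE
#include <sched.h>   /* sched_setaffinity, sched_getcpu, CPU_SET */
#include <stdio.h>

typedef enum {
    AFFINITY_UNRESTRICTED,    /* all eight cores */
    AFFINITY_BIG_CLUSTER,     /* cores 0-3 on this Exynos 8890 device */
    AFFINITY_LITTLE_CLUSTER,  /* cores 4-7 */
    AFFINITY_STRADDLE         /* cores 2-5, spanning both clusters */
} affinity_mode_t;

/* Apply the chosen cpuset to the calling thread (pid 0 = this thread). */
static int apply_affinity(affinity_mode_t mode)
{
    int first = 0, last = 7;                        /* unrestricted default */
    if (mode == AFFINITY_BIG_CLUSTER)    { first = 0; last = 3; }
    if (mode == AFFINITY_LITTLE_CLUSTER) { first = 4; last = 7; }
    if (mode == AFFINITY_STRADDLE)       { first = 2; last = 5; }

    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = first; c <= last; ++c)
        CPU_SET(c, &set);

    return sched_setaffinity(0, sizeof(set), &set);
}

/* Called once per frame: log the core the main thread is currently on
 * and the measured frame rate. */
static void report_frame(double fps)
{
    printf("main thread on core %d, %.1f FPS\n", sched_getcpu(), fps);
    /* lpgpu2_user_counter(0, fps);   <- placeholder, not the real API */
}
```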

Figure 4: Collection Mode selection panel showing Timer mode selected for 5 minutes

To further mitigate the unpredictable transient effects observed across all counter profiles, each experiment was conducted four times. This helps expose how repeatable any particular result is. The CPU affinity results for the single-threaded Hypercube app with unrestricted CPU affinity are shown in Figure 5; they come from the four independent runs. Battery power, the most pertinent measure in the present experiments, is shown alongside the CPU affinity, and although each run is different, a number of features are common to the four profiles (a)-(d).

Figure 5: Four experiments showing CPU affinity and power consumption when CPU affinity is not restricted

First, it is clear from the profiles in Figure 5 that battery power reduces over time. There is wide variation in the initial and final power consumption values, but in every run power falls to a fraction of its initial value. This is not due to anything within LPGPU2; it is simply the underlying black-box system responding to the shock of a collection being started. It does this by migrating processes and threads, adjusting process priorities, and no doubt many other proprietary tricks, all in order to reduce power while maintaining performance.

Secondly, it is clear that the app's thread migrates very often, and does so across all eight cores of the device. Upon closer inspection, however, it becomes clear that the app spends more time running on the lower-numbered cores (0, 1, 2, ...) than on the higher-numbered cores (..., 5, 6, 7). The LPGPU2 Profiling Tool displays the instantaneous values of the CPU frequencies, which immediately reveals that cores 0-3 represent the big cluster and cores 4-7 represent the little cluster. It is interesting to note that, by default, the system prefers to run the app on the big cluster, but also that it does not do so exclusively.

The next sequence of experiments investigates power consumption when the app is tied to one of the clusters: first to the big cluster (cores 0, 1, 2 and 3) and then to the little cluster (cores 4, 5, 6 and 7). Figure 6 shows the result of four identical experiments profiling the Hypercube when tied to the big cluster for exactly 5 minutes.
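
As an aside, where the standard Linux cpufreq sysfs nodes are readable on a device (they are not always exposed to unprivileged apps), the same big/little split can be inferred directly from each core's maximum frequency: on the Exynos 8890 the 2.3 GHz and 1.6 GHz clusters separate cleanly. The following is only a hedged sketch of that idea, not part of the LPGPU2 tooling.

```c
/*
 * A minimal sketch, assuming the standard cpufreq sysfs nodes are
 * readable on the device. Cores are grouped by maximum frequency:
 * ~2.3 GHz (big) vs ~1.6 GHz (little) on the Exynos 8890.
 */
#include <stdio.h>

int main(void)
{
    for (int cpu = 0; cpu < 8; ++cpu) {
        char path[96];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq", cpu);

        FILE *f = fopen(path, "r");
        if (!f) {
            printf("cpu%d: cpuinfo_max_freq not readable\n", cpu);
            continue;
        }

        long khz = 0;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu%d: max %ld kHz -> %s cluster\n",
                   cpu, khz, khz >= 2000000 ? "big" : "little");
        fclose(f);
    }
    return 0;
}
```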

Figure 6: Four experiments with the CPU affinity mask tied to the big cluster (cores 0, 1, 2 and 3); power consumption and affinity shown

Firstly, it is clear that the behaviour of the app when constrained to the big cluster is very similar in form to the unrestricted affinity tests, in that power is initially high and then reduces over the period of the experiment. However, it should be noted that the results presented in Figure 6d are very odd and do not fit the pattern. No explanation can be offered for this, except to say that with a system as complex as a modern Android device it is simply not possible to know everything that is running, or why certain processes are spawned or woken at any given time. Such odd results and artefacts appear in profiles from time to time regardless of device or app; the only common factor is the operating system.

Secondly, it is clear from all four profiles that the system has honoured the request to lock the CPU affinity to the cpuset prescribed by the LPGPU2 API call. This is noteworthy because the cpuset bitmask is interpreted as a request; the system is not obliged to honour it (a minimal read-back check is sketched below).

Thirdly, it is clear from all profiles that the thread is migrated very often. This is evident from the almost solid blue band in the CPU Affinity counter profile, which covers the values 0, 1, 2 and 3, the indices of the cores exclusively requested.

Finally, it is most interesting to note that (with the exception of the strange profile 6d) the asymptotic power consumption is approximately 50% of that when the app is allowed to run unrestricted. This is an exciting result. An enormous power reduction has been achieved with trivial modifications to the code, but the reason is not immediately obvious. If the big cluster is more expensive (in power) than the little cluster, why does limiting execution to the higher-power cluster result in a power reduction? An analysis of the Exynos architecture reveals that the cores of each cluster share an L2 cache: 2 MB for the big cluster and 256 KB for the little cluster. It could be that allowing the system to migrate the app between clusters invalidates these caches, incurring a cost on other subsystems such as memory and buses. It is easy to imagine how constraining an app to run on one cluster could reduce this. If this phenomenon really is responsible for the power savings observed, then constraining the app to run on the little cluster may result in similar, perhaps even greater, power reductions. The next experiment was designed to explore this, and Figure 7 shows the results of constraining the app to run on the little cluster exclusively.
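
On the point above about the mask being only a request: the simplest confirmation of what was actually granted is to read the installed mask back, which is effectively what the CPU Affinity counter profile shows over time. The sketch below uses the standard Linux sched_getaffinity call, not the LPGPU2 API, and assumes an eight-core device.

```c
/*
 * A minimal sketch: after requesting a cpuset, read back the mask the
 * kernel actually installed for the calling thread. Uses the standard
 * Linux call, not the LPGPU2 API.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static void print_granted_affinity(void)
{
    cpu_set_t granted;
    CPU_ZERO(&granted);

    if (sched_getaffinity(0, sizeof(granted), &granted) != 0) {
        perror("sched_getaffinity");
        return;
    }

    printf("cores granted:");
    for (int c = 0; c < 8; ++c)
        if (CPU_ISSET(c, &granted))
            printf(" %d", c);
    printf("\n");
}
```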

Figure 7: Four experiments with the CPU affinity mask tied to the little cluster (cores 4, 5, 6 and 7); power consumption and affinity shown

As in the previous experiments, four collections were run and a similar pattern emerges: power is high initially and reduces over the duration of the experiment. Once again the fourth experiment in the series, shown in Figure 7d, is unusual, though it still shows an overall reduction in power with time. It is interesting to note that thread migrations are much sparser on the little cluster, and that they show a bias towards cores 4 and 5, a feature not visible (at least by eye) in the profiles for the big-cluster constraint (cores 0-3).

The overall power reduction is still considerable compared with running unrestricted, but it is not noticeably (if at all) greater than the power reduction seen when constraining the app to the big cluster. This supports the hypothesis that cache invalidations, and the extra work required to repopulate a cache after migration, are responsible for the increased power consumption seen in the unrestricted case. If hardware counters reporting cache invalidations were made available to the LPGPU2 Profiling Tool, this hypothesis could be tested more rigorously.

Further Experiments

There is no end to the number of experiments that can be devised in an attempt to tease out the nature of the black-box algorithms responsible for thread migration on a given device. However, with the encouraging discovery that constraining an app to run on a single cluster seems to yield enormous power savings, another experiment presents itself. In the next experiment, the app is once again constrained to run on only four of the cores, but this time those cores straddle the clusters. If cache invalidation is indeed responsible, then power consumption in this regime should be similar to the unconstrained experiment, or at least worse than running constrained to either cluster exclusively.

Four collections from the same Hypercube app constrained to cores 2, 3, 4 and 5 were run. Cores 2 and 3 reside in the big cluster, and cores 4 and 5 in the little cluster, so this run straddles the clusters. Figure 8 shows the results of the experiment, and a familiar pattern is seen: power starts high and reduces over time, although the final power value in each of the experiments is significantly greater than in the experiments with affinity tied exclusively to either cluster; the lowest current in this series is greater than 150 mA, compared with less than 100 mA for the previous two cluster-constrained experiments. The results of the present experiment are comparable with the unconstrained case. The accompanying CPU affinity profiles of Figure 8 confirm that the app is indeed constrained to cores 2, 3, 4 and 5 and that it is being migrated between the clusters.

Figure 8: Four experiments showing power consumption and CPU affinity when CPU affinity is constrained to four cores spanning the clusters (cores 2, 3, 4 and 5)
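
For reference, the restricted cpusets used in these experiments can also be written as plain bitmasks (bit n set means core n is allowed); the straddling configuration of Figure 8 corresponds to 0x3C. The sketch below converts such a mask into a cpu_set_t and applies it to the calling thread; the mask values are an assumption based on the core numbering observed on this device (0-3 big, 4-7 little).

```c
/* A minimal sketch: the restricted cpusets as raw bitmasks (bit n = core n),
 * assuming the core numbering observed on this device (0-3 big, 4-7 little). */
#define _GNU_SOURCE
#include <sched.h>

#define MASK_BIG      0x0Fu   /* cores 0-3: binary 00001111 */
#define MASK_LITTLE   0xF0u   /* cores 4-7: binary 11110000 */
#define MASK_STRADDLE 0x3Cu   /* cores 2-5: binary 00111100 */

/* Convert a bitmask into a cpu_set_t and apply it to the calling thread. */
static int apply_mask(unsigned mask)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < 8; ++c)
        if (mask & (1u << c))
            CPU_SET(c, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
}

/* e.g. apply_mask(MASK_STRADDLE) reproduces the Figure 8 configuration. */
```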

CPU Affinity Patterns

The experiments on the Exynos device show a consistent preference for scheduling threads onto the lower-numbered cores. In particular, cores 0, 1, 2 and 3 are the most preferred, cores 4 and 5 are the next most popular, and cores 6 and 7 are the least popular. This means the scheduler prefers the big cluster over the little cluster for running the LPGPU2 test apps, and that when the little cluster is chosen, cores 4 and 5 are preferred over cores 6 and 7.

Figure 9 shows a temporal zoom of some selected experiments to reveal the finer-scale detail of the scheduling behaviour. Figure 9a shows a pulse in which the thread is regularly scheduled onto the little cluster; although the time axis is not shown in these examples, the pulse frequency is approximately 1 Hz. Figure 9b suggests that there is no scheduling preference between cores within the big cluster, as no clear pattern can be seen. Figure 9c shows a similar experiment constrained to the little cluster (cores 4, 5, 6 and 7), and there is a clear preference for cores 4 and 5 over cores 6 and 7. Figure 9d shows a zoomed section of a straddling experiment. It is a microcosm of the unconstrained experiment in that it reveals favoured scheduling of the big cluster. Not only is more time spent on the big cluster, but scheduling there happens on a much smaller timescale: the time slices on the big cluster appear to be much shorter than those on the little cluster.

a) Unrestricted core affinity
b) Core affinity restricted to the big cluster (cores 0, 1, 2 and 3)
c) Core affinity restricted to the little cluster (cores 4, 5, 6 and 7)
d) Core affinity restricted to 4 cores straddling the clusters

Figure 9: Temporal zoom of Exynos core affinity profiles revealing different scheduling patterns for different affinity masks

Conclusion

These experiments with an octa-core, dual-cluster device show that the ability to specify which threads are permitted to migrate between which CPU cores can be very powerful indeed. Significant power savings, of the order of 50%, are available for little, even trivial, development overhead. This exciting result was achieved by constraining an important task thread to run within a single cluster. The scheduler was free to migrate the thread between the cores of that cluster, but not to migrate it to the other cluster. The choice of which cluster, big or little, the thread was constrained to was much less important than preventing thread migration between the clusters. Further work is required to ascertain the generality of these results: will other (potentially very different) apps benefit from the same approach, and will different devices respond in a similarly positive way?
