Thread Affinity Experiments
- Marshall Mills
Power implications on Exynos

Introduction

The LPGPU2 Profiling Tool and API provide support for CPU thread affinity locking and logging. Although this functionality is not available on every device, it is a very powerful addition where it is. Giving the developer the ability to decide which threads run where is more important than ever, because scheduling systems designed for symmetric multi-processing (SMP) systems, such as the Linux Completely Fair Scheduler (CFS), are no longer entirely fit for purpose in heterogeneous multi-processing (HMP) environments. HMP environments are, by definition, unfair.

The HMP approach has been adopted by all major System-on-Chip (SoC) vendors, whose systems generally comprise two clusters of CPU cores (depicted in Figure 1): a high-performance (and high-power) big cluster, and a lower-performance (but lower-power) little cluster. The cores within each cluster are identical.

Figure 1: Common octa-core arrangement comprising a big and a little CPU cluster of four identical cores each
This situation raises a number of pertinent questions about what is best to do:

- Is it better to let an app run unrestricted, giving the scheduler free rein?
- Will better performance be achieved if the app is constrained to run on the big cluster?
- Will worse performance be seen if the app is constrained to run on the little cluster?
- If a single-threaded app is constrained to run on one core, perhaps saving on migration, would this be better, worse or no different than confining it to the cluster of that core?
- Is there a general rule, or is best practice strongly dependent on the characteristics of a given app or device?

This series of experiments explores some of these questions using the LPGPU2 Profiling Tool on the LPGPU2 Hypercube test app and a Samsung Galaxy S7 G930F. The purpose of these experiments is to investigate just how varied behaviour and performance can be in different CPU affinity locking scenarios, in the hope of shedding some light on the questions posed.

Experimental Device

The device chosen for these experiments is a Samsung Galaxy S7 G930F. It is based on the Exynos 8890 processor, and its basic spec is shown in Table 1.

Device       SM-G930F
Resolution   1440 x 2560
RAM          4 GB
Android      6.0 (Marshmallow)
Chipset      Exynos 8890 Octa
GPU          Mali-T880
CPU count    8
CPUs         4 x 2.3 GHz Mongoose, 4 x 1.6 GHz Cortex-A53

Table 1: Experimental device, Samsung Galaxy S7 G930F basic spec
Experimental App

The Hypercube app was chosen because it is very lightweight, offering a high frame rate which may, ironically, cause more work to be done on the CPU. Figure 2 shows some typical frames from the Hypercube app.

Figure 2: Hypercube tumbling in 4D

Analysis

The Hypercube app was extended to make setting the CPU affinity mask as simple as changing the value of an enum. The app was updated to report the CPU affinity of the main thread exactly once per frame. It was also updated to report frames per second (FPS) to User Counter 0.

The updated Hypercube app was installed on the device, and in this initial configuration the thread affinity was unrestricted. This would be the first time we had directly observed the built-in thread migration behaviour of a device in LPGPU2, although we have seen hints of it many times in the live CPU Load counter profiles of almost all previous experiments: one core drops from full load to near zero just as another core ramps up, while performance remains unaffected. This common pattern could be explained by thread migration. In these experiments we would expect to see the behaviour explicitly.

In previous experiments we have also noted that it can take some minutes for a device to settle down after collection has begun. This can be especially problematic when trying to diagnose the asymptotic power usage characteristics of a particular app and device pairing. Because of this, Timer Mode was used for collection. In this mode, the user still starts a collection explicitly, but the collection then runs for a pre-set period and terminates automatically at the end of it. In the example shown in Figure 4, collection is set for five minutes, the period used for collecting in these experiments.
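The enum-driven affinity mask described in the Analysis section might look something like the following minimal sketch. It is not the Hypercube app's code or the LPGPU2 API: the names AffinityMask, mask_to_cores and apply_mask are hypothetical, the bit layout assumes the core numbering reported in these experiments (cores 0 to 3 big, cores 4 to 7 little), and the request is made through the standard Linux call exposed by Python as os.sched_setaffinity.

```python
import os
from enum import IntEnum


class AffinityMask(IntEnum):
    """Hypothetical enum: bit i set means core i is permitted."""
    UNRESTRICTED = 0b11111111  # all eight cores
    BIG = 0b00001111           # cores 0-3 (assumed big cluster)
    LITTLE = 0b11110000        # cores 4-7 (assumed little cluster)
    STRADDLE = 0b00111100      # cores 2-5, spanning both clusters


def mask_to_cores(mask):
    """Expand a bitmask into the sorted list of core indices it permits."""
    value = int(mask)
    return [core for core in range(value.bit_length()) if (value >> core) & 1]


def apply_mask(mask):
    """Request the affinity for the calling thread.

    Returns False where the platform does not expose sched_setaffinity.
    The operating system may still reject the request (OSError), for
    example if none of the requested cores exist on the machine.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, mask_to_cores(mask))
    return True
```

Switching experiment then amounts to changing one enum value, for example apply_mask(AffinityMask.BIG) for the big-cluster runs.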
Figure 4: Collection Mode selection panel showing Timer mode selected for 5 minutes

To further mitigate the unpredictable transient effects observed across all counter profiles, each experiment was conducted four times. This helps to expose how repeatable any particular result is.

The CPU affinity results of the single-threaded Hypercube app for an unrestricted CPU affinity run are shown in Figure 5. They are from the four independent experiments. Battery power, the most pertinent measure in the present experiments, is shown alongside the CPU affinity. Although each run is different, a number of features are common to the four profiles a-d.
Figure 5 (a-d): Four experiments showing CPU affinity and power consumption when CPU affinity is not restricted
First, it is clear from the profiles in Figure 5 that battery power reduces over time. There is a wide variation in the values of initial and final power consumption, but in every run power reduces to a fraction of its initial value. This is not due to anything within LPGPU2; it is simply the underlying black-box system responding to the shock of a collection being started. It does this by migrating processes and threads, adjusting process priorities, and no doubt many other proprietary tricks, in order to reduce power while maintaining performance.

Secondly, it is clear that the process threads migrate very often, and do so across all eight cores of the device. Upon closer inspection, however, it becomes clear that the app spends more time running on the lower-numbered cores (0, 1, 2, ...) than on the higher-numbered cores (..., 5, 6, 7). The LPGPU2 Profiling Tool displays the instantaneous values of the CPU frequencies, which immediately reveals that cores 0 to 3 represent the big cluster and cores 4 to 7 represent the little cluster. It is interesting to note that, by default, the system prefers to run the app on the big cluster, but also that it does not do so exclusively.

The next sequence of experiments investigates power consumption when the app is tied to one of the clusters, first to the big cluster (cores 0, 1, 2 and 3) and then to the little cluster (cores 4, 5, 6 and 7). Figure 6 shows the result of four identical experiments profiling the Hypercube app when tied to the big cluster for exactly 5 minutes.
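The identification of clusters from per-core frequencies, mentioned above, can be illustrated with a small sketch. The text reads the clusters off the instantaneous frequencies shown by the profiling tool; the version below instead groups cores by their maximum frequency, which is an assumption made for simplicity. The function name group_cores_by_frequency is hypothetical, and the input values are the nominal clock speeds from Table 1 combined with the core numbering observed in the experiments.

```python
from collections import defaultdict


def group_cores_by_frequency(max_freq_khz):
    """Group core indices by maximum frequency, fastest group first.

    max_freq_khz maps a core index to its maximum frequency in kHz.
    """
    groups = defaultdict(list)
    for core, freq in sorted(max_freq_khz.items()):
        groups[freq].append(core)
    return [groups[freq] for freq in sorted(groups, reverse=True)]


# Nominal clocks from Table 1: 4 x 2.3 GHz and 4 x 1.6 GHz.
# The mapping of clocks to core indices follows the observation in the text.
freqs = {core: 2_300_000 for core in range(4)}
freqs.update({core: 1_600_000 for core in range(4, 8)})

big, little = group_cores_by_frequency(freqs)
print("big cluster:", big)
print("little cluster:", little)
```

On a Linux-based device the per-core values could typically be read from the cpufreq entries under /sys/devices/system/cpu, though whether they are readable depends on the device.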
Figure 6 (a-d): Four experiments with CPU mask tied to the big cluster (cores 0, 1, 2 and 3). Power consumption and affinity shown
Firstly, it is clear that the behaviour of the app when constrained to the big cluster is very similar in form to the unrestricted affinity tests, in that power is initially high and then reduces over the period of the experiment. However, it should be noted that the results presented in Figure 6d are very odd and do not fit the pattern. No explanation can be given for this except to say that, with a system as complex as a modern Android device, it is simply not possible to know everything that is running, or why certain processes are spawned or woken at any given time. Such odd results and artefacts appear in profiles from time to time regardless of device or app. The only common factor is the operating system.

Secondly, it is clear from all four profiles that the system has honoured the request to lock the CPU affinity to the cpuset prescribed by the LPGPU2 API call. This is noteworthy because the cpuset bitmask is interpreted as a request; the system is not obliged to honour it.

Thirdly, it is clear from all profiles that the thread is migrated very often. This is evident from the almost solid blue bar of the CPU Affinity counter profile, which covers the values 0, 1, 2 and 3, the indices of the cores exclusively requested.

Finally, it is most interesting to note that (with the exception of the strange profile 6d) the asymptotic power consumption is approximately 50% of that when the app is allowed to run unrestricted. This is an exciting result. An enormous power reduction has been achieved with trivial modifications to the code, but the reason is not immediately obvious. If the big cluster is more expensive (in power) than the little cluster, why does limiting execution to the higher-power cluster result in a power reduction?

An analysis of the Exynos architecture reveals that the cores of each cluster share an L2 cache: 2 MB for the big cluster and 256 KB for the little cluster. It could be that allowing the system to migrate the app between clusters is invalidating these caches, incurring a cost on other subsystems such as memory and buses. It is easy to imagine how constraining an app to run on one cluster could reduce this.

If this phenomenon really is responsible for the power savings observed, then constraining the app to run on the little cluster may result in similar power reductions, perhaps even greater. The next experiment was designed to explore this, and Figure 7 shows the results of constraining the app to run on the little cluster exclusively.
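Since the cache hypothesis hinges on how often the thread crosses between clusters, one way to quantify this from the per-frame affinity log is to count migrations and classify each as within-cluster or cross-cluster. The sketch below is illustrative only: cluster_of and count_migrations are hypothetical names, the cluster layout (cores 0 to 3 big, the rest little) is taken from the observations above, and the sample trace is invented rather than measured.

```python
BIG_CORES = frozenset({0, 1, 2, 3})  # assumed big cluster; all other cores are little


def cluster_of(core):
    """Return the cluster name for a core index under the assumed layout."""
    return "big" if core in BIG_CORES else "little"


def count_migrations(trace):
    """Count (within_cluster, cross_cluster) migrations in a per-frame core trace."""
    within = cross = 0
    for prev, curr in zip(trace, trace[1:]):
        if prev == curr:
            continue  # same core on consecutive frames: no migration
        if cluster_of(prev) == cluster_of(curr):
            within += 1
        else:
            cross += 1
    return within, cross


# Invented example trace: one core index per frame.
sample_trace = [0, 1, 1, 4, 5, 2, 2, 3]
within, cross = count_migrations(sample_trace)
print(f"within-cluster: {within}, cross-cluster: {cross}")
```

Under the hypothesis, runs with a high cross-cluster count would be expected to settle at higher power than runs with none; the count alone does not prove that cache invalidation is the cause.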
Figure 7 (a-d): Four experiments with CPU mask tied to the little cluster (cores 4, 5, 6 and 7). Power consumption and affinity shown
As for previous experiments, four collections were run and a similar pattern emerges. Power is high initially and reduces over the duration of the experiment. Coincidentally, the fourth experiment in the series, shown in Figure 7d, is again unusual, though it still represents an overall reduction in power with time.

It is interesting to note that thread migrations are much sparser on the little cluster, and furthermore that they appear to be biased towards cores 4 and 5, a feature not visible (at least by eye) in the profile for the big cluster (cores 0 to 3).

The overall power reduction is still considerable compared with running unrestricted, but it is not noticeably (if at all) greater than the power reduction seen when constraining the app to run on the big cluster. This supports the hypothesis that cache invalidations are responsible for the increased power consumption, due to the extra work required in populating a new cache. If hardware counters reporting cache invalidation were made available to the LPGPU2 Profiling Tool, the hypothesis could be tested more rigorously.

Further experiments

There is no end to the number of experiments that can be devised in an attempt to tease out the nature of the black-box algorithms responsible for thread migration on a given device. However, with the encouraging discovery that constraining an app to run on a single cluster seems to yield enormous power savings, another experiment presents itself.

In the next experiment, the app is once again constrained to run on only four of the cores, but this time those cores straddle the clusters. If cache invalidation is indeed responsible, then power consumption in this regime should be similar to the unconstrained experiment, or at least should be worse than running constrained to either cluster exclusively.

Four collections from the same Hypercube app constrained to cores 2, 3, 4 and 5 were run. Cores 2 and 3 reside in the big cluster, and cores 4 and 5 are in the little cluster, so this run straddles the clusters. Figure 8 shows the results of the experiment, and a familiar pattern is seen: power starts high and reduces over time. However, the final power value in each of the experiments is significantly greater than in the experiments with affinity tied exclusively to either cluster; the lowest current in this series is greater than 150 mA. Contrast that with less than 100 mA for the previous two cluster-constrained experiments. The results of the present experiment are comparable with the unconstrained case.

Looking at the accompanying CPU affinity profiles of Figure 8, it is confirmed that the app is indeed constrained to cores 2, 3, 4 and 5 and that it is being migrated between the clusters.
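Comparisons of final current such as the one above depend on how the settled value is extracted from a profile that starts high and decays. One simple option, sketched here, is to average only the last part of each run. The function name asymptotic_mean and the choice of tail fraction are assumptions, and the numbers are synthetic placeholders in mA, not measurements from these experiments.

```python
from statistics import mean


def asymptotic_mean(samples, tail_fraction=0.2):
    """Average the final tail_fraction of a run to estimate its settled value."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < tail_fraction <= 1:
        raise ValueError("tail_fraction must be in (0, 1]")
    tail_len = max(1, int(len(samples) * tail_fraction))
    return mean(samples[-tail_len:])


# Synthetic current samples (mA) with the high-then-settling shape described in the text.
synthetic_run = [300, 250, 200, 150, 120, 100, 95, 90, 90, 90]
print("settled current (last 20%):", asymptotic_mean(synthetic_run))
print("settled current (last 50%):", asymptotic_mean(synthetic_run, 0.5))
```

A shorter tail is less contaminated by the initial transient but noisier; with five-minute collections and repeated runs, reporting the tail mean per run makes the spread between repeats explicit.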
Figure 8 (a-d): Four experiments showing power consumption and CPU affinity when CPU affinity is constrained to four cores spanning the clusters (cores 2, 3, 4 and 5)
CPU Affinity Patterns

The experiments for the Exynos device show a consistent preference for scheduling threads to lower-numbered cores. In particular, cores 0, 1, 2 and 3 are the most preferred, cores 4 and 5 are the next most popular, and cores 6 and 7 are the least popular. This means the scheduler prefers the big cluster over the little cluster for running the LPGPU2 test apps, and that when the little cluster is chosen, cores 4 and 5 are preferred over cores 6 and 7.

Figure 9 shows a temporal zoom of some selected experiments to reveal the finer-scale detail of the scheduling behaviour. Figure 9a manifests a pulse, regularly scheduling the thread to the little cluster. Although the time axis is not shown in these examples, the pulse frequency is approximately 1 Hz. Figure 9b suggests that there is no scheduling preference for cores within the big cluster, as no clear pattern can be seen. Figure 9c shows a similar experiment but constrained to the little cluster (cores 4, 5, 6 and 7), and there is a clear preference for cores 4 and 5 over cores 6 and 7. Figure 9d shows a zoomed section of a straddling experiment. It is a microcosm of the unconstrained experiment in that it reveals favoured scheduling of the big cluster. Not only is more time spent in the big cluster, but scheduling on the big cluster happens on a much smaller timescale than on the little cluster; that is, the time slices of big-cluster processes appear to be much shorter than those of little-cluster processes.
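The core preferences described above were judged by eye from the profiles. A per-core residency figure computed from the once-per-frame affinity log would make the same comparison numerically. The following is a minimal sketch under stated assumptions: core_residency is a hypothetical name, each log entry is taken to represent an equal share of time (one frame), and the sample trace is invented for illustration.

```python
from collections import Counter


def core_residency(trace, num_cores=8):
    """Return the fraction of logged frames spent on each core, indexed by core."""
    counts = Counter(trace)
    total = len(trace)
    if total == 0:
        return [0.0] * num_cores
    return [counts[core] / total for core in range(num_cores)]


# Invented example trace: one core index per frame.
sample_trace = [0, 0, 1, 2, 3, 4, 4, 5]
residency = core_residency(sample_trace)
big_share = sum(residency[:4])     # cores 0-3, assumed big cluster
little_share = sum(residency[4:])  # cores 4-7, assumed little cluster

for core, share in enumerate(residency):
    print(f"core {core}: {share:.3f}")
print(f"big cluster: {big_share:.3f}, little cluster: {little_share:.3f}")
```

Because frames are not of equal duration, a frame count is only a proxy for time; weighting each entry by its frame time would be more faithful if timestamps are logged.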
Figure 9: Temporal zoom of Exynos core affinity profiles revealing different scheduling patterns for different affinity masks
a) Unrestricted core affinity
b) Core affinity restricted to the big cluster (cores 0, 1, 2 and 3)
c) Core affinity restricted to the little cluster (cores 4, 5, 6 and 7)
d) Core affinity restricted to 4 cores straddling the clusters

Conclusion

These experiments with an octa-core dual-cluster device show that the ability to specify which threads are permitted to migrate between which CPU cores can be very powerful indeed. Significant power savings, of the order of 50%, are available for little, even trivial, development overhead.

This exciting result was achieved by constraining an important task thread to run within a single cluster. The scheduler was free to migrate the thread between the cores of the cluster, but not to migrate the thread to the other cluster. The choice of which cluster, big or little, the thread was constrained to was much less important than preventing thread migration between the clusters.

Further work is required to ascertain the generality of these results. Will other (potentially very different) apps benefit from the same innovation, and will different devices respond in a similarly positive way?
6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding
More informationA task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b
5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 1 School of
More informationCS 326: Operating Systems. CPU Scheduling. Lecture 6
CS 326: Operating Systems CPU Scheduling Lecture 6 Today s Schedule Agenda? Context Switches and Interrupts Basic Scheduling Algorithms Scheduling with I/O Symmetric multiprocessing 2/7/18 CS 326: Operating
More informationChapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Ninth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
More information«Real Time Embedded systems» Multi Masters Systems
«Real Time Embedded systems» Multi Masters Systems rene.beuchat@epfl.ch LAP/ISIM/IC/EPFL Chargé de cours rene.beuchat@hesge.ch LSN/hepia Prof. HES 1 Multi Master on Chip On a System On Chip, Master can
More informationIX: A Protected Dataplane Operating System for High Throughput and Low Latency
IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this
More informationNetwork Design Considerations for Grid Computing
Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom
More informationImplementing Scheduling Algorithms. Real-Time and Embedded Systems (M) Lecture 9
Implementing Scheduling Algorithms Real-Time and Embedded Systems (M) Lecture 9 Lecture Outline Implementing real time systems Key concepts and constraints System architectures: Cyclic executive Microkernel
More information8/28/12. CSE 820 Graduate Computer Architecture. Richard Enbody. Dr. Enbody. 1 st Day 2
CSE 820 Graduate Computer Architecture Richard Enbody Dr. Enbody 1 st Day 2 1 Why Computer Architecture? Improve coding. Knowledge to make architectural choices. Ability to understand articles about architecture.
More informationAssignment 1 due Mon (Feb 4pm
Announcements Assignment 1 due Mon (Feb 19) @ 4pm Next week: no classes Inf3 Computer Architecture - 2017-2018 1 The Memory Gap 1.2x-1.5x 1.07x H&P 5/e, Fig. 2.2 Memory subsystem design increasingly important!
More informationGrand Central Dispatch
A better way to do multicore. (GCD) is a revolutionary approach to multicore computing. Woven throughout the fabric of Mac OS X version 10.6 Snow Leopard, GCD combines an easy-to-use programming model
More informationGV-System V8.7 Supports H.265 GPU Decoding
GV-System V8.7 Supports H.265 GPU Decoding Article ID: V1-16-07-15-a Applied to GV-System V8.7 Release Date: 07/15/2016 Summary It takes both Intel Skylake platform and GV-System V8.7 to enable the highly
More informationRandom Access Memory (RAM)
Best known form of computer memory. "random access" because you can access any memory cell directly if you know the row and column that intersect at that cell. CS1111 CS5020 - Prof J.P. Morrison UCC 33
More informationECE 471 Embedded Systems Lecture 2
ECE 471 Embedded Systems Lecture 2 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 7 September 2018 Announcements Reminder: The class notes are posted to the website. HW#1 will
More informationChapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.
More informationDiffusing Your Mobile Apps: Extending In-Network Function Virtualisation to Mobile Function Offloading
Diffusing Your Mobile Apps: Extending In-Network Function Virtualisation to Mobile Function Offloading Mario Almeida, Liang Wang*, Jeremy Blackburn, Konstantina Papagiannaki, Jon Crowcroft* Telefonica
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationSupercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?
Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA
More informationCache introduction. April 16, Howard Huang 1
Cache introduction We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? The rest of CS232 focuses on memory and input/output issues, which are frequently
More informationHow to Optimize the Scalability & Performance of a Multi-Core Operating System. Architecting a Scalable Real-Time Application on an SMP Platform
How to Optimize the Scalability & Performance of a Multi-Core Operating System Architecting a Scalable Real-Time Application on an SMP Platform Overview W hen upgrading your hardware platform to a newer
More informationCorrelation based File Prefetching Approach for Hadoop
IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie
More informationLinux Kernel Hacking Free Course
Linux Kernel Hacking Free Course 3 rd edition G.Grilli, University of me Tor Vergata IRQ DISTRIBUTION IN MULTIPROCESSOR SYSTEMS April 05, 2006 IRQ distribution in multiprocessor systems 1 Contents: What
More informationWhat's New in vsan 6.2 First Published On: Last Updated On:
First Published On: 07-07-2016 Last Updated On: 08-23-2017 1 1. Introduction 1.1.Preface 1.2.Architecture Overview 2. Space Efficiency 2.1.Deduplication and Compression 2.2.RAID - 5/6 (Erasure Coding)
More informationModule 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.
MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line
More informationSoftware-defined Storage by Veritas
ESG Technology Showcase Software-defined Storage by Veritas Date: August 2015 Author: Scott Sinclair, Analyst Abstract: The days of enterprise storage technology being predominantly constrained to specific
More informationPerfect Timing. Alejandra Pardo : Manager Andrew Emrazian : Testing Brant Nielsen : Design Eric Budd : Documentation
Perfect Timing Alejandra Pardo : Manager Andrew Emrazian : Testing Brant Nielsen : Design Eric Budd : Documentation Problem & Solution College students do their best to plan out their daily tasks, but
More informationEvaluating external network bandwidth load for Google Apps
Evaluating external network bandwidth load for Google Apps This document describes how to perform measurements to better understand how much network load will be caused by using a software as a service
More informationCS 136: Advanced Architecture. Review of Caches
1 / 30 CS 136: Advanced Architecture Review of Caches 2 / 30 Why Caches? Introduction Basic goal: Size of cheapest memory... At speed of most expensive Locality makes it work Temporal locality: If you
More informationExamining Load Average
Examining Load Average Source : http://www.linuxjournal.com/article/9001 Understanding work-load averages as opposed to CPU usage Many Linux administrators and support technicians regularly use the top
More informationIBM MQ Appliance HA and DR Performance Report Version July 2016
IBM MQ Appliance HA and DR Performance Report Version 2. - July 216 Sam Massey IBM MQ Performance IBM UK Laboratories Hursley Park Winchester Hampshire 1 Notices Please take Note! Before using this report,
More informationExpanding Opportunities in Clamshell Devices. Laurence Bryant VP Strategic Marketing
Expanding Opportunities in Clamshell Devices Laurence Bryant VP Strategic Marketing 1 PC Mobile Ecosystem Scaling The Richness Of Small Screen Experiences The smartphone and tablet ecosystem is shaping
More informationMiraVision Picture Quality Enhancement Technology for Displays MediaTek White Paper January 2015
Picture Quality Enhancement Technology for Displays MediaTek White Paper January 2015 2015 MediaTek Inc. 1 The Total Solution to Picture Quality Enhancement In multi media technology the display interface
More informationDecoupled Access-Execute on ARM big.little
IT 16 059 Examensarbete 30 hp Augusti 2016 Decoupled Access-Execute on ARM big.little Anton Weber Institutionen för informationsteknologi Department of Information Technology Abstract Decoupled Access-Execute
More information