Initial Results on the Performance Implications of Thread Migration on a Chip Multi-Core

3rd HiPEAC Workshop, IBM, Haifa, 17-4-2007
Initial Results on the Performance Implications of Thread Migration on a Chip Multi-Core
P. Michaud, L. He, D. Fetis, C. Ioannou, P. Charalambous and A. Seznec
Department of Computer Science, Nicosia / IRISA-INRIA, Campus de Beaulieu, 35042 Rennes Cedex, France

Motivation
The computer industry faces conflicting challenges:
- the market calls for high-performance cores
- the need for low cost, low power and short time to market
One industry response is single-chip multi-core processors. Multi-cores can increase performance through thread- and program-level parallelism.

Temperature Wall
But parallel execution can result in:
- overall higher power consumption
- a non-uniform power-density map
- temperature increase
(Figure: thermal map produced by ATMI using a simulated power-density map, http://www.irisa.fr/caps/projects/atmi/)

Harness Temperature (1)
- Improve packaging and cooling (exotic/expensive)
- Temperature is no longer a packaging-only issue: it matters for circuit, microarchitecture and OS design

Harness Temperature (2)
DTM techniques:
- turn off the clock
- operate cores at a lower voltage and frequency
to ensure correct and/or efficient operation. Cost: lower performance.
Aim of this work: solve the temperature problem with minimal performance loss using Activity Migration (AM).

Activity Migration (1)
- Leverage multi-cores to distribute activity and temperature over the entire chip
- Basic idea: transfer a thread from a hot core to a cold core (swap threads)
- Example: 4 cores and 4 threads*
*"A Study of Thread Migration in Temperature-Constrained Multicores", P. Michaud et al., to appear in ACM TACO, 2007

Activity Migration (2)
- Prefer AM with minimum performance and design overheads
- The OS can implement AM, BUT the shorter the migration period, the larger the temperature reduction*
*"Scheduling Issues on Thermally Constrained Processors", P. Michaud et al., IRISA TR 6006, 2006

Activity Migration (3)
At what layer to implement AM:
- shorter OS scheduling periods, and/or
- hardware support for faster activity migration
This talk: present initial results that quantify the performance implications of thread AM on a four-core system; understand the bottlenecks and determine possible ways to minimize AM overhead.

Others on Activity Migration
- Intel introduced the term "core hopping" in 2002
- Heo et al. (ISLPED 2003) proposed the term AM
- Intel's 80-core TSTP uses core hopping to deal with hotspots (EE Times, Jan. 2007)

Outline
- Multi-core Architecture and Activity Migration Model
- Experimental Framework
- Results
- Conclusions
- Future Work

Multi-core Architecture: private L1/L2
(Figure: four cores, each with private IL1/DL1 and a private L2 cache, all connected to main memory)

Activity Migration Model
- Assume a fixed migration frequency for all cores
- On every migration, for each running thread:
  - assign a new core to the thread
  - empty the current core's pipeline
  - flush DL1 and L2 to maintain coherence
    - simple and possibly low-performing: when more than one thread is running, all cores try to flush their L2 simultaneously
  - predictor state is not flushed
  - transfer the architectural state (PC, registers) to the new core
  - when the new core is ready, resume execution
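To make the sequence above concrete, here is a minimal C sketch of one migration step for a single thread. Every type and helper function is a hypothetical placeholder standing in for simulator internals; this is not the authors' code and not part of sim-alpha.

/* Sketch of one activity-migration step, following the slide above. */
enum cache_id { DL1_CACHE, L2_CACHE };
typedef struct { unsigned long pc; unsigned long regs[32]; } arch_state_t;

/* Placeholder hooks -- a real simulator would manipulate core state here. */
static void drain_pipeline(int core)                             { (void)core; }
static void flush_cache(int core, enum cache_id c)               { (void)core; (void)c; }
static void save_arch_state(int core, arch_state_t *s)           { (void)core; (void)s; }
static void restore_arch_state(int core, const arch_state_t *s)  { (void)core; (void)s; }
static void wait_until_flush_done(int core)                      { (void)core; }
static void resume_execution(int core)                           { (void)core; }

/* One migration: the thread leaves old_core and resumes on new_core. */
static void migrate_thread(int old_core, int new_core)
{
    arch_state_t state;

    drain_pipeline(old_core);           /* empty the current core's pipeline            */
    flush_cache(old_core, DL1_CACHE);   /* write back dirty lines                       */
    flush_cache(old_core, L2_CACHE);    /* keeps memory coherent with no extra hardware */
    /* the branch-predictor state is intentionally NOT flushed */

    save_arch_state(old_core, &state);  /* PC + architectural registers                 */
    restore_arch_state(new_core, &state);

    wait_until_flush_done(new_core);    /* the destination core may still be flushing   */
    resume_execution(new_core);
}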

Activity Migration Time Overhead
- Direct: empty the pipeline + flush caches + transfer registers + wait for the new core to finish flushing
- Indirect (cold effects): cold caches; stale predictor state OR other thread(s) updating the predictor
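Read loosely, the per-migration cost decomposes as follows (notation mine, not the authors'):

T_{\text{migration}} \;\approx\; \underbrace{t_{\text{drain}} + t_{\text{flush}} + t_{\text{transfer}} + t_{\text{wait}}}_{\text{direct}} \;+\; \underbrace{t_{\text{cold caches}} + t_{\text{predictor effects}}}_{\text{indirect}}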

AM Design Space Explored
- Workload size: 1-3 threads
- Number of cores: 4
- Migration period: 1 ms, 0.1 ms, 0.01 ms
- Effect of cold/warm L1/L2 caches and cold/warm branch predictor:
  - caches warm, predictor warm: baseline, no AM
  - caches warm, predictor cold ($W-PC): predictor implications
  - caches cold, predictor warm ($C-PW): cache implications
  - caches cold, predictor drowsy ($C-PD): activity migration
- AM patterns:
  - round-robin (RR) over all cores
  - swaps (SWAP) between two cores (for the two-thread workload)
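A toy illustration of the two migration patterns, assuming cores are numbered 0 through 3 (the numbering convention is mine):

/* Round-robin: each migration moves a thread to the next core, cycling over all cores. */
static int next_core_rr(int current_core, int num_cores)
{
    return (current_core + 1) % num_cores;
}

/* SWAP: two threads simply exchange the pair of cores they run on. */
static int next_core_swap(int current_core, int core_a, int core_b)
{
    return (current_core == core_a) ? core_b : core_a;
}

With a single thread migrating round-robin over four cores, each core is active only one period out of four, which is the temperature benefit AM is after.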

Experimental Framework
Simulator:
- Core model based on a modified version of sim-alpha
- The multi-core simulator is a parallel program using pthreads (Donald and Martonosi, CAL 2006)
Multi-core configuration:
- OOO core: 4-way superscalar, 15 pipeline stages, 3 GHz
- Icache/Dcache: 64 B blocks, 64 KB, 2-way, 1/3 cycle latency
- L2: 2 MB, 4-way, 10-cycle access latency
- Predictor: 16 KB combining predictor (bimodal + gshare)
- Shared memory bus: 8 bytes wide, 500 MHz; split-transaction bus between L2 and memory; arbitration based on OS thread scheduling order; 200-cycle minimum memory latency
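The simulated machine parameters above, gathered into a single illustrative C structure for reference; the structure and field names are mine, not sim-alpha's.

/* Simulated CMP configuration as listed on the slide (illustrative only). */
struct cmp_config {
    int cores;                /* 4 cores                          */
    int issue_width;          /* 4-way superscalar                */
    int pipeline_stages;      /* 15 stages                        */
    double core_ghz;          /* 3.0 GHz                          */
    int l1_size_kb;           /* 64 KB each for IL1 and DL1       */
    int l1_assoc;             /* 2-way                            */
    int l1_block_b;           /* 64 B blocks                      */
    int il1_lat, dl1_lat;     /* 1 and 3 cycles                   */
    int l2_size_mb;           /* 2 MB private L2 per core         */
    int l2_assoc;             /* 4-way                            */
    int l2_lat;               /* 10 cycles                        */
    int bpred_size_kb;        /* 16 KB bimodal + gshare           */
    int bus_width_bytes;      /* 8-byte shared split-transaction bus */
    int bus_mhz;              /* 500 MHz                          */
    int mem_lat_min_cycles;   /* 200 cycles minimum               */
};

static const struct cmp_config sim_cfg = {
    .cores = 4, .issue_width = 4, .pipeline_stages = 15, .core_ghz = 3.0,
    .l1_size_kb = 64, .l1_assoc = 2, .l1_block_b = 64, .il1_lat = 1, .dl1_lat = 3,
    .l2_size_mb = 2, .l2_assoc = 4, .l2_lat = 10,
    .bpred_size_kb = 16, .bus_width_bytes = 8, .bus_mhz = 500, .mem_lat_min_cycles = 200,
};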

Benchmarks, Methodology and Metrics
- SPEC CPU2000
- Fast-forward 1 billion instructions, then execute until the slowest thread commits at least 100 million instructions
- Performance metric: how much faster the multi-core runs the workload compared with running the same workload sequentially, normalized to the speedup of the CMP baseline with no AM
- Assumption: the AM register transfer has zero latency and no side effects on memory
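In symbols, consistent with the slide's definition (notation mine):

\text{Speedup} = \frac{T_{\text{sequential}}}{T_{\text{multi-core}}}, \qquad \text{Normalized Speedup} = \frac{\text{Speedup}_{\text{with AM}}}{\text{Speedup}_{\text{CMP baseline, no AM}}}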

Single Thread: crafty (AM: RR, 4 cores)
(Figure: normalized speedup for migration periods of 0.01 ms, 0.1 ms and 1 ms, for the $W-PC, $C-PW and $C-PD (AM) configurations)

Single Thread crafty: $C-PD (AM)
(Figure: normalized execution time broken down into Cold, Wait, Migration and Run components, for migration periods of 0.01 ms, 0.1 ms and 1 ms)

Single Thread crafty: $C-PD (AM)
(Figure: additional events per kilo-instruction (Added/KI) caused by AM, for migration periods of 1 ms, 0.1 ms and 0.01 ms: line predictor misses, branch predictor misses, DL1 misses, IL1 misses, L2 misses and trap cycles)

Various Single Threads: $C-PD (AM)
(Figure: normalized execution time broken down into cold effects, waiting time, migration time and running time for apsi, bzip, crafty, gap, gzip and mgrid, at migration periods of 0.01 ms, 0.1 ms and 1 ms)

Observations for 1-thread AM
- Performance loss is correlated with the migration period
- Cold effects usually dominate; flushing is not slow
- Both the L2 cache and the branch predictor are critical resources to keep warm
- A drowsy predictor is sufficient

Size, type of resource and working set
- Cold effects are more likely for programs with a large footprint in the predictor and the L2
- Size of resources in blocks:
  - 64 KB IL1 cache: 1024 blocks
  - 64 KB DL1 cache: 1024 blocks
  - 16 KB branch predictor: 64K entries, hybrid bimodal/gshare
  - 2 MB L2 cache: 64K blocks, long latency

Two Threads (RR on 4 cores / SWAP on 2 cores)
(Figure: normalized speedup for apsi-mgrid and bzip-gzip at migration periods of 0.01 ms, 0.1 ms and 1 ms, for the $W-PC, $C-PW, $C-PD (RR) and $C-PD (SWAP) configurations)

Two Threads (AM RR, 4 cores)
(Figure: normalized execution time broken down into Cold, Wait, Migration and Run for apsi-mgrid and bzip-gzip, at migration periods of 0.01 ms, 0.1 ms and 1 ms)

Possible Issue: non-uniform contribution of threads
(Figure: committed instructions per thread for apsi-mgrid and bzip-gzip at migration periods of 0.01 ms, 0.1 ms and 1 ms)

Observations for 2-thread AM
- Performance loss is correlated with the migration period
- Round-robin is not worse than swap (good news for temperature)
- Both cold effects and migration time matter: there is more bus contention because the two threads flush their L2 caches at the same time
- The L2 cache is the critical resource; a drowsy predictor works
- Minimal interference across threads
- Non-uniform contribution of threads across AM periods is an issue, affecting both the metric and scheduling

Three Threads (AM RR, 4 cores): mgrid-perl-mcf
(Figure: normalized speedup at migration periods of 0.01 ms, 0.1 ms and 1 ms, for the $W-PC, $C-PW and $C-PD (AM) configurations)

Three-Thread Distribution (AM RR, 4 cores)
(Figure: per-thread execution-time breakdown into Cold, Wait, Migration and Run for mgrid, perl and mcf, at migration periods of 0.1 ms and 1 ms)

Conclusions
- A simplistic implementation is sufficient for some cases, BUT definitely not for all: a low-overhead coherence mechanism is needed
- A drowsy predictor gives the same performance as a warm one
- Little interference across benchmarks
- Swap and round-robin show the same behavior
- Effects are more pronounced for benchmarks that stress shared resources (the bus)

Ongoing Work: Power Modelling
(Figure: temperature in °C vs. time in s, roughly 40-50 °C over 0.1-0.7 s, for the floorplan units of one core: L2 banks (L2_left, L2, L2_right), Icache, Dcache, Bpred, DTB, ITB, FP and integer mappers, queues, register files and execution units, LdStQ, plus the surrounding RIM_* regions)

Future Work
- Develop and evaluate low-overhead migration techniques for cache coherence
- Integrate the power/temperature model and evaluate sensor-based migration
- Build a synthetic simulator for exploring scheduling issues

Acknowledgements
- Funding for this work: Intel, Cyprus Research Promotion Foundation, HiPEAC
- T. Constantinou for the initial single-thread migration studies

Questions?