Initial Results on the Performance Implications of Thread Migration on a Chip Multi-Core

3rd HiPEAC Workshop, IBM, Haifa, 17-4-2007
Initial Results on the Performance Implications of Thread Migration on a Chip Multi-Core
P. Michaud, L. He, D. Fetis, C. Ioannou, P. Charalambous and A. Seznec
Department of Computer Science, Nicosia / IRISA-INRIA, Campus de Beaulieu, 35042 Rennes Cedex, France

Motivation
The computer industry faces conflicting challenges:
- the market calls for high-performance cores
- the need for low cost, low power and short time to market
One industry response is single-chip multi-core processors. Multi-cores can increase performance through thread- and program-level parallelism.

Temperature Wall
But parallel execution can result in:
- overall higher power consumption
- a non-uniform power-density map
- temperature increase
(Figure: thermal map produced by ATMI using a simulated power-density map, http://www.irisa.fr/caps/projects/atmi/)

Harness Temperature (1)
- Improve packaging and cooling (exotic/expensive)
- Temperature is no longer a packaging-only issue: it matters for circuit, microarchitecture and OS design

Harness Temperature (2)
DTM techniques:
- turn off the clock
- operate cores at a lower voltage and frequency
to ensure correct and/or efficient operation. Cost: lower performance.
Aim of this work: solve the temperature problem with minimal performance loss using Activity Migration (AM).

Activity Migration (1)
- Leverage multi-cores to distribute activity and temperature over the entire chip
- Basic idea: transfer a thread from a hot core to a cold core (swap threads)
- Example: 4 cores and 4 threads*
*"A Study of Thread Migration in Temperature-Constrained Multicores", P. Michaud et al., to appear in ACM TACO, 2007

Activity Migration (2)
- Prefer AM with minimum performance and design overheads
- The OS can implement AM, BUT the shorter the migration period, the larger the temperature reduction*
*"Scheduling Issues on Thermally Constrained Processors", P. Michaud et al., IRISA TR 6006, 2006

Activity Migration (3)
At what layer to implement AM:
- shorter OS scheduling periods, and/or
- hardware support for faster activity migration
This talk: present initial results that quantify the performance implications of thread AM on a four-core system; understand the bottlenecks and determine possible ways to minimize AM overhead.

Others on Activity Migration
- Intel introduced the term "core hopping" in 2002
- Heo et al. (ISLPED 2003) proposed the term AM
- Intel's 80-core TSTP uses core hopping to deal with hotspots (EE Times, Jan. 2007)

Outline
- Multi-core Architecture and Activity Migration Model
- Experimental Framework
- Results
- Conclusions
- Future Work

Multi-core Architecture: private L1/L2
(Figure: four cores, each with private IL1/DL1 and a private L2 cache, all connected to main memory)

Activity Migration Model
- Assume a fixed migration frequency for all cores
- On every migration, for each running thread:
  - assign a new core to the thread
  - empty the current core's pipeline
  - flush DL1 and L2 to maintain coherence
    - simple and possibly low-performing: when more than one thread is running, all cores try to flush their L2 simultaneously
  - predictor state is not flushed
  - transfer the architectural state (PC, registers) to the new core
  - when the new core is ready, resume execution
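To make the sequence above concrete, here is a minimal C sketch of one migration step for a single thread. Every type and helper function is a hypothetical placeholder standing in for simulator internals; this is not the authors' code and not part of sim-alpha.

/* Sketch of one activity-migration step, following the slide above. */
enum cache_id { DL1_CACHE, L2_CACHE };
typedef struct { unsigned long pc; unsigned long regs[32]; } arch_state_t;

/* Placeholder hooks -- a real simulator would manipulate core state here. */
static void drain_pipeline(int core)                             { (void)core; }
static void flush_cache(int core, enum cache_id c)               { (void)core; (void)c; }
static void save_arch_state(int core, arch_state_t *s)           { (void)core; (void)s; }
static void restore_arch_state(int core, const arch_state_t *s)  { (void)core; (void)s; }
static void wait_until_flush_done(int core)                      { (void)core; }
static void resume_execution(int core)                           { (void)core; }

/* One migration: the thread leaves old_core and resumes on new_core. */
static void migrate_thread(int old_core, int new_core)
{
    arch_state_t state;

    drain_pipeline(old_core);           /* empty the current core's pipeline            */
    flush_cache(old_core, DL1_CACHE);   /* write back dirty lines                       */
    flush_cache(old_core, L2_CACHE);    /* keeps memory coherent with no extra hardware */
    /* the branch-predictor state is intentionally NOT flushed */

    save_arch_state(old_core, &state);  /* PC + architectural registers                 */
    restore_arch_state(new_core, &state);

    wait_until_flush_done(new_core);    /* the destination core may still be flushing   */
    resume_execution(new_core);
}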

Activity Migration Time Overhead
- Direct: empty the pipeline + flush caches + transfer registers + wait for the new core to finish flushing
- Indirect (cold effects): cold caches; stale predictor state OR other thread(s) updating the predictor
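Read loosely, the per-migration cost decomposes as follows (notation mine, not the authors'):

T_{\text{migration}} \;\approx\; \underbrace{t_{\text{drain}} + t_{\text{flush}} + t_{\text{transfer}} + t_{\text{wait}}}_{\text{direct}} \;+\; \underbrace{t_{\text{cold caches}} + t_{\text{predictor effects}}}_{\text{indirect}}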

AM Design Space Explored
- Workload size: 1-3 threads
- Number of cores: 4
- Migration period: 1 ms, 0.1 ms, 0.01 ms
- Effect of cold/warm L1/L2 caches and cold/warm branch predictor:
  - caches warm, predictor warm: baseline, no AM
  - caches warm, predictor cold ($W-PC): predictor implications
  - caches cold, predictor warm ($C-PW): cache implications
  - caches cold, predictor drowsy ($C-PD): activity migration
- AM patterns:
  - round-robin (RR) over all cores
  - swaps (SWAP) between two cores (for the two-thread workload)
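A toy illustration of the two migration patterns, assuming cores are numbered 0 through 3 (the numbering convention is mine):

/* Round-robin: each migration moves a thread to the next core, cycling over all cores. */
static int next_core_rr(int current_core, int num_cores)
{
    return (current_core + 1) % num_cores;
}

/* SWAP: two threads simply exchange the pair of cores they run on. */
static int next_core_swap(int current_core, int core_a, int core_b)
{
    return (current_core == core_a) ? core_b : core_a;
}

With a single thread migrating round-robin over four cores, each core is active only one period out of four, which is the temperature benefit AM is after.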

Experimental Framework
Simulator:
- Core model based on a modified version of sim-alpha
- The multi-core simulator is a parallel program using pthreads (Donald and Martonosi, CAL 2006)
Multi-core configuration:
- OOO core: 4-way superscalar, 15 pipeline stages, 3 GHz
- Icache/Dcache: 64 B blocks, 64 KB, 2-way, 1/3 cycle latency
- L2: 2 MB, 4-way, 10-cycle access latency
- Predictor: 16 KB combining predictor (bimodal + gshare)
- Shared memory bus: 8 bytes wide, 500 MHz; split-transaction bus between L2 and memory; arbitration based on OS thread scheduling order; 200-cycle minimum memory latency
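The simulated machine parameters above, gathered into a single illustrative C structure for reference; the structure and field names are mine, not sim-alpha's.

/* Simulated CMP configuration as listed on the slide (illustrative only). */
struct cmp_config {
    int cores;                /* 4 cores                          */
    int issue_width;          /* 4-way superscalar                */
    int pipeline_stages;      /* 15 stages                        */
    double core_ghz;          /* 3.0 GHz                          */
    int l1_size_kb;           /* 64 KB each for IL1 and DL1       */
    int l1_assoc;             /* 2-way                            */
    int l1_block_b;           /* 64 B blocks                      */
    int il1_lat, dl1_lat;     /* 1 and 3 cycles                   */
    int l2_size_mb;           /* 2 MB private L2 per core         */
    int l2_assoc;             /* 4-way                            */
    int l2_lat;               /* 10 cycles                        */
    int bpred_size_kb;        /* 16 KB bimodal + gshare           */
    int bus_width_bytes;      /* 8-byte shared split-transaction bus */
    int bus_mhz;              /* 500 MHz                          */
    int mem_lat_min_cycles;   /* 200 cycles minimum               */
};

static const struct cmp_config sim_cfg = {
    .cores = 4, .issue_width = 4, .pipeline_stages = 15, .core_ghz = 3.0,
    .l1_size_kb = 64, .l1_assoc = 2, .l1_block_b = 64, .il1_lat = 1, .dl1_lat = 3,
    .l2_size_mb = 2, .l2_assoc = 4, .l2_lat = 10,
    .bpred_size_kb = 16, .bus_width_bytes = 8, .bus_mhz = 500, .mem_lat_min_cycles = 200,
};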

Benchmarks, Methodology and Metrics
- SPEC CPU2000
- Fast-forward 1 billion instructions, then execute until the slowest thread commits at least 100 million instructions
- Performance metric: how much faster the multi-core runs the workload compared with running the same workload sequentially, normalized to the speedup of the CMP baseline with no AM
- Assumption: the AM register transfer has zero latency and no side effects on memory
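In symbols, consistent with the slide's definition (notation mine):

\text{Speedup} = \frac{T_{\text{sequential}}}{T_{\text{multi-core}}}, \qquad \text{Normalized Speedup} = \frac{\text{Speedup}_{\text{with AM}}}{\text{Speedup}_{\text{CMP baseline, no AM}}}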

Single Thread: crafty (AM: RR, 4 cores)
(Figure: normalized speedup for migration periods of 0.01 ms, 0.1 ms and 1 ms, for the $W-PC, $C-PW and $C-PD (AM) configurations)

Single Thread crafty: $C-PD (AM)
(Figure: normalized execution time broken down into Cold, Wait, Migration and Run components, for migration periods of 0.01 ms, 0.1 ms and 1 ms)

Single Thread crafty: $C-PD (AM)
(Figure: additional events per kilo-instruction (Added/KI) caused by AM, for migration periods of 1 ms, 0.1 ms and 0.01 ms: line predictor misses, branch predictor misses, DL1 misses, IL1 misses, L2 misses and trap cycles)

Various Single Threads: $C-PD (AM)
(Figure: normalized execution time broken down into cold effects, waiting time, migration time and running time for apsi, bzip, crafty, gap, gzip and mgrid, at migration periods of 0.01 ms, 0.1 ms and 1 ms)

Observations for 1-thread AM
- Performance loss is correlated with the migration period
- Cold effects usually dominate; flushing is not slow
- Both the L2 cache and the branch predictor are critical resources to keep warm
- A drowsy predictor is sufficient

Size, type of resource and working set
- Cold effects are more likely for programs with a large footprint in the predictor and the L2
- Size of resources in blocks:
  - 64 KB IL1 cache: 1024 blocks
  - 64 KB DL1 cache: 1024 blocks
  - 16 KB branch predictor: 64K entries, hybrid bimodal/gshare
  - 2 MB L2 cache: 64K blocks, long latency

Two Threads (RR on 4 cores / SWAP on 2 cores)
(Figure: normalized speedup for apsi-mgrid and bzip-gzip at migration periods of 0.01 ms, 0.1 ms and 1 ms, for the $W-PC, $C-PW, $C-PD (RR) and $C-PD (SWAP) configurations)

Two Threads (AM RR, 4 cores)
(Figure: normalized execution time broken down into Cold, Wait, Migration and Run for apsi-mgrid and bzip-gzip, at migration periods of 0.01 ms, 0.1 ms and 1 ms)

Possible Issue: non-uniform contribution of threads
(Figure: committed instructions per thread for apsi-mgrid and bzip-gzip at migration periods of 0.01 ms, 0.1 ms and 1 ms)

Observations for 2-thread AM
- Performance loss is correlated with the migration period
- Round-robin is not worse than swap (good news for temperature)
- Both cold effects and migration time matter: there is more bus contention because the two threads flush their L2 caches at the same time
- The L2 cache is the critical resource; a drowsy predictor works
- Minimal interference across threads
- Non-uniform contribution of threads across AM periods is an issue, affecting both the metric and scheduling

Three Threads (AM RR, 4 cores): mgrid-perl-mcf
(Figure: normalized speedup at migration periods of 0.01 ms, 0.1 ms and 1 ms, for the $W-PC, $C-PW and $C-PD (AM) configurations)

Three-Thread Distribution (AM RR, 4 cores)
(Figure: per-thread execution-time breakdown into Cold, Wait, Migration and Run for mgrid, perl and mcf, at migration periods of 0.1 ms and 1 ms)

Conclusions
- A simplistic implementation is sufficient for some cases, BUT definitely not for all: a low-overhead coherence mechanism is needed
- A drowsy predictor gives the same performance as a warm one
- Little interference across benchmarks
- Swap and round-robin show the same behavior
- Effects are more pronounced for benchmarks that stress shared resources (the bus)

Ongoing Work: Power Modelling
(Figure: temperature in °C vs. time in s, roughly 40-50 °C over 0.1-0.7 s, for the floorplan units of one core: L2 banks (L2_left, L2, L2_right), Icache, Dcache, Bpred, DTB, ITB, FP and integer mappers, queues, register files and execution units, LdStQ, plus the surrounding RIM_* regions)

Future Work
- Develop and evaluate low-overhead migration techniques for cache coherence
- Integrate the power/temperature model and evaluate sensor-based migration
- Build a synthetic simulator for exploring scheduling issues

Acknowledgements
- Funding for this work: Intel, Cyprus Research Promotion Foundation, HiPEAC
- T. Constantinou for the initial single-thread migration studies

Questions?