Oversubscription on Multicore Processors

1 Oversubscription on Multicore Processors
Costin Iancu, Steven Hofmeyr, Filip Blagojević, Yili Zheng
Lawrence Berkeley National Laboratory
Parallel & Distributed Processing (IPDPS)

2 Motivation
* Increasingly parallel and asymmetric hardware (architecture + performance)
* Existing runtimes in competitive environments
* Partitioning vs. sharing on real hardware

3 Oversubscription
+ Compensate for data and control dependencies
+ Decrease resource contention
+ Improve CPU utilization
- Overhead for migration, context switching and lost hardware state (negligible)
- Slower synchronization due to increased contention

4 Setup
* MPI (MPICH), UPC, OpenMP
* Synchronization: poll + yield
* Linux.6.7,.6.8,.6.3
* Intel compiler with -O3
* NPB without load imbalances (separate paper)

Table. Test systems
Processor | Clock GHz | Cores | L1 data/instr | L2 cache | L3 cache | Memory/core | NUMA
Tigerton, Intel Xeon E | .6 | 6 (4x4) | 3K/3K | 4M / cores | none | G | no
Barcelona, AMD Opteron | | 6 (4x4) | 64K/64K | 5K / core | M / socket | 4G | socket
Nehalem, Intel Xeon E | .4 | 6 (x4x) | 3K/3K | 56K / core | 8M / socket | .5G / core | socket

[Figure: Barrier performance, AMD Barcelona. Barrier stats, 6 threads. Axis: inter-barrier time (ms). Series: UPC, NPB.]
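
The "poll + yield" synchronization listed in the setup can be illustrated with a small barrier in which waiting tasks repeatedly check a flag and give up the CPU between checks. This is a minimal sketch in Python, not the runtimes' actual implementation: the class name PollYieldBarrier and the helpers yield_cpu and run_demo are hypothetical, and Python threads do not run in parallel, so the sketch shows only the structure of the technique and not its performance.

```python
import os
import threading
import time


def yield_cpu():
    """Give up the CPU so another runnable task can make progress."""
    if hasattr(os, "sched_yield"):
        os.sched_yield()      # Linux and other POSIX systems
    else:
        time.sleep(0)         # portable fallback (assumption)


class PollYieldBarrier:
    """Sense-reversing barrier: waiters poll a flag and yield between polls."""

    def __init__(self, parties):
        self.parties = parties
        self.count = 0
        self.sense = False
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:
            my_sense = not self.sense
            self.count += 1
            if self.count == self.parties:
                # Last arrival resets the counter and releases the waiters.
                self.count = 0
                self.sense = my_sense
                return
        # Poll the release flag, yielding the CPU after each failed check.
        while self.sense != my_sense:
            yield_cpu()


def run_demo(num_threads, phases):
    """Run more threads than there may be cores; log (phase, thread id)."""
    barrier = PollYieldBarrier(num_threads)
    log = []
    log_lock = threading.Lock()

    def worker(tid):
        for phase in range(phases):
            with log_lock:
                log.append((phase, tid))
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

With oversubscription, the yield matters: a waiter that only spins would hold the core that a late-arriving task on the same core needs in order to reach the barrier.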

5 Benchmark Characteristics
[Figure: Barrier performance, AMD Barcelona. Y axis: time (microsec). Series: UPC, OpenMP, MPI, one per tasks-per-core setting.]
Figure. Barrier performance with oversubscription at different core counts (legend) on AMD Barcelona. Similar results are observed on all systems.
[Figure: Inter-barrier time (ms). Barrier stats, 6 threads. Series: UPC, NPB.]
Figure. Average time between two barriers and barrier count for the UPC [...]
[...] compiler which uses -O3 for the icc backend compiler, while the Fortran benchmarks were compiled with -fast, which includes -O3. Unless specified otherwise, OpenMP is compiled and executed with static scheduling.

6 Benchmark Characteristics
[Figures as on slide 5 (barrier performance and inter-barrier time on AMD Barcelona), shown together with Table. Test systems.]

7 UPC: UMA vs. NUMA
* sched_yield: default vs. POSIX
* Pinning affects variance ( % vs. %) and memory affinity
[Figure 3 panel: UPC Tigerton. Y axis: performance relative to 1/core.]
Figure 3. UMA oversubscription UPC. Performance is normalized to that of experiments with 1 task per core. Number of tasks per core can be 2, 4 or 8. SP requires a square number of threads. Overall workload performance varies from -% to %.
[Figure 4 panel: UPC Barcelona. Y axis: performance relative to 1/core.]
Figure 4. NUMA oversubscription UPC. Performance is normalized to that of experiments with 1 task per core. Number of tasks per core can be 2, 4 or 8. SP requires a square number of threads. Overall workload performance varies from -% to %.

8 UPC: UMA vs. NUMA
[Figures 3 and 4 as on slide 7.]
* sched_yield: default vs. POSIX
* Pinning affects variance ( % vs. %) and memory affinity
* Small overall effect (± % avg)
* EP: computationally intensive
* FT, IS: improvement up to 46 %
* SP, MG: problem size, granularity
* CG: degradation up to 44 %
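
The pinning experiments can be pictured as a fixed task-to-core mapping with several tasks per core. The sketch below assumes a round-robin placement, which the slides do not specify; pin_map and pin_current_process are hypothetical helper names, and os.sched_setaffinity is available on Linux only.

```python
import os


def pin_map(num_cores, tasks_per_core):
    """Round-robin placement (assumption): task i runs on core i % num_cores."""
    total_tasks = num_cores * tasks_per_core
    return {task: task % num_cores for task in range(total_tasks)}


def pin_current_process(core):
    """Pin the calling process to a single core. Returns True on success."""
    if not hasattr(os, "sched_setaffinity"):
        return False                       # not a Linux system
    if core not in os.sched_getaffinity(0):
        return False                       # core not available to this process
    os.sched_setaffinity(0, {core})        # pid 0 means the caller
    return True
```

For example, pin_map(16, 2) places 32 tasks on 16 cores, two per core. Each task would then call pin_current_process with its assigned core at startup, so it cannot migrate away from the memory it first touched.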

9 Balance
[Figure 5 panel: Balance, UPC Tigerton. Y axis: improvement over 1/core.]
Figure 5. Changes in balance on UMA, reported as the ratio between the lowest and highest user time across all cores compared to the 1/core setting.
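
The balance metric in Figure 5 can be written out as a short calculation. This sketch assumes that "compared to the 1/core setting" means dividing the oversubscribed balance by the baseline balance; the function names and the sample user times are hypothetical.

```python
def balance(user_times):
    """Ratio between the lowest and highest per-core user time (1.0 = perfect)."""
    if not user_times or max(user_times) <= 0:
        raise ValueError("need at least one positive user time")
    return min(user_times) / max(user_times)


def balance_change(base_times, oversub_times):
    """Balance with oversubscription relative to the 1/core baseline.

    Assumption: the comparison is a ratio, so values above 1 mean the
    load is more evenly spread than with one task per core.
    """
    return balance(oversub_times) / balance(base_times)


# Hypothetical per-core user times in seconds.
baseline = [8.0, 10.0]        # balance 0.8
oversubscribed = [9.0, 10.0]  # balance 0.9
change = balance_change(baseline, oversubscribed)  # about 1.125
```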

10 Cache Miss Rate (LL / L)
[Figure 6 panel: Cache miss rate, UPC Tigerton. Y axis: improvement over 1/core.]
Figure 6. Changes in the total number of cache misses per instructions, across all cores compared to 1/core. The EP miss rate is very low.

11 MPI and OpenMP
MPI:
* Overall decrease by %
* Caused by barrier overhead (cp. modified UPC)
[Figure 7 panel: MPI Tigerton. Y axis: performance relative to 1/core.]
Figure 7. UMA oversubscription MPI. Performance is normalized to that of experiments with 1 task per core. Number of tasks per core can be 2 or 4. Overall workload performance decreases by % to 8%.
[Figure 8 panels: OMP Nehalem; OMP Barcelona, KMP_BLOCKTIME= (series DEF/KMP); OMP Barcelona, KMP_BLOCKTIME=in (series DEF/INF). Y axis: performance relative to 1/core.]
Figure 8. NUMA oversubscription OpenMP. Performance is normalized to that of experiments with 1 task per core. Number of tasks per core can be 2, 4 or 8. Workload performance decreases [...] 4%.

12 MPI and OpenMP
[Figures 7 and 8 as on slide 11.]
MPI:
* Overall decrease by %
* Caused by barrier overhead (cp. modified UPC)
OpenMP:
* Slight degradation
* Best performance with OMP_STATIC
KMP_BLOCKTIME:
* Improvement up to % for fine-grained benchmarks
* Best overall performance
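
An oversubscribed OpenMP run with a chosen KMP_BLOCKTIME (the Intel OpenMP runtime setting named on these slides) comes down to two environment variables. The sketch below only builds the environment; the helper name openmp_env is hypothetical, and the blocktime values actually used in the experiments are not recoverable from the slides, so the values shown are assumptions.

```python
import os


def openmp_env(num_cores, threads_per_core, blocktime=None, base_env=None):
    """Build an environment for an OpenMP run with several threads per core.

    blocktime is passed through to KMP_BLOCKTIME, which the Intel runtime
    reads as a time in milliseconds or the string "infinite". None leaves
    the runtime default in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_cores * threads_per_core)
    if blocktime is not None:
        env["KMP_BLOCKTIME"] = str(blocktime)
    return env


# Two threads per core on a 16-core system, idle threads sleep at once.
env = openmp_env(16, 2, blocktime=0, base_env={})
# A launcher could then run: subprocess.run(["./benchmark"], env=env)
# where "./benchmark" stands in for an OpenMP binary (hypothetical path).
```

A short blocktime makes idle threads release the core quickly, which is the property that matters when several threads share it.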

13 Competitive Environments
* Sharing (best effort) vs. partitioning (isolated on sockets)
* One thread per core
* Overall 33 %/3 % improvement with sharing for UPC/OpenMP on Barcelona (CMP), but no difference for Nehalem (SMT)
* Better for applications with differing behavior
* Oversubscription...
  * improves benefits of sharing for CMP
  * changes relative order of performance for UPC, MPI, OpenMP
* Imbalanced sharing possible

14 Conclusion
Intuitively, oversubscription increases diversity in the system and decreases the potential for resource conflicts. All of our results and analysis indicate that the best predictor of application behavior when oversubscribing is the average inter-barrier interval. Applications with barriers executed every few ms are affected, while coarser grained applications are oblivious or their performance improves. We expect the benefits of oversubscription to be even more pronounced for irregular applications that suffer from load imbalance.
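
The predictor named in the conclusion, the average inter-barrier interval, is simple to compute from a trace of barrier timestamps. This is a minimal sketch: the function names are hypothetical, and the 5 ms cutoff is an assumed stand-in for "every few ms", not a threshold given in the slides.

```python
def average_inter_barrier_ms(barrier_times_ms):
    """Mean gap between consecutive barrier timestamps, in milliseconds."""
    if len(barrier_times_ms) < 2:
        raise ValueError("need at least two barrier timestamps")
    gaps = [later - earlier
            for earlier, later in zip(barrier_times_ms, barrier_times_ms[1:])]
    return sum(gaps) / len(gaps)


def likely_affected(barrier_times_ms, threshold_ms=5.0):
    """True if barriers are frequent enough that oversubscription may hurt.

    threshold_ms is an assumed cutoff for "every few ms".
    """
    return average_inter_barrier_ms(barrier_times_ms) <= threshold_ms


fine_grained = [0.0, 2.0, 4.0, 6.0]     # a barrier every 2 ms
coarse_grained = [0.0, 100.0, 200.0]    # a barrier every 100 ms
```

Here likely_affected(fine_grained) is True and likely_affected(coarse_grained) is False, matching the split between fine-grained and coarser grained applications described above.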

Oversubscription on Multicore Processors
Costin Iancu, Steven Hofmeyr, Filip Blagojević, Yili Zheng
Lawrence Berkeley National Laboratory, Berkeley, USA
{cciancu,shofmeyr,fblagojevic,yzheng}@lbl.gov

Abstract