Scalable Dynamic Task Scheduling on Adaptive Many-Cores



Vanchinathan Venkataramani, Anuj Pathania, Muhammad Shafique, Tulika Mitra, Jörg Henkel
CES Chair for Embedded Systems, ces.itec.kit.edu

Introduction: Many-Core Paradigm [Our Definition]
Multi-Core: few cores, memory bus, context switching, centralized scheduler.
Many-Core: dozens of cores, Network on Chip (NoC), one thread per core, distributed scheduler.

Introduction: Adaptive Many-Core
[Figure: High ILP Task, Low ILP Task, and TLP Task on cores labeled by issue width, e.g. 8-way.]

Introduction: Performance Maximization Problem
Adaptive many-cores allow speedup for both single-threaded and multi-threaded tasks.
Cores are limited, and multiple active tasks compete for them.
The goal is to maximize performance (total speedup): which task gets cores, and how many?
The problem can be solved optimally by dynamic programming, but only with a centralized scheduler.
Our contribution: a distributed solution.
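The centralized dynamic-programming baseline mentioned above can be illustrated with a small sketch. This is not taken from the slides: the function name `allocate_cores_dp`, the table layout `speedup[t][c]` (speedup of task t on c cores, with entry 0 for zero cores), and the choice to allow a task zero cores are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def allocate_cores_dp(
    speedup: Sequence[Sequence[float]], total_cores: int
) -> Tuple[float, List[int]]:
    """Maximize total speedup by splitting total_cores among tasks.

    speedup[t][c] is the (assumed known) speedup of task t on c cores.
    """
    num_tasks = len(speedup)
    # best[t][c]: best total speedup of the first t tasks using at most c cores.
    best = [[0.0] * (total_cores + 1) for _ in range(num_tasks + 1)]
    # choice[t][c]: cores given to task t-1 in that best solution.
    choice = [[0] * (total_cores + 1) for _ in range(num_tasks + 1)]

    for t in range(1, num_tasks + 1):
        table = speedup[t - 1]
        for c in range(total_cores + 1):
            best_val, best_k = float("-inf"), 0
            # Try every core count k the task's table covers.
            for k in range(min(c, len(table) - 1) + 1):
                val = best[t - 1][c - k] + table[k]
                if val > best_val:
                    best_val, best_k = val, k
            best[t][c] = best_val
            choice[t][c] = best_k

    # Walk the choice table backwards to recover the allocation.
    allocation = [0] * num_tasks
    c = total_cores
    for t in range(num_tasks, 0, -1):
        allocation[t - 1] = choice[t][c]
        c -= allocation[t - 1]
    return best[num_tasks][total_cores], allocation
```

All state lives in one table that a single scheduler must build and hold, which is the centralization the slide points to.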

Introduction: Varying Speedup
The speedup extracted by a task changes over its execution lifetime based on its requirements.
[Figure: Speedup vs. x Million Instructions for bzip and lbm.]
For malleable tasks, performance improves by moving cores from low-speedup tasks to high-speedup tasks.

Introduction: Motivational Example
Dynamically reallocating cores between two tasks every scheduling epoch yields higher throughput than a static equal core allocation.
[Figure: System Speedup vs. x Million Cycles, Static Equal vs. Dynamic Optimal. Throughput: .5 (Static), .8 (Dynamic).]

Introduction: Concavity in Speedup
[Figure, panels A and B: Speedup vs. Number of Assigned Cores for astar, bzip, applu, art, disparity, mser, blackscholes, swaptions.]
The speedup of a task is monotonically increasing and submodular, due to TLP and ILP saturation.
Without concavity there is no stability; allocations oscillate cyclically.

Algorithm: DPMS
We present an algorithm called DPMS.
It uses a Multi-Agent System.
It uses a regression-based performance-prediction model, also introduced in this work.
[Figure: Cores, Tasks, Gain.]
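To make the role of concavity concrete, here is a minimal sketch of core trading between tasks. It is an illustration only and not the DPMS protocol, whose details the slides do not give: the helper names, the one-core-per-trade rule, and the use of a global view to pick the trading pair are all assumptions.

```python
from typing import List, Sequence, Tuple


def marginal_gain(curve: Sequence[float], cores: int) -> float:
    """Extra speedup from one more core; zero once the curve is exhausted."""
    return curve[cores + 1] - curve[cores] if cores + 1 < len(curve) else 0.0


def marginal_loss(curve: Sequence[float], cores: int) -> float:
    """Speedup lost by giving up one core; infinite if the task has none."""
    return curve[cores] - curve[cores - 1] if cores > 0 else float("inf")


def rebalance(
    curves: Sequence[Sequence[float]], allocation: Sequence[int]
) -> Tuple[List[int], int]:
    """Move one core at a time from the cheapest giver to the best taker.

    Assumes every curve is increasing and concave, so each trade strictly
    raises total speedup and the loop terminates.
    """
    alloc = list(allocation)
    tasks = range(len(curves))
    trades = 0
    while True:
        taker = max(tasks, key=lambda t: marginal_gain(curves[t], alloc[t]))
        giver = min(tasks, key=lambda t: marginal_loss(curves[t], alloc[t]))
        gain = marginal_gain(curves[taker], alloc[taker])
        loss = marginal_loss(curves[giver], alloc[giver])
        # Stop when no trade improves the total (small tolerance for floats).
        if taker == giver or gain <= loss + 1e-12:
            return alloc, trades
        alloc[giver] -= 1
        alloc[taker] += 1
        trades += 1
```

With concave curves a task's gain from its next core never exceeds its loss from giving up its last one, so a core that has moved is not immediately traded back. Without that property the same rule can cycle, which is the oscillation the slide warns about.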

Algorithm: Convergence [Theorem ] and Optimal [Theorem ]
The system is guaranteed to converge to a solution in O(Tasks) steps.
The converged solution is optimal.

Regression-Based Performance Prediction for ILP Tasks
CPI = Steady-State CPI + Miss CPI
[Figure, panels A and B.]
Estimate Steady-State CPI.
Estimate Miss CPI.
Estimate Steady-State CPI on a different-size CPU allocation.
Estimate Miss CPI on a different-size CPU allocation.

Accuracy: % Error
Training Set and Testing Set (columns: Name, -, -)
Art 6..
Calculix 7..
Astar ..6
Gemsftd
Bwaves .77.
Gobmk 7..
Bzip 8..
Hmmer 9..5
Disparity .5.9
Lbm . 7.9
Equake
Mser
H6ref
Namd ..
Mcf .5.6
Povray 8..
Omnetpp . 7.9
Sift . 9.7
Perlbench .5.
Texture ..
Tracking . 8.7
Tonto .6.7
Svm 6.5.
Twolf 8. 5.
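The decomposition CPI = Steady-State CPI + Miss CPI suggests fitting one regression per component and summing the two predictions. The sketch below assumes a single-variable linear model that maps each component measured on the current allocation to its value on a target allocation. That choice of predictor, and the names `fit_line` and `predict_cpi`, are assumptions for illustration; the slides do not state the actual model.

```python
from typing import Sequence, Tuple

Line = Tuple[float, float]  # (slope, intercept)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Line:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_cpi(
    steady_model: Line, miss_model: Line, steady_now: float, miss_now: float
) -> float:
    """Predict CPI on the target allocation as steady-state plus miss CPI."""
    steady = steady_model[0] * steady_now + steady_model[1]
    miss = miss_model[0] * miss_now + miss_model[1]
    return steady + miss
```

A predicted speedup would then follow as the ratio of current CPI to predicted CPI, provided the clock frequency and instruction count are unchanged between allocations.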

Worst-Case Complexity: Dynamic Programming vs. DPMS
C: Number of Cores
T: Number of Tasks
Rows: Total Calculations, Per-Core Calculations, Communication, Space. Columns: Dynamic Programming, DPMS.
O (T C) O (T C) O (T C) O (T ) O (T) O (T )
Space: O (T C), O ()

Results: Experimental Setup

Results: Rounds to Convergence
[Figure: Number of Rounds Required (Rounds vs. x Million Cycles), Full System Reconfiguration vs. Partial System Reconfiguration.]

Results: Performance
On a 56-core system, DPMS achieves the same performance as Dynamic Programming.
[Figure: Speedup vs. Number of Tasks, Dynamic Programming vs. CGT-MAS.]


Results: Changes in Cores
[Figure: Normalized Per-Core Processing vs. Number of Cores. x decrease in per-core processing.]
[Figure: Normalized Per-Core Communication vs. Number of Cores. x increase in communication.]

Unanswered Question
Is an x decrease in processing overhead better than an x increase in communication overhead?
There also exists an efficient centralized greedy scheduler for the problem.

Thank You
Questions?
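The slides do not describe the centralized greedy scheduler, so the following is only one plausible sketch: it hands out cores one at a time to whichever task gains the most from its next core, using a heap. The name `allocate_cores_greedy` and the `curves[t][c]` layout (speedup of task t on c cores, starting at zero cores) are assumptions. For increasing, concave speedup curves this marginal-gain rule yields an optimal allocation.

```python
import heapq
from typing import List, Sequence


def allocate_cores_greedy(
    curves: Sequence[Sequence[float]], total_cores: int
) -> List[int]:
    """Give each core to the task with the largest marginal speedup gain."""
    allocation = [0] * len(curves)
    # Max-heap via negated gains: (-(gain of the next core), task index).
    heap = [
        (-(curve[1] - curve[0]), t)
        for t, curve in enumerate(curves)
        if len(curve) > 1
    ]
    heapq.heapify(heap)

    for _ in range(total_cores):
        if not heap:
            break  # every task has reached the end of its curve
        _, t = heapq.heappop(heap)
        allocation[t] += 1
        nxt = allocation[t] + 1
        if nxt < len(curves[t]):
            gain = curves[t][nxt] - curves[t][allocation[t]]
            heapq.heappush(heap, (-gain, t))
    return allocation
```

Each core costs one heap pop and at most one push, so the run time is O(C log T) for C cores and T tasks after an O(T) heap build.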


Distributed Scheduling for Many-Cores Using Cooperative Game Theory Anuj Pathania *, Vanchinathan Venkataramani, Muhammad Shafique *, Tulika Mitra, Jörg Henkel * * Chair of Embedded System (CES), Karlsruhe