MARACAS: A Real-Time Multicore VCPU Scheduling Framework

1 MARACAS: A Real-Time Multicore VCPU Scheduling Framework. Computer Science Department, Boston University

2 Overview

4 Motivation: Multicore platforms are gaining popularity in embedded and real-time systems: concurrent workload support, less circuit area, lower power consumption, lower cost. Complex on-chip memory hierarchies pose significant challenges for applications with real-time requirements.

6 Motivation: Shared cache contention: page coloring; hardware cache partitioning (Intel CAT); static vs. dynamic. Memory bus contention: bank-aware memory management; memory throttling.

8 Contribution: We proposed a foreground (reservation) + background (surplus) scheduling model that improves application performance, effectively reduces resource contention, and is well integrated with real-time scheduling algorithms. We proposed a new bus monitoring metric that accurately detects traffic.

10 Application: Imprecise computation / numeric integration. MPEG video decoding: mandatory to process I-frames, optional to process B- and P-frames to improve frame rate. Mixed-criticality systems running performance-demanding applications: machine learning, computer vision.

13 VCPU model (C, T), where C is the capacity and T is the period. Partitioned scheduling using RMS. Schedulability test: $\sum_{i=1}^{n} \frac{C_i}{T_i} \le n\left(\sqrt[n]{2} - 1\right)$
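The schedulability test above can be sketched in C as follows. This is a minimal illustration, not the framework's code: the `vcpu_params` struct and both function names are hypothetical. To avoid depending on the math library, the bound $U \le n(\sqrt[n]{2} - 1)$ is checked in the equivalent form $(1 + U/n)^n \le 2$.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type: one VCPU reservation, capacity C every period T. */
struct vcpu_params {
    unsigned C;
    unsigned T;
};

/* Total utilization: sum of C_i / T_i over the n VCPUs on one core. */
static double total_utilization(const struct vcpu_params *v, size_t n)
{
    double u = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(v[i].T > 0);
        u += (double)v[i].C / (double)v[i].T;
    }
    return u;
}

/*
 * Sufficient RMS test: U <= n * (2^(1/n) - 1).
 * Rearranged as (1 + U/n)^n <= 2, which needs only multiplication.
 */
static bool rms_schedulable(const struct vcpu_params *v, size_t n)
{
    assert(n > 0);
    double base = 1.0 + total_utilization(v, n) / (double)n;
    double p = 1.0;
    for (size_t i = 0; i < n; i++)
        p *= base;
    return p <= 2.0;
}
```

For two VCPUs with utilizations 10% and 30%, the total of 0.4 is below the two-task bound of about 0.828, so the test passes; with 50% and 40% it fails.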

17 A VCPU enters background mode upon depleting its budget (C). A core enters background mode when all of its VCPUs are in background mode. Background CPU Time (BGT): the time a VCPU runs when its core is in background mode. Background scheduling: schedule VCPUs when the core is in background mode, with a fair share of BGT amongst the VCPUs on the core.
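The mode rules above can be sketched as follows. The `vcpu` struct, its fields, and the "least BGT received runs next" policy are assumptions made for illustration; the slides only state that BGT is shared fairly, not how.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-VCPU bookkeeping. */
struct vcpu {
    unsigned C;       /* capacity (budget granted each period) */
    unsigned T;       /* period */
    unsigned budget;  /* budget remaining in the current period */
    unsigned bgt;     /* background CPU time received so far */
};

/* A VCPU is in background mode once its budget is depleted. */
static bool vcpu_in_background(const struct vcpu *v)
{
    return v->budget == 0;
}

/* The core is in background mode only when every VCPU on it is. */
static bool core_in_background(const struct vcpu *vcpus, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!vcpu_in_background(&vcpus[i]))
            return false;
    return true;
}

/*
 * Fair sharing, sketched as "run the VCPU that has received the least BGT".
 * Returns its index, or -1 if the core is still in foreground mode.
 */
static int pick_background_vcpu(const struct vcpu *vcpus, size_t n)
{
    assert(n > 0);
    if (!core_in_background(vcpus, n))
        return -1;
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (vcpus[i].bgt < vcpus[best].bgt)
            best = i;
    return (int)best;
}
```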

18 DRAM structure

22 Prior work [MemGuard] uses a Rate Metric: the number of DRAM accesses over a certain period. Bank-level parallelism; row buffers; sync effect.

24 Sync Effect: Each task reduces its access rate by a factor of (T-t)/T. Contention in [0, t] remains the same.

26 Latency Metric: requests = 3, occupancy = 10, so latency = 10 / 3 ≈ 3.3

28 Latency Metric: UNC_ARB_TRK_REQUEST.ALL (requests) counts all memory requests going to the memory controller request queue. UNC_ARB_TRK_OCCUPANCY.ALL (occupancy) counts cycles weighted by the number of pending requests in the queue. Average latency: $\mathit{latency} = \mathit{occupancy} / \mathit{requests}$
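As a sketch of how the average latency might be derived from the two counters, the function below divides the occupancy delta by the request delta between two snapshots. The `arb_counters` struct and `avg_mem_latency` are hypothetical names, and reading the counters from hardware is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the two uncore counters named above. */
struct arb_counters {
    uint64_t requests;   /* UNC_ARB_TRK_REQUEST.ALL */
    uint64_t occupancy;  /* UNC_ARB_TRK_OCCUPANCY.ALL */
};

/*
 * Average latency (cycles per request) between two snapshots.
 * Returns 0.0 when no requests were issued in the interval.
 */
static double avg_mem_latency(struct arb_counters prev, struct arb_counters now)
{
    uint64_t req = now.requests - prev.requests;
    uint64_t occ = now.occupancy - prev.occupancy;
    if (req == 0)
        return 0.0;
    return (double)occ / (double)req;
}
```

With 3 requests and an occupancy of 10 in the interval, this returns 10 / 3, matching the worked example on the earlier slide.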

31 Memory Throttling: When a core gets throttled, background scheduling is disabled. Latency threshold: MAX_MEM_LAT; if latency ≥ MAX_MEM_LAT then num_throttle++. Proportional throttling: every core is throttled at some point, and its throttled time is proportional to the core's DRAM access rate.
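A rough sketch of the two rules above, under stated assumptions: the threshold is passed in as a parameter, the comparison is taken to be "greater than or equal", and the total throttle time to distribute (`total_throttle`) is a hypothetical input, since the slides do not say how it is derived from `num_throttle`. Both function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts monitoring intervals whose average latency reached the threshold. */
static unsigned update_throttle_count(double latency, double max_mem_lat,
                                      unsigned num_throttle)
{
    if (latency >= max_mem_lat)
        num_throttle++;
    return num_throttle;
}

/*
 * Proportional throttling sketch: split total_throttle time units across
 * cores in proportion to each core's DRAM access count. Integer division
 * rounds down; very large inputs could overflow the multiplication.
 */
static void proportional_throttle(const uint64_t *accesses, uint64_t *throttle,
                                  size_t ncores, uint64_t total_throttle)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < ncores; i++)
        sum += accesses[i];
    for (size_t i = 0; i < ncores; i++)
        throttle[i] = sum ? total_throttle * accesses[i] / sum : 0;
}
```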

36 Predictable Migration: Run a migration thread with the highest priority on each core, pushing local VCPUs to other cores (starts from highest utilization ones). Only one migration thread is active during a migration period. Executing its entire capacity C does not lead to any other local VCPUs missing their deadlines. Constraint on C: $C \ge 2 E_{lock} + E_{struct}$

38 Load Balancing: For every core, define Slack-Per-VCPU (SPV): $SPV = \frac{1 - \sum_{i=1}^{n} (C_i / T_i)}{n}$. Example with two VCPUs of 10% and 30% utilization: $SPV = \frac{1 - (10\% + 30\%)}{2} = 30\%$
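The SPV definition translates directly into code. The sketch below uses a hypothetical `vcpu_params` struct and function name; it simply evaluates $(1 - \sum C_i/T_i)/n$ for one core.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: one VCPU reservation, capacity C every period T. */
struct vcpu_params {
    unsigned C;
    unsigned T;
};

/* SPV = (1 - sum(C_i / T_i)) / n for the n VCPUs on one core. */
static double slack_per_vcpu(const struct vcpu_params *v, size_t n)
{
    assert(n > 0);
    double u = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(v[i].T > 0);
        u += (double)v[i].C / (double)v[i].T;
    }
    return (1.0 - u) / (double)n;
}
```

For VCPUs (10, 100) and (30, 100) this evaluates to 0.30, the 30% of the worked example.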

42 Load Balancing: Balance the Background CPU Time (BGT) used by every VCPU across cores by equalizing the SPVs of all cores. This gives BGT fair sharing and a balanced memory throttling capability on each core.

45 Cache-Aware Scheduling: Static cache partitioning amongst cores (page coloring). New API: bool vcpu_create(uint C, uint T, uint cache); Extension of Load Balancing: the destination core must meet the cache requirement.
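One way to picture the load-balancing extension is a destination choice that first filters cores by cache requirement and then prefers the core with the most slack. Everything here is an assumption for illustration: the `core_info` struct, the function name, the "highest SPV wins" rule, and the unit of `cache` (the slides do not state whether it counts page colors, ways, or bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-core summary used by the balancer. */
struct core_info {
    double spv;      /* current Slack-Per-VCPU of the core */
    unsigned cache;  /* size of the core's static cache partition */
};

/*
 * Returns the index of the core with the highest SPV among those whose
 * cache partition meets the requirement, or -1 if none qualifies.
 */
static int pick_destination(const struct core_info *cores, size_t n,
                            unsigned cache_needed)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (cores[i].cache < cache_needed)
            continue;  /* destination must meet the cache requirement */
        if (best < 0 || cores[i].spv > cores[(size_t)best].spv)
            best = (int)i;
    }
    return best;
}
```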

46 MARACAS running on the following hardware platform:

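49 Rate vs. Latency: Micro-benchmark m_jump:

    enum { K = 1024, M = 1024 * K };
    byte array[6 * M];
    /* Stride of 8K through a 6M array, one cache line (64 bytes) per pass. */
    for (uint32 j = 0; j < 8 * K; j += 64)
        for (uint32 i = j; i < 6 * M; i += 8 * K) {
            /* < variable delay added here > */
            *(uint32 *)&array[i] = i;  /* one 4-byte store per iteration */
        }

Three m_jump instances (task1, task2, task3) run on separate cores without memory throttling, each with utilization (C/T) 50%. In each run, a different time delay is inserted in task1 and task2; task3 has no delay.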

53 Rate vs. Latency: Record the total memory bus traffic, the average memory request latency, and task3's instructions retired in foreground mode. Table (cell values not preserved): rows H, M, L; columns Bus Traffic (GB), Latency, task3 Instructions Retired ($\times 10^8$). Setting comparable thresholds: rate-based, derived from Bus Traffic (1128/time); latency-based, from Latency (228). The last column serves as a reference, showing the expected performance of task3 using the corresponding thresholds.

55 Rate vs. Latency: Repeat the experiment with memory throttling enabled and a fixed delay for task1/task2.

58 MARACAS uses background time to improve task performance; when the memory bus is contended, background time is disabled through throttling. MARACAS uses a latency metric to trigger throttling, outperforming the prior rate-based approach. MARACAS fairly distributes background time across cores, for both fairness and better throttling.
