Making OpenVX Really Real Time

1 Making OpenVX Really Real Time
Ming Yang (1), Tanya Amert (1), Kecheng Yang (1,2), Nathan Otterness (1), James H. Anderson (1), F. Donelson Smith (1), and Shige Wang (3)
(1) The University of North Carolina at Chapel Hill; (2) Texas State University; (3) General Motors Research

2 700 ms

4 A new approach for graph scheduling

5 Shorter response time + Less capacity loss

6 1. State of the art 2. Our approach 3. Future work

7 Graph-based architecture. Portability to diverse hardware (GPU, FPGA, DSP). Does OpenVX really target real-time processing?
Figure: example OpenVX graph (Native Camera Control, four OpenVX nodes, Downstream Application Processing).

8 Does OpenVX really target real-time processing? 1. It lacks real-time concepts 2. Entire graphs = monolithic schedulable entities
Figure: example OpenVX graph (Native Camera Control, four OpenVX nodes, Downstream Application Processing).

9 Does OpenVX really target real-time processing? 1. It lacks real-time concepts 2. Entire graphs = monolithic schedulable entities
Figure: example graph with nodes A, B, C, D.

10 Does OpenVX really target real-time processing? 1. It lacks real-time concepts 2. Entire graphs = monolithic schedulable entities
Figure: monolithic scheduling timeline for the graph with nodes A, B, C, D (the nodes run as A, B, C, D, then A again; x-axis: time).

11 Prior Work: Coarse-grained scheduling. OpenVX nodes = schedulable entities [23, 51]
Figure: coarse-grained scheduling timeline in which each node of the graph (A, B, C, D) is scheduled as its own task (Task A, Task B, Task C, Task D; x-axis: time).

12 Prior Work: Coarse-grained scheduling. OpenVX nodes = schedulable entities [23, 51]
Remaining problems: 1. More parallelism remains to be exploited. 2. Suspension-oblivious analysis was applied, which causes capacity loss.

13 This Work: Fine-Grained Scheduling

14 1. Coarse-grained vs. fine-grained 2. Response-time bounds analysis 3. Case study

15 1. Coarse-grained vs. fine-grained 2. Response-time bounds analysis 3. Case study

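16 Figure: coarse-grained vs. fine-grained scheduling timelines (x-axis: time).
Coarse-Grained Scheduling: graph with nodes A, B, C, D, scheduled as Task A, Task B, Task C, Task D; the figure marks a suspension for GPU execution.
Fine-Grained Scheduling: graph with nodes A, C, E, F, G, D, scheduled as Task A, Task E, Task F (GPU execution), Task G, Task C, Task D.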

17 1. Coarse-grained vs. fine-grained 2. Response-time bounds analysis 3. Case study

18 Deriving Response-Time Bounds for a DAG*
Step 1: Schedule the nodes as sporadic tasks
Step 2: Compute bounds for every node
Step 3: Sum the bounds of nodes on the critical path
* C. Liu and J. Anderson, "Supporting Soft Real-Time DAG-based Systems on Multiprocessors with No Utilization Loss," in RTSS, 2013.
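Step 3 can be illustrated with a short Python sketch. It assumes the per-node bounds from Step 2 are already known and interprets the critical path as the source-to-sink path whose summed bounds are largest. The function name `dag_response_time_bound`, the edge list, and the numeric bounds are hypothetical and are not taken from the slides or the cited paper.

```python
from functools import lru_cache


def dag_response_time_bound(edges, node_bound):
    """Return the largest sum of per-node bounds over all paths in a DAG.

    edges: iterable of (u, v) pairs meaning u must finish before v starts.
    node_bound: dict mapping each node to its response-time bound (Step 2).
    """
    successors = {v: [] for v in node_bound}
    for u, v in edges:
        successors[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(v):
        # Bound of v plus the worst remaining path through its successors.
        tail = max((longest_from(w) for w in successors[v]), default=0.0)
        return node_bound[v] + tail

    # Starting from every node covers all sources; a path that starts at a
    # source is never shorter than the same path with its prefix removed.
    return max(longest_from(v) for v in node_bound)


# Hypothetical diamond-shaped graph and bounds (in ms), for illustration only.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
node_bound = {"A": 2, "B": 5, "C": 3, "D": 1}

print(dag_response_time_bound(edges, node_bound))  # prints 8.0 (path A, B, D)
```

Memoizing `longest_from` keeps the traversal linear in the number of nodes and edges, which matters little for small OpenVX graphs but avoids re-walking shared suffixes.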

19 Deriving Response-Time Bounds for a DAG
Figure: DAG with nodes A, B, C, D, E, F.

21 Deriving Response-Time Bounds for a DAG
Figure: nodes A, B, C, F, D are labeled CPU; node E is labeled GPU.
Need a response-time bound analysis for GPU tasks

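22 A system model of GPU tasks
$\tau_i = (C_i, T_i, B_i, H_i)$, where $C_i$ is the per-block worst-case workload, $T_i$ is the period, $B_i$ is the number of blocks, and $H_i$ is the number of threads per block (or block size).
Example: $\tau_1 = (3076, 6, 2, 1024)$, so $H_1 = 1024$.
Figure: blocks of $\tau_1$ executing on the SMs over time, annotated with $C_1$, $T_1$, and $H_1$.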
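As a minimal sketch, the four-parameter task tuple can be written as a small Python record. The class name `GpuTask` and the helper `job_workload` are hypothetical; the helper assumes that every block of a job may need up to the per-block worst-case workload, so one job needs at most the per-block workload times the number of blocks. That reading is an assumption, not a formula stated on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuTask:
    """GPU task tuple (C, T, B, H) from the system-model slide."""

    C: int  # per-block worst-case workload
    T: int  # period
    B: int  # number of blocks
    H: int  # number of threads per block (block size)

    def __post_init__(self):
        # All four parameters are counts or durations, so they must be positive.
        if min(self.C, self.T, self.B, self.H) <= 0:
            raise ValueError("all task parameters must be positive")

    def job_workload(self):
        # Assumption: each of the B blocks may need the full per-block workload C.
        return self.C * self.B


# The example task from the slide.
tau_1 = GpuTask(C=3076, T=6, B=2, H=1024)
print(tau_1.job_workload())  # prints 6152
```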

23 Response-Time Bounds Proof Sketch
1. We first show the necessity of a total utilization bound and intra-task parallelism via counterexamples.

24 Response-Time Bounds Proof Sketch
1. We first show the necessity of a total utilization bound and intra-task parallelism via counterexamples.
Figure: job releases and the resulting schedules without and with intra-task parallelism (x-axis: time).

25 Response-Time Bounds Proof Sketch
1. We first show the necessity of a total utilization bound and intra-task parallelism via counterexamples.
2. We then bound the unfinished workload from jobs released at or before $r_{k,j}$.
3. We prove the job finishes before $r_{k,j} + R_k$.
Figure: schedule of job $\tau_{k,j}$ on SM0 and SM1 over time, marking $r_{k,j}$ and $R_k$.

26 1. Coarse-grained vs. fine-grained 2. Response-time bounds analysis 3. Case study

27 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
Application: Histogram of Oriented Gradients (HOG)
Figure: the HOG pipeline at two granularities.
CPU+GPU Execution (Coarse-Grained): vxHOGCells node, vxHOGFeatures node.
GPU Execution (Fine-Grained): Resize Image, Compute Gradients, Compute Orientation Histograms, Normalize Orientation Histograms.

28 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
Application: Histogram of Oriented Gradients (HOG); 6 instances, 33 ms period, 30,000 samples
Platform: NVIDIA Titan V GPU + two eight-core Intel CPUs
Schedulers: G-EDF, G-FL (fair-lateness)

29 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
Figure: distribution of response times (x-axis: time; y-axis: % of samples). Reading example: 50% of samples have a response time of less than 60 ms. Left is better.

30 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
Legend: [1] Fine-grained (G-FL), [2] Coarse-grained (G-EDF), [3] Monolithic (G-EDF). FL: fair-lateness.
Table: Average Response Time (ms) and Maximum Response Time (ms) per configuration (values not preserved in this transcript).

31 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
Legend: [1] Fine-grained (G-FL), [2] Coarse-grained (G-EDF), [3] Monolithic (G-EDF). FL: fair-lateness.
Table: Average Response Time (ms) and Maximum Response Time (ms) per configuration (values not preserved in this transcript).
Callout (figure markers [1], [2]): Half the average response time.

32 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
Legend: [1] Fine-grained (G-FL), [2] Coarse-grained (G-EDF), [3] Monolithic (G-EDF). FL: fair-lateness.
Table: Average Response Time (ms) and Maximum Response Time (ms) per configuration (values not preserved in this transcript).
Callouts (figure markers [1], [2]): Half the average response time. One-third the maximum response time.

35 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
Legend: [1] Fine-grained (G-FL), [2] Coarse-grained (G-EDF), [3] Monolithic (G-EDF). FL: fair-lateness.
Table: Average Response Time (ms), Maximum Response Time (ms), and Analytical Bound (ms) per configuration (values not preserved in this transcript; some analytical-bound entries are N/A).

38 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
Legend: [1] Fine-grained (G-FL), [2] Coarse-grained (G-EDF), [3] Monolithic (G-EDF). FL: fair-lateness.
Table: Average Response Time (ms), Maximum Response Time (ms), and Analytical Bound (ms) per configuration (values not preserved in this transcript; two analytical-bound entries are N/A).
An alert driver takes 700 ms to react.

39 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
Legend: [1] Fine-grained (G-FL), [2] Coarse-grained (G-EDF), [3] Monolithic (G-EDF). FL: fair-lateness.
Table: Average Response Time (ms), Maximum Response Time (ms), and Analytical Bound (ms) per configuration (values not preserved in this transcript; two analytical-bound entries are N/A).
An alert driver takes 700 ms to react.
The fair-lateness-based scheduler is beneficial: it reduced node response times by up to 9.9%.
The overhead of supporting fine-grained scheduling was 14.15%.

40 Conclusions 1. Fine-grained scheduling 2. Response-time bounds analysis for GPU tasks 3. Case study

41 Future Work 1. Cycles in the graph 2. Other resource constraints 3. Schedulability studies

42 Thanks!
