A Server-based Approach for Predictable GPU Access Control


A Server-based Approach for Predictable GPU Access Control
Hyoseung Kim (University of California, Riverside), Pratyush Patel (Carnegie Mellon University), Shige Wang (General Motors R&D), Raj Rajkumar (Carnegie Mellon University)

Benefits of GPUs
- Recent safety-critical systems have high computational demands: long execution times make it hard to meet their deadlines
- General-Purpose Graphics Processing Units (GPUs) are 4-20x faster than a CPU for data-parallel, compute-intensive workloads*
- Many embedded multi-core processors have on-chip GPUs: NVIDIA TK1/TK2, NXP i.MX6, Samsung Exynos 9
* Govindaraju et al. High Performance Discrete Fourier Transforms on Graphics Processors. ACM/IEEE Conference on Supercomputing (SC), 2008.

GPU Execution Pattern
- A task accessing a GPU alternates on the CPU between normal execution segments and GPU access segments
- Within a GPU access segment (see the sketch below):
  1. Copy data to the GPU
  2. Trigger the GPU computation (GPU kernel execution)
  3. Receive the completion notification
  4. Copy results back to the CPU
- Need for predictable GPU access control:
  - To bound and minimize GPU access time
  - To achieve better task schedulability
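To make the four steps concrete, here is a minimal sketch of one GPU access segment using OpenCL, the API used on the platform later in the talk. The buffer and kernel names ("vec_add", sizes, argument layout) are illustrative assumptions, not code from the paper.

```c
/* One GPU access segment: copy-in, launch, wait, copy-out (OpenCL). */
#include <CL/cl.h>

void gpu_access_segment(cl_command_queue q, cl_kernel vec_add,
                        cl_mem d_in, cl_mem d_out,
                        const float *h_in, float *h_out, size_t n)
{
    /* 1. Copy input data to the GPU (blocking write). */
    clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, n * sizeof(float),
                         h_in, 0, NULL, NULL);

    /* 2. Trigger the GPU computation: set arguments and enqueue the kernel. */
    clSetKernelArg(vec_add, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(vec_add, 1, sizeof(cl_mem), &d_out);
    size_t global = n;
    clEnqueueNDRangeKernel(q, vec_add, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* 3. Wait for completion. No CPU work is needed during this interval,
     *    which is exactly what the server-based approach exploits by
     *    letting the waiting entity suspend instead of busy-wait. */
    clFinish(q);

    /* 4. Copy results back to the CPU (blocking read). */
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n * sizeof(float),
                        h_out, 0, NULL, NULL);
}
```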

Many of Today's COTS GPUs
1. Do not support preemption
   - Due to the high overhead expected for GPU context switching*
   - Some recent GPU architectures do support preemption (e.g., NVIDIA Pascal)
2. Handle GPU requests in a sequential manner
   - Concurrent execution of GPU kernels may result in unpredictable delay
   - Four identical kernels on an NVIDIA GTX 980: 97% slowdown on two of the kernels, and which kernels get delayed is unpredictable
3. Do not respect task priorities or the scheduling policy used
   - May result in unbounded priority inversion
* I. Tanasic et al. Enabling preemptive multiprogramming on GPUs. In International Symposium on Computer Architecture (ISCA), 2014.

Prior Approach: Synchronization-based Approach*
- Models each GPU access segment as a critical section
- Uses a real-time synchronization protocol to handle GPU requests: the task locks, executes its GPU access segment as a critical section, then unlocks
- Benefits:
  - Does not require any change to GPU device drivers
  - Existing schedulability analyses can be directly reused
* G. Elliott and J. Anderson. Globally scheduled real-time multiprocessor systems with GPUs. Real-Time Systems, 48(1):34-74, 2012.
  G. Elliott and J. Anderson. An optimal k-exclusion real-time locking protocol motivated by multi-GPU systems. Real-Time Systems, 49(2):140-170, 2013.
  G. Elliott et al. GPUSync: A framework for real-time GPU management. In IEEE Real-Time Systems Symposium (RTSS), 2013.
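A minimal sketch of what the synchronization-based approach looks like from the task's side: the entire GPU segment runs inside a critical section. The plain pthread mutex and the helper run_gpu_segment() are stand-ins for a real-time protocol such as MPCP and are assumptions for illustration, not the cited implementations.

```c
/* Synchronization-based access: the whole GPU segment is a critical section. */
#include <pthread.h>

static pthread_mutex_t gpu_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helper performing copy-in, kernel launch, wait, copy-out. */
static void run_gpu_segment(void) { /* see the OpenCL sketch above */ }

void gpu_access_with_lock(void)
{
    pthread_mutex_lock(&gpu_lock);    /* enter the critical section          */
    /* Under MPCP/FMLP the task runs here with boosted priority and waits on
     * the CPU (busy-waiting) for the GPU, so the core stays occupied even
     * while the actual work is done on the GPU. */
    run_gpu_segment();
    pthread_mutex_unlock(&gpu_lock);  /* leave the critical section          */
}
```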

Limitations
1. Busy waiting
   - Critical sections are executed entirely on the CPU
   - No suspension is allowed during the execution of a critical section (contrasted in the sketch below)
   - This is a common assumption of most real-time synchronization protocols, e.g., MPCP*, FMLP, OMLP
2. Long priority inversion
   - High-priority tasks may suffer from unnecessarily long priority inversion
   - Due to the priority boosting used by some protocols, e.g., MPCP and FMLP
* R. Rajkumar, L. Sha, and J. P. Lehoczky. Real-time synchronization protocols for multiprocessors. In IEEE Real-Time Systems Symposium (RTSS), 1988.
  A. Block et al. A flexible real-time locking protocol for multiprocessors. In IEEE Embedded and Real-Time Computing Systems and Applications (RTCSA), 2007.
  B. Brandenburg and J. Anderson. The OMLP family of optimal multiprocessor real-time locking protocols. Design Automation for Embedded Systems, 17(2):277-342, 2013.
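The sketch below contrasts the two waiting styles while a GPU kernel runs. The completion flag and semaphore are hypothetical names used only to illustrate the difference; neither variant is code from the paper.

```c
/* Busy-waiting vs. suspension while waiting for a GPU kernel to finish. */
#include <semaphore.h>
#include <stdatomic.h>

atomic_bool kernel_done;      /* set by the GPU completion path (assumed)     */
sem_t       kernel_done_sem;  /* posted by the same path; init with sem_init  */

/* Synchronization-based approach: the CPU spins until the GPU finishes,
 * occupying the core for the whole GPU segment. */
void wait_busy(void)
{
    while (!atomic_load(&kernel_done))
        ;  /* busy wait */
}

/* Server-based approach: the waiting entity suspends, freeing the core for
 * other tasks until completion is signaled. */
void wait_suspend(void)
{
    sem_wait(&kernel_done_sem);  /* blocks without consuming CPU time */
}
```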

Our Contributions
- A server-based approach for predictable GPU access control
  - Addresses the limitations of the synchronization-based approach
  - Yields CPU utilization benefits
  - Reduces task response time
- Prototype implementation on an NXP i.MX6 running Linux
- Can also be used for other types of computational accelerators, such as a digital signal processor (DSP)

Outline
- Introduction and motivation
- Server-based approach for predictable GPU access control
  - System model
  - GPU server and analysis
  - Comparison with the synchronization-based approach
- Evaluation
- Conclusions

System Model
- A single general-purpose GPU device, shared by tasks in a sequential, non-preemptive manner
- Sporadic tasks with constrained deadlines, under partitioned fixed-priority preemptive task scheduling
- Task τ_i = (C_i, T_i, D_i, G_i, η_i)
  - C_i: sum of the WCETs of all normal execution segments
  - T_i: minimum inter-arrival time
  - D_i: relative deadline
  - G_i: maximum accumulated length of all GPU segments
  - η_i: number of GPU access segments
- GPU segment G_{i,j} = (G^e_{i,j}, G^m_{i,j}), where G^e_{i,j} is the GPU kernel execution time and G^m_{i,j} is the miscellaneous GPU operations (e.g., data copies) of the j-th GPU segment
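For reference, a minimal sketch of the task model as a C data structure; the field names mirror the notation above and are illustrative assumptions only.

```c
/* Task model parameters from the system model (illustrative sketch). */
typedef struct {
    double Ge;            /* GPU kernel execution time of this segment       */
    double Gm;            /* misc. GPU operations (copies, triggering, etc.) */
} gpu_segment_t;

typedef struct {
    double         C;     /* sum of WCETs of all normal execution segments   */
    double         T;     /* minimum inter-arrival time                      */
    double         D;     /* relative deadline (constrained: D <= T)         */
    double         G;     /* max. accumulated length of all GPU segments     */
    unsigned       eta;   /* number of GPU access segments                   */
    gpu_segment_t *seg;   /* the eta GPU segments G_{i,1..eta}               */
    int            prio;  /* fixed priority                                  */
    int            core;  /* assigned CPU core (partitioned scheduling)      */
} task_t;
```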

Server-based Approach
- GPU server task
  - Handles GPU access requests from other tasks on their behalf
  - Allows tasks to suspend whenever no CPU intervention is needed
  - Runs at the highest priority
- Protocol (sketched below):
  1. A task τ_i sends its GPU access request (data, command, code) to the GPU server task through a shared memory region, and suspends
  2. The server orders pending requests in the GPU request queue by task priority
  3. The server executes the highest-priority GPU request on the GPU; the server itself suspends during CPU-inactive time
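A minimal sketch of the GPU server loop under these rules. The request type, queue helpers, and semaphores are hypothetical names used for illustration; the actual implementation in the paper is built on Linux/RK.

```c
/* GPU server task: serve queued GPU requests in task-priority order. */
#include <semaphore.h>

typedef struct {
    int    task_prio;                /* priority of the requesting task      */
    void (*run_segment)(void *arg);  /* copy-in, launch, wait, copy-out      */
    void  *arg;                      /* pointers into the shared region      */
    sem_t *done;                     /* posted to wake the requesting task   */
} gpu_request_t;

extern sem_t          pending;                       /* counts queued requests */
extern gpu_request_t *dequeue_highest_priority(void);/* priority-ordered queue */

void gpu_server_task(void)
{
    for (;;) {
        /* Suspend until some task enqueues a request; while idle the server
         * does not interfere with normal CPU execution. */
        sem_wait(&pending);

        /* Pick the request of the highest-priority waiting task. */
        gpu_request_t *req = dequeue_highest_priority();

        /* Execute the GPU segment on the requester's behalf; the server
         * suspends while the GPU kernel runs (no CPU intervention needed). */
        req->run_segment(req->arg);

        /* Notify completion so the requesting task can resume. */
        sem_post(req->done);
    }
}
```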

Timing Behavior of the GPU Server
- GPU server overhead ε, incurred for:
  - Receiving a request and waking up the GPU server
  - Checking the request queue
  - Notifying the completion of the request
- Maximum handling time of a GPU request of τ_i by the GPU server = total waiting time + GPU segment length + server overhead (incurred twice per segment, i.e., 2ε)

Task Response Time with the GPU Server
- Case 1: task τ_i and the GPU server are on the same CPU core
  - GPU segment handling time
  - Self-suspension by higher-priority tasks*
  - Interference from the GPU server (e.g., GPU memory copies performed by the server)
- Case 2: task τ_i and the GPU server are on different cores
  - No interference from the GPU server (and miscellaneous GPU operations)
* J.-J. Chen et al. Many suspensions, many problems: A review of self-suspending tasks in real-time systems. Technical Report 854, Department of Computer Science, TU Dortmund, 2016.
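To show the shape of such an analysis, here is a simplified response-time recurrence of the standard form for partitioned fixed-priority scheduling, with higher-priority self-suspension modeled as release jitter. This is only an illustrative sketch under those assumptions; the paper's actual analysis additionally bounds the waiting time at the GPU server and, in Case 1, the server's own CPU interference.

```latex
% Illustrative sketch: suspension-as-jitter recurrence for task \tau_i,
% with J_h \le D_h - C_h and server overhead \epsilon charged twice per
% GPU segment. Not the paper's exact analysis.
W_i^{k+1} \;=\; C_i \;+\; G_i \;+\; 2\,\eta_i\,\epsilon
  \;+\; \sum_{\tau_h \in hp(i)} \left\lceil \frac{W_i^{k} + J_h}{T_h} \right\rceil C_h
```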

Example under the Synchronization-based Approach (MPCP)
- Worst-case response time of task τ_h: 9 time units
- [Timeline figure: tasks τ_h, τ_m, and τ_l on two CPU cores, each issuing a GPU request; each task busy-waits on its core with priority boosting while the GPU serves G_l, G_h, and G_m in turn. Legend: normal execution segment, misc. GPU operation, GPU computation, busy waiting; time axis 0-12.]

Example under the Server-based Approach
- Worst-case response time of task τ_h: 6 + 4ε (ε = GPU server overhead)
- [Timeline figure: the GPU server runs at the highest priority; τ_h, τ_m, and τ_l suspend after issuing their GPU requests while the server executes the requests of τ_l, τ_h, and then τ_m on the GPU. Legend: normal execution segment, misc. GPU operation, GPU computation, server overhead; time axis 0-12.]

Implementation
- SABRE Lite board (NXP i.MX6 SoC)
  - Four ARM Cortex-A9 cores running at 1 GHz
  - Vivante GC2000 GPU with OpenCL
- NXP Embedded Linux, kernel version 3.14.52
- Linux/RK version 1.6 patch*
- Measured GPU server overhead ε: 44.97 μs in total
* Linux/RK: http://rtml.ece.cmu.edu/redmine/projects/rk/

Case Study
- Motivated by the software system of CMU's self-driving car*
- Workload: a workzone recognition algorithm, two other GPU-using tasks, and two CPU-only tasks
* J. Wei et al. Towards a viable autonomous driving research platform. In IEEE Intelligent Vehicles Symposium (IV), 2013.
  J. Lee et al. Kernel-based traffic sign tracking to improve highway workzone recognition for reliable autonomous driving. In IEEE International Conference on Intelligent Transportation Systems (ITSC), 2013.

Task Execution Timeline
- Synchronization-based approach (using MPCP) vs. server-based approach
- Worst-case response time of a GPU-using task: 520.68 ms (synchronization-based) vs. 219.09 ms (server-based)

Schedulability Experiments
- Purpose: to explore the impact of the two approaches on task schedulability
- 10,000 randomly generated tasksets

Results (1): Schedulability w.r.t. the Percentage of GPU-using Tasks
- The server-based approach performs better in most cases (with realistic parameters)

Results (2): Schedulability w.r.t. the GPU Server Overhead
- 44.97 μs was the overhead measured on our platform
- The server-based approach does not dominate the synchronization-based approach

Conclusions
- Server-based GPU access control
  - Motivated by the limitations of the synchronization-based approach: busy waiting and long priority inversion
  - Implemented with an acceptable overhead
  - Significant improvement over the synchronization-based approach in most cases
- Future directions
  - Improvement of the analysis (worst-case waiting time calculation)
  - Comparison with other synchronization protocols, e.g., a recent extension of FMLP+ that allows self-suspension within critical sections
  - The GPU server has central knowledge of all GPU requests: efficient co-scheduling of GPU kernels, GPU power management, etc.

Thank You

A Server-based Approach for Predictable GPU Access Control
Hyoseung Kim (University of California, Riverside), Pratyush Patel (Carnegie Mellon University), Shige Wang (General Motors R&D), Raj Rajkumar (Carnegie Mellon University)