S7105 ADAS/AD CHALLENGES: GPU SCHEDULING & SYNCHRONIZATION. Venugopala Madumbu, NVIDIA GTC D

Size: px
Start display at page:

Download "S7105 ADAS/AD CHALLENGES: GPU SCHEDULING & SYNCHRONIZATION. Venugopala Madumbu, NVIDIA GTC D"

Transcription

1 S7105 ADAS/AD CHALLENGES: GPU SCHEDULING & SYNCHRONIZATION Venugopala Madumbu, NVIDIA GTC D

2 ADVANCED DRIVING ASSIST SYSTEMS (ADAS) & AUTONOMOUS DRIVING (AD) High Compute Workloads Mapped to GPU 2

3 ADAS/AD Requirements & Challenges Real-Time Behavior Determinism Freedom from Interference Performance Maximum Throughput Minimal Latency Priority of Functionalities Multi-Core CPU GPU/DSP/HWA 3

4 ADAS/AD WORKLOADS Challenges Illustrated Scenario#1 Standalone Exec GL Workload X msec Scenario#2 Standalone Exec CUDA Workload Scenario#3 Concurrent Exec GL Workload CUDA Workload > (X+Y) msec If so, How to Achieve determinism Achieve Freedom from interference Prioritize one Workload over other While also having maximum throughput minimum latency CUDA Workload GL Workload Y msec Time Shared GPU Execution Y msec X msec 4

5 GPU IN TEGRA High Level Tegra SoC Block Diagram CPU Host GPU Engines Other Clients (ISP, Display, etc.) CPU submits job/work to GPU GPU runs asynchronously to CPU GPU Memory Interface Memory Controller GPU has its own hardware scheduler (Host) It switches between workloads without CPU involvement DRAM 5

6 GPU SCHEDULING Concepts Channel independent stream of work on the GPU Command Push Buffer Command buffer written by Software and read by Hardware Channel Switching Save/restore GPU state on a channel switch Semaphores/SyncPoints Synchronization mechanism for events within the GPU Time Slice How long a GPU executes commands of a channel before a channel switch Run-list An ordered list of channels that SW wants the GPU to execute 6

7 GPU SCHEDULING Timesharing by Channel Switching Channel switching occurs when any ONE of the following happens: Time slice expires Engine runs out of work (no more commands) Blocked on a semaphore Channel Switch time = Drain Time + Save/Restore time Preemption can reduce Channel Switch times drastically App1 App2 App3 App4 Timesliced Round-Robin GPU GPU Occupancy Time 7

8 GPU SCHEDULING Preemption 8

9 GPU SCHEDULING Channel Switching with Time Slice Scenarios Channel 1 Time slice Channel Channel 1 1 Time slice Channel Channel 1 1 Time slice Channel Switch Timeout Channel Switch Timeout 1. Channel finishes before time slice expires Context switch to next channel Channel Reset 2. Channel preemption Stop all commands in pipeline Wait for engines to idle Higher Context Switch time 3. Channel Reset Engine could not idle and context could not save before channel switch timeout Callback to notify kernel of channel reset event 9

10 CHALLENGE REVISTED How can we achieve both? Real-Time behavior: Determinism Freedom from Interference Priority of Functionalities Performance: Maximum Throughput Minimal Latency 10

11 GPU SYNCHRONIZATION & SCHEDULING Software Control 1. User Driver Level (GPU Synchronization Approach) Syncpoints/Semaphores for Synchronization Through EglStreams, EGLSync etc 2. Kernel Driver Level (GPU Priority Scheduling Approach) Run-List Engineering How long channel runs Order of Channel execution 11

12 GPU SYNCHRONIZATION APPROACH No Synchronization Case Kernel launch GPU Semaphore GPU Priority GPU Task GPU Task Latency due to Concurrent Execution GPU Task CPU CPU Task CPU Task CPU Task msec 12

13 GPU SYNCHRONIZATION APPROACH Synchronization on CPU: Not good for GPU Kernel launch GPU Semaphore GPU Priority GPU Task GPU Task GPU Task CPU CPU Task CPU Task CPU Task msec 13

14 GPU SYNCHRONIZATION APPROACH Synchronization on GPU: No Context Switches Kernel launch GPU Semaphore GPU Priority GPU Task Delayed Start GPU Task GPU Task Determinism Freedom from Interference Priority of Functionalities CPU CPU Task CPU Task CPU Task msec 14

15 GPU PRIORITY SCHEDULING APPROACH Hypothetical Example TASK PRIORITY FPS WORST CASE EXECUTION TIME (WCET) H1 High 60 9ms H1 M1 Medium 30 4ms M1 M2 Medium 30 4ms M2 L1 Low/Best Effort 30 10ms L1 15

16 GPU PRIORITY SCHEDULING APPROACH Engineered Run-list and Time Slice Ensuring FPS and Latency H1 H1 H1 (Max Exec Time = 9 ms) M1 (Max Exec Time = 4 ms) M1 M1 Time slice = 9 ms Time slice = 3 ms M2 M2 L1 M2 (Max Exec Time = 4 ms) Time slice = 3 ms L1 (Max Exec Time = 10 ms) Time slice = 1 ms Run-List Ensured not >16ms for 60fps operation Work on GPU Time 16

17 GPU PRIORITY SCHEDULING APPROACH Reduce Latency for GPU Work Completion Ensure timeslice is long enough to complete work Ensure work is continually submitted and also well ahead in time To Avoid GPU idle time Unnecessary context switches 17

18 GPU SCHEDULING Best Practices to Keep GPU Busy Submit work in advance So the GPU has some work to execute at any point of time Try to reduce/eliminate work dependencies Have contingency plan for work overload If feedback shows over budget, submit work few frames ahead and spread Plan for worst case scenario Deal with GPU reset case esp for the Low priority cases GL Robustness Extensions 18

19 CONCLUSION GPU Synchronization & Scheduling Approaches Real-Time behavior: Determinism Freedom from Interference Priority of Functionalities Performance: Maximum Throughput Minimal Latency 19

20 Scott Whitman, NVIDIA Vladislav Buzov, NVIDIA Amit Rao, NVIDIA Yogesh Kini, NVIDIA ACKNOWLEDGEMENTS GTC Instructor led Lab:: L7105 EGLSTREAMS : INTEROPERABILITY OF CAMERA, CUDA AND OPENGL 11 TH MAY :30-11:30AM LL21D 20

21 Q&A 21

22 THANK YOU

EGLSTREAMS: INTEROPERABILITY FOR CAMERA, CUDA AND OPENGL. Debalina Bhattacharjee Sharan Ashwathnarayan

EGLSTREAMS: INTEROPERABILITY FOR CAMERA, CUDA AND OPENGL. Debalina Bhattacharjee Sharan Ashwathnarayan 53023 - EGLSTREAMS: INTEROPERABILITY FOR CAMERA, CUDA AND OPENGL Debalina Bhattacharjee Sharan Ashwathnarayan Tegra SOC and typical use-cases Why Interops EGLStream and Its Key Features Agenda Examples

More information

Running Multiple Workloads on a GPU A UX Oriented Approach. Yuval Sarna Graphics Software GameFly Streaming

Running Multiple Workloads on a GPU A UX Oriented Approach. Yuval Sarna Graphics Software GameFly Streaming Running Multiple Workloads on a GPU A UX Oriented Approach Yuval Sarna Graphics Software Expert @ GameFly Streaming Agenda Sharing the GPU We all like to Play Introduction to GPU Scheduling Proposed GPU

More information

Comp 310 Computer Systems and Organization

Comp 310 Computer Systems and Organization Comp 310 Computer Systems and Organization Lecture #9 Process Management (CPU Scheduling) 1 Prof. Joseph Vybihal Announcements Oct 16 Midterm exam (in class) In class review Oct 14 (½ class review) Ass#2

More information

Deadline-based Scheduling for GPU with Preemption Support

Deadline-based Scheduling for GPU with Preemption Support Deadline-based Scheduling for GPU with Preemption Support N. Capodieci, R. Cavicchioli, M. Bertogna, A. Paramakuru. University of Modena and Reggio Emilia NVIDIA Corp. 12/12/2018 RTSS 2018, NASHVILLE 1

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Aperiodic Servers (Issues)

Aperiodic Servers (Issues) Aperiodic Servers (Issues) Interference Causes more interference than simple periodic tasks Increased context switching Cache interference Accounting Context switch time Again, possibly more context switches

More information

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition Chapter 6: CPU Scheduling Silberschatz, Galvin and Gagne 2013 Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Thread Scheduling Multiple-Processor Scheduling Real-Time

More information

Building Real-Time Professional Visualization Solutions on GPUs. Kristof Denolf Samuel Maroy Ronny Dewaele

Building Real-Time Professional Visualization Solutions on GPUs. Kristof Denolf Samuel Maroy Ronny Dewaele Building Real-Time Professional Visualization Solutions on GPUs Kristof Denolf Samuel Maroy Ronny Dewaele Page 2 Outline Barco s professional visualization solutions The need for performance portability

More information

Sync Points in the Intel Gfx Driver. Jesse Barnes Intel Open Source Technology Center

Sync Points in the Intel Gfx Driver. Jesse Barnes Intel Open Source Technology Center Sync Points in the Intel Gfx Driver Jesse Barnes Intel Open Source Technology Center 1 Agenda History and other implementations Other I/O layers - block device ordering NV_fence, ARB_sync EGL_native_fence_sync,

More information

S CUDA on Xavier

S CUDA on Xavier S8868 - CUDA on Xavier Anshuman Bhat CUDA Product Manager Saikat Dasadhikari CUDA Engineering 29 th March 2018 1 CUDA ECOSYSTEM 2018 CUDA DOWNLOADS IN 2017 3,500,000 CUDA REGISTERED DEVELOPERS 800,000

More information

GTC Interaction Simplified. Gesture Recognition Everywhere: Gesture Solutions on Tegra

GTC Interaction Simplified. Gesture Recognition Everywhere: Gesture Solutions on Tegra GTC 2013 Interaction Simplified Gesture Recognition Everywhere: Gesture Solutions on Tegra eyesight at a Glance Touch-free technology providing an enhanced user experience. Easy and intuitive control

More information

NVIDIA AI BRAIN OF SELF DRIVING AND HD MAPPING. September 13, 2016

NVIDIA AI BRAIN OF SELF DRIVING AND HD MAPPING. September 13, 2016 NVIDIA AI BRAIN OF SELF DRIVING AND HD MAPPING September 13, 2016 AI FOR AUTONOMOUS DRIVING MAPPING KALDI LOCALIZATION DRIVENET Training on DGX-1 NVIDIA DGX-1 NVIDIA DRIVE PX 2 Driving with DriveWorks

More information

CS2506 Quick Revision

CS2506 Quick Revision CS2506 Quick Revision OS Structure / Layer Kernel Structure Enter Kernel / Trap Instruction Classification of OS Process Definition Process Context Operations Process Management Child Process Thread Process

More information

Benchmarking Real-World In-Vehicle Applications

Benchmarking Real-World In-Vehicle Applications Benchmarking Real-World In-Vehicle Applications NVIDIA GTC 2015-03-18 m y c a b l e GmbH Michael Carstens-Behrens Gartenstraße 10 24534 Neumuenster, Germany +49 4321 559 56-55 +49 4321 559 56-10 mcb@mycable.de

More information

THE LEADER IN VISUAL COMPUTING

THE LEADER IN VISUAL COMPUTING MOBILE EMBEDDED THE LEADER IN VISUAL COMPUTING 2 TAKING OUR VISION TO REALITY HPC DESIGN and VISUALIZATION AUTO GAMING 3 BEST DEVELOPER EXPERIENCE Tools for Fast Development Debug and Performance Tuning

More information

SLICING THE WORKLOAD MULTI-GPU OPENGL RENDERING APPROACHES

SLICING THE WORKLOAD MULTI-GPU OPENGL RENDERING APPROACHES SLICING THE WORKLOAD MULTI-GPU OPENGL RENDERING APPROACHES INGO ESSER NVIDIA DEVTECH PROVIZ OVERVIEW Motivation Tools of the trade Multi-GPU driver functions Multi-GPU programming functions Multi threaded

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou ( Zhejiang University

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou (  Zhejiang University Operating Systems (Fall/Winter 2018) CPU Scheduling Yajin Zhou (http://yajin.org) Zhejiang University Acknowledgement: some pages are based on the slides from Zhi Wang(fsu). Review Motivation to use threads

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University (SDSU) Posted:

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition Chapter 6: CPU Scheduling Silberschatz, Galvin and Gagne 2013 Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Thread Scheduling Multiple-Processor Scheduling Real-Time

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

LECTURE 3:CPU SCHEDULING

LECTURE 3:CPU SCHEDULING LECTURE 3:CPU SCHEDULING 1 Outline Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time CPU Scheduling Operating Systems Examples Algorithm Evaluation 2 Objectives

More information

NuttX Realtime Programming

NuttX Realtime Programming NuttX RTOS NuttX Realtime Programming Gregory Nutt Overview Interrupts Cooperative Scheduling Tasks Work Queues Realtime Schedulers Real Time == == Deterministic Response Latency Stimulus Response Deadline

More information

Lecture 5 / Chapter 6 (CPU Scheduling) Basic Concepts. Scheduling Criteria Scheduling Algorithms

Lecture 5 / Chapter 6 (CPU Scheduling) Basic Concepts. Scheduling Criteria Scheduling Algorithms Operating System Lecture 5 / Chapter 6 (CPU Scheduling) Basic Concepts Scheduling Criteria Scheduling Algorithms OS Process Review Multicore Programming Multithreading Models Thread Libraries Implicit

More information

Multimedia Systems 2011/2012

Multimedia Systems 2011/2012 Multimedia Systems 2011/2012 System Architecture Prof. Dr. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de Sitemap 2 Hardware

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

Vulkan Timeline Semaphores

Vulkan Timeline Semaphores Vulkan line Semaphores Jason Ekstrand September 2018 Copyright 2018 The Khronos Group Inc. - Page 1 Current Status of VkSemaphore Current VkSemaphores require a strict signal, wait, signal, wait pattern

More information

CS A331 Programming Language Concepts

CS A331 Programming Language Concepts CS A331 Programming Language Concepts Lecture 12 Alternative Language Examples (General Concurrency Issues and Concepts) March 30, 2014 Sam Siewert Major Concepts Concurrent Processing Processes, Tasks,

More information

Multiprocessor and Real- Time Scheduling. Chapter 10

Multiprocessor and Real- Time Scheduling. Chapter 10 Multiprocessor and Real- Time Scheduling Chapter 10 Classifications of Multiprocessor Loosely coupled multiprocessor each processor has its own memory and I/O channels Functionally specialized processors

More information

Preview. Process Scheduler. Process Scheduling Algorithms for Batch System. Process Scheduling Algorithms for Interactive System

Preview. Process Scheduler. Process Scheduling Algorithms for Batch System. Process Scheduling Algorithms for Interactive System Preview Process Scheduler Short Term Scheduler Long Term Scheduler Process Scheduling Algorithms for Batch System First Come First Serve Shortest Job First Shortest Remaining Job First Process Scheduling

More information

Uniprocessor Scheduling. Basic Concepts Scheduling Criteria Scheduling Algorithms. Three level scheduling

Uniprocessor Scheduling. Basic Concepts Scheduling Criteria Scheduling Algorithms. Three level scheduling Uniprocessor Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Three level scheduling 2 1 Types of Scheduling 3 Long- and Medium-Term Schedulers Long-term scheduler Determines which programs

More information

Uniprocessor Scheduling. Chapter 9

Uniprocessor Scheduling. Chapter 9 Uniprocessor Scheduling Chapter 9 1 Aim of Scheduling Assign processes to be executed by the processor(s) Response time Throughput Processor efficiency 2 3 4 Long-Term Scheduling Determines which programs

More information

Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW

Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW 2 SO FAR talked about in-kernel building blocks: threads memory IPC drivers

More information

Reference Model and Scheduling Policies for Real-Time Systems

Reference Model and Scheduling Policies for Real-Time Systems ESG Seminar p.1/42 Reference Model and Scheduling Policies for Real-Time Systems Mayank Agarwal and Ankit Mathur Dept. of Computer Science and Engineering, Indian Institute of Technology Delhi ESG Seminar

More information

CMPS 111 Spring 2003 Midterm Exam May 8, Name: ID:

CMPS 111 Spring 2003 Midterm Exam May 8, Name: ID: CMPS 111 Spring 2003 Midterm Exam May 8, 2003 Name: ID: This is a closed note, closed book exam. There are 20 multiple choice questions and 5 short answer questions. Plan your time accordingly. Part I:

More information

CS 856 Latency in Communication Systems

CS 856 Latency in Communication Systems CS 856 Latency in Communication Systems Winter 2010 Latency Challenges CS 856, Winter 2010, Latency Challenges 1 Overview Sources of Latency low-level mechanisms services Application Requirements Latency

More information

DEEP DIVE INTO DYNAMIC PARALLELISM

DEEP DIVE INTO DYNAMIC PARALLELISM April 4-7, 2016 Silicon Valley DEEP DIVE INTO DYNAMIC PARALLELISM SHANKARA RAO THEJASWI NANDITALE, NVIDIA CHRISTOPH ANGERER, NVIDIA 1 OVERVIEW AND INTRODUCTION 2 WHAT IS DYNAMIC PARALLELISM? The ability

More information

Course Syllabus. Operating Systems

Course Syllabus. Operating Systems Course Syllabus. Introduction - History; Views; Concepts; Structure 2. Process Management - Processes; State + Resources; Threads; Unix implementation of Processes 3. Scheduling Paradigms; Unix; Modeling

More information

A Server-based Approach for Predictable GPU Access Control

A Server-based Approach for Predictable GPU Access Control A Server-based Approach for Predictable GPU Access Control Hyoseung Kim * Pratyush Patel Shige Wang Raj Rajkumar * University of California, Riverside Carnegie Mellon University General Motors R&D Benefits

More information

CSE 153 Design of Operating Systems

CSE 153 Design of Operating Systems CSE 153 Design of Operating Systems Winter 2018 Midterm Review Midterm in class on Monday Covers material through scheduling and deadlock Based upon lecture material and modules of the book indicated on

More information

Project 2: CPU Scheduling Simulator

Project 2: CPU Scheduling Simulator Project 2: CPU Scheduling Simulator CSCI 442, Spring 2017 Assigned Date: March 2, 2017 Intermediate Deliverable 1 Due: March 10, 2017 @ 11:59pm Intermediate Deliverable 2 Due: March 24, 2017 @ 11:59pm

More information

GstShark profiling: a real-life example. Michael Grüner - David Soto -

GstShark profiling: a real-life example. Michael Grüner - David Soto - GstShark profiling: a real-life example Michael Grüner - michael.gruner@ridgerun.com David Soto - david.soto@ridgerun.com Introduction Michael Grüner Technical Lead at RidgeRun Digital signal processing

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

vs. GPU Performance Without the Answer University of Virginia Computer Engineering g Labs

vs. GPU Performance Without the Answer University of Virginia Computer Engineering g Labs Where is the Data? Why you Cannot Debate CPU vs. GPU Performance Without the Answer Chris Gregg and Kim Hazelwood University of Virginia Computer Engineering g Labs 1 GPUs and Data Transfer GPU computing

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan {merritt.alex,abhishek.verma}@gatech.edu {vishakha,ada,schwan}@cc.gtaech.edu

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 10 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Chapter 6: CPU Scheduling Basic Concepts

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Tasks. Task Implementation and management

Tasks. Task Implementation and management Tasks Task Implementation and management Tasks Vocab Absolute time - real world time Relative time - time referenced to some event Interval - any slice of time characterized by start & end times Duration

More information

Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms

Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms Consistent * Complete * Well Documented * Easy to Reuse * Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms Waqar Ali University of Kansas Lawrence, USA wali@ku.edu Heechul Yun University

More information

Exam TI2720-C/TI2725-C Embedded Software

Exam TI2720-C/TI2725-C Embedded Software Exam TI2720-C/TI2725-C Embedded Software Wednesday April 16 2014 (18.30-21.30) Koen Langendoen In order to avoid misunderstanding on the syntactical correctness of code fragments in this examination, we

More information

Introduction to Operating System. Dr. Aarti Singh Professor MMICT&BM MMU

Introduction to Operating System. Dr. Aarti Singh Professor MMICT&BM MMU Introduction to Operating System Dr. Aarti Singh Professor MMICT&BM MMU Contents Today's Topic: Introduction to Operating Systems We will learn 1. What is Operating System? 2. What OS does? 3. Structure

More information

OPERATING SYSTEMS CS3502 Spring Processor Scheduling. Chapter 5

OPERATING SYSTEMS CS3502 Spring Processor Scheduling. Chapter 5 OPERATING SYSTEMS CS3502 Spring 2018 Processor Scheduling Chapter 5 Goals of Processor Scheduling Scheduling is the sharing of the CPU among the processes in the ready queue The critical activities are:

More information

Efficient Video Processing on Embedded GPU

Efficient Video Processing on Embedded GPU Efficient Video Processing on Embedded GPU Tobias Kammacher Armin Weiss Matthias Frei Institute of Embedded Systems High Performance Multimedia Research Group Zurich University of Applied Sciences (ZHAW)

More information

Properties of Processes

Properties of Processes CPU Scheduling Properties of Processes CPU I/O Burst Cycle Process execution consists of a cycle of CPU execution and I/O wait. CPU burst distribution: CPU Scheduler Selects from among the processes that

More information

CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems

More information

Non-Blocking Write Protocol NBW:

Non-Blocking Write Protocol NBW: Non-Blocking Write Protocol NBW: A Solution to a Real-Time Synchronization Problem By: Hermann Kopetz and Johannes Reisinger Presented By: Jonathan Labin March 8 th 2005 Classic Mutual Exclusion Scenario

More information

Real-Time Architectures 2003/2004. Resource Reservation. Description. Resource reservation. Reinder J. Bril

Real-Time Architectures 2003/2004. Resource Reservation. Description. Resource reservation. Reinder J. Bril Real-Time Architectures 2003/2004 Resource reservation Reinder J. Bril 03-05-2004 1 Resource Reservation Description Example Application domains Some issues Concluding remark 2 Description Resource reservation

More information

Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems

Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems Ming Yang,, Tanya Amert, Joshua Bakita, James H. Anderson, F. Donelson Smith All image sources and references are provided

More information

Summary. Ø Introduction Ø Project Goals Ø A Real-time Data Streaming Framework Ø Evaluations Ø Conclusions

Summary. Ø Introduction Ø Project Goals Ø A Real-time Data Streaming Framework Ø Evaluations Ø Conclusions Summary Ø Introduction Ø Project Goals Ø A Real-time Data Streaming Framework Ø Evaluations Ø Conclusions Introduction Java 8 has introduced Streams and lambda expressions to support the efficient processing

More information

CEC 450 Real-Time Systems

CEC 450 Real-Time Systems CEC 450 Real-Time Systems Lecture 6 Accounting for I/O Latency September 28, 2015 Sam Siewert A Service Release and Response C i WCET Input/Output Latency Interference Time Response Time = Time Actuation

More information

Lecture 2 Process Management

Lecture 2 Process Management Lecture 2 Process Management Process Concept An operating system executes a variety of programs: Batch system jobs Time-shared systems user programs or tasks The terms job and process may be interchangeable

More information

Real-Time Support for GPU. GPU Management Heechul Yun

Real-Time Support for GPU. GPU Management Heechul Yun Real-Time Support for GPU GPU Management Heechul Yun 1 This Week Topic: Real-Time Support for General Purpose Graphic Processing Unit (GPGPU) Today Background Challenges Real-Time GPU Management Frameworks

More information

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit

More information

Integrating Concurrency Control and Energy Management in Device Drivers. Chenyang Lu

Integrating Concurrency Control and Energy Management in Device Drivers. Chenyang Lu Integrating Concurrency Control and Energy Management in Device Drivers Chenyang Lu Overview Ø Concurrency Control: q Concurrency of I/O operations alone, not of threads in general q Synchronous vs. Asynchronous

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

CSCI-GA Operating Systems Lecture 3: Processes and Threads -Part 2 Scheduling Hubertus Franke

CSCI-GA Operating Systems Lecture 3: Processes and Threads -Part 2 Scheduling Hubertus Franke CSCI-GA.2250-001 Operating Systems Lecture 3: Processes and Threads -Part 2 Scheduling Hubertus Franke frankeh@cs.nyu.edu Processes Vs Threads The unit of dispatching is referred to as a thread or lightweight

More information

CMPS 111 Spring 2013 Prof. Scott A. Brandt Midterm Examination May 6, Name: ID:

CMPS 111 Spring 2013 Prof. Scott A. Brandt Midterm Examination May 6, Name: ID: CMPS 111 Spring 2013 Prof. Scott A. Brandt Midterm Examination May 6, 2013 Name: ID: This is a closed note, closed book exam. There are 23 multiple choice questions, 6 short answer questions. Plan your

More information

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization

More information

REAL-TIME OVERVIEW SO FAR MICHAEL ROITZSCH. talked about in-kernel building blocks:

REAL-TIME OVERVIEW SO FAR MICHAEL ROITZSCH. talked about in-kernel building blocks: Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW 2 SO FAR talked about in-kernel building blocks: threads memory IPC drivers

More information

Interrupts and Time. Real-Time Systems, Lecture 5. Martina Maggio 28 January Lund University, Department of Automatic Control

Interrupts and Time. Real-Time Systems, Lecture 5. Martina Maggio 28 January Lund University, Department of Automatic Control Interrupts and Time Real-Time Systems, Lecture 5 Martina Maggio 28 January 2016 Lund University, Department of Automatic Control Content [Real-Time Control System: Chapter 5] 1. Interrupts 2. Clock Interrupts

More information

CS 370 Operating Systems

CS 370 Operating Systems NAME S.ID. # CS 370 Operating Systems Mid-term Example Instructions: The exam time is 50 minutes. CLOSED BOOK. 1. [24 pts] Multiple choice. Check one. a. Multiprogramming is: An executable program that

More information

Uniprocessor Scheduling. Aim of Scheduling

Uniprocessor Scheduling. Aim of Scheduling Uniprocessor Scheduling Chapter 9 Aim of Scheduling Response time Throughput Processor efficiency Types of Scheduling Long-Term Scheduling Determines which programs are admitted to the system for processing

More information

Uniprocessor Scheduling. Aim of Scheduling. Types of Scheduling. Long-Term Scheduling. Chapter 9. Response time Throughput Processor efficiency

Uniprocessor Scheduling. Aim of Scheduling. Types of Scheduling. Long-Term Scheduling. Chapter 9. Response time Throughput Processor efficiency Uniprocessor Scheduling Chapter 9 Aim of Scheduling Response time Throughput Processor efficiency Types of Scheduling Long-Term Scheduling Determines which programs are admitted to the system for processing

More information

Operating Systems CMPSCI 377 Spring Mark Corner University of Massachusetts Amherst

Operating Systems CMPSCI 377 Spring Mark Corner University of Massachusetts Amherst Operating Systems CMPSCI 377 Spring 2017 Mark Corner University of Massachusetts Amherst Multilevel Feedback Queues (MLFQ) Multilevel feedback queues use past behavior to predict the future and assign

More information

Strong Scaling for Molecular Dynamics Applications

Strong Scaling for Molecular Dynamics Applications Strong Scaling for Molecular Dynamics Applications Scaling and Molecular Dynamics Strong Scaling How does solution time vary with the number of processors for a fixed problem size Classical Molecular Dynamics

More information

Modeling Software with SystemC 3.0

Modeling Software with SystemC 3.0 Modeling Software with SystemC 3.0 Thorsten Grötker Synopsys, Inc. 6 th European SystemC Users Group Meeting Stresa, Italy, October 22, 2002 Agenda Roadmap Why Software Modeling? Today: What works and

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs

More information

Boosting Quasi-Asynchronous I/Os (QASIOs)

Boosting Quasi-Asynchronous I/Os (QASIOs) Boosting Quasi-hronous s (QASIOs) Joint work with Daeho Jeong and Youngjae Lee Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu The Problem 2 Why?

More information

A Predictable RTOS. Mantis Cheng Department of Computer Science University of Victoria

A Predictable RTOS. Mantis Cheng Department of Computer Science University of Victoria A Predictable RTOS Mantis Cheng Department of Computer Science University of Victoria Outline I. Analysis of Timeliness Requirements II. Analysis of IO Requirements III. Time in Scheduling IV. IO in Scheduling

More information

CPU Scheduling Algorithms

CPU Scheduling Algorithms CPU Scheduling Algorithms Notice: The slides for this lecture have been largely based on those accompanying the textbook Operating Systems Concepts with Java, by Silberschatz, Galvin, and Gagne (2007).

More information

Real-Time Operating Systems Issues. Realtime Scheduling in SunOS 5.0

Real-Time Operating Systems Issues. Realtime Scheduling in SunOS 5.0 Real-Time Operating Systems Issues Example of a real-time capable OS: Solaris. S. Khanna, M. Sebree, J.Zolnowsky. Realtime Scheduling in SunOS 5.0. USENIX - Winter 92. Problems with the design of general-purpose

More information

High-Performance Data Loading and Augmentation for Deep Neural Network Training

High-Performance Data Loading and Augmentation for Deep Neural Network Training High-Performance Data Loading and Augmentation for Deep Neural Network Training Trevor Gale tgale@ece.neu.edu Steven Eliuk steven.eliuk@gmail.com Cameron Upright c.upright@samsung.com Roadmap 1. The General-Purpose

More information

OVERVIEW. Last Week: But if frequency of high priority task increases temporarily, system may encounter overload: Today: Slide 1. Slide 3.

OVERVIEW. Last Week: But if frequency of high priority task increases temporarily, system may encounter overload: Today: Slide 1. Slide 3. OVERVIEW Last Week: Scheduling Algorithms Real-time systems Today: But if frequency of high priority task increases temporarily, system may encounter overload: Yet another real-time scheduling algorithm

More information

Antonio R. Miele Marco D. Santambrogio

Antonio R. Miele Marco D. Santambrogio Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction First

More information

An Overview of MIPS Multi-Threading. White Paper

An Overview of MIPS Multi-Threading. White Paper Public Imagination Technologies An Overview of MIPS Multi-Threading White Paper Copyright Imagination Technologies Limited. All Rights Reserved. This document is Public. This publication contains proprietary

More information

Interrupt Service Threads - A New Approach to Handle Multiple Hard Real-Time Events on a Multithreaded Microcontroller

Interrupt Service Threads - A New Approach to Handle Multiple Hard Real-Time Events on a Multithreaded Microcontroller Interrupt Service Threads - A New Approach to Handle Multiple Hard Real-Time Events on a Multithreaded Microcontroller U. Brinkschulte, C. Krakowski J. Kreuzinger, Th. Ungerer Institute of Process Control,

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Uniprocessor Scheduling

Uniprocessor Scheduling Uniprocessor Scheduling Chapter 9 Operating Systems: Internals and Design Principles, 6/E William Stallings Patricia Roy Manatee Community College, Venice, FL 2008, Prentice Hall CPU- and I/O-bound processes

More information

Mattan Erez. The University of Texas at Austin

Mattan Erez. The University of Texas at Austin EE382V (17325): Principles in Computer Architecture Parallelism and Locality Fall 2007 Lecture 12 GPU Architecture (NVIDIA G80) Mattan Erez The University of Texas at Austin Outline 3D graphics recap and

More information

Processes and Threads

Processes and Threads Processes and Threads Giuseppe Anastasi g.anastasi@iet.unipi.it Pervasive Computing & Networking Lab. () Dept. of Information Engineering, University of Pisa Based on original slides by Silberschatz, Galvin

More information

SOFT CONTAINER TOWARDS 100% RESOURCE UTILIZATION ACCELA ZHAO, LAYNE PENG

SOFT CONTAINER TOWARDS 100% RESOURCE UTILIZATION ACCELA ZHAO, LAYNE PENG SOFT CONTAINER TOWARDS 100% RESOURCE UTILIZATION ACCELA ZHAO, LAYNE PENG 1 WHO ARE THOSE GUYS Accela Zhao, Technologist at EMC OCTO, active Openstack community contributor, experienced in cloud scheduling

More information

TEGRA LINUX DRIVER PACKAGE R23.2

TEGRA LINUX DRIVER PACKAGE R23.2 TEGRA LINUX DRIVER PACKAGE R23.2 RN_05071-R23 February 25, 2016 Advance Information Subject to Change Release Notes RN_05071-R23 TABLE OF CONTENTS 1.0 ABOUT THIS RELEASE... 3 1.1 What s New... 3 1.2 Login

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs

More information

THREADS ADMINISTRIVIA RECAP ALTERNATIVE 2 EXERCISES PAPER READING MICHAEL ROITZSCH 2

THREADS ADMINISTRIVIA RECAP ALTERNATIVE 2 EXERCISES PAPER READING MICHAEL ROITZSCH 2 Department of Computer Science Institute for System Architecture, Operating Systems Group THREADS ADMINISTRIVIA MICHAEL ROITZSCH 2 EXERCISES due to date and room clashes we have to divert from our regular

More information

Lecture 10. Efficient Host-Device Data Transfer

Lecture 10. Efficient Host-Device Data Transfer 1 Lecture 10 Efficient Host-Device Data fer 2 Objective To learn the important concepts involved in copying (transferring) data between host and device System Interconnect Direct Memory Access Pinned memory

More information