Making OpenVX Really Real Time

1 Making OpenVX Really Real Time
Ming Yang (1), Tanya Amert (1), Kecheng Yang (1,2), Nathan Otterness (1), James H. Anderson (1), F. Donelson Smith (1), and Shige Wang (3)
(1) The University of North Carolina at Chapel Hill, (2) Texas State University, (3) General Motors Research

2 700 ms

4 A new approach for graph scheduling

5 Shorter response time + Less capacity loss

6
1. State of the art
2. Our approach
3. Future work

7 Example OpenVX Graph
Graph-based architecture
[Figure: Native Camera Control, four OpenVX Nodes, Downstream Application Processing]
Portability to diverse hardware
[Figure: applications running on GPU, FPGA, and DSP]
Does OpenVX really target real-time processing?

8-10 Does OpenVX really target real-time processing?
1. It lacks real-time concepts
2. Entire graphs = monolithic schedulable entities
[Figure: example OpenVX graph with nodes A, B, C, D; monolithic-scheduling timeline showing A, B, C, D, then A]

11-12 Prior Work: Coarse-grained scheduling
OpenVX nodes = schedulable entities [23, 51]
[Figure: graph with nodes A, B, C, D; each node is its own task (Task A to Task D), and the timeline shows two jobs of each task]
Remaining problems:
1. More parallelism to be explored
2. Suspension-oblivious analysis was applied, causing capacity loss
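The capacity loss named in the second problem can be illustrated with a small calculation. The sketch below assumes the usual meaning of suspension-oblivious analysis, namely that time spent suspended (for example, waiting on the GPU) is counted as if it were CPU execution. The function name and the numbers are hypothetical and do not come from the talk.

```python
def cpu_utilization(cpu_time, suspension_time, period, suspension_oblivious):
    """Utilization charged to the CPU for one task.

    Suspension-oblivious analysis (assumed definition) treats the time a task
    spends suspended as CPU demand; suspension-aware accounting does not.
    """
    demand = (cpu_time + suspension_time) if suspension_oblivious else cpu_time
    return demand / period


# Hypothetical task: 10 ms on the CPU, 20 ms suspended on the GPU, 40 ms period.
oblivious = cpu_utilization(10, 20, 40, suspension_oblivious=True)
aware = cpu_utilization(10, 20, 40, suspension_oblivious=False)
capacity_loss = oblivious - aware  # CPU capacity reserved but never used

print(oblivious, aware, capacity_loss)
```

With these made-up numbers, half of a processor is reserved for time the task actually spends off the CPU.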

13 This Work: Fine-Grained Scheduling

14-15
1. Coarse-grained vs. fine-grained
2. Response-time bounds analysis
3. Case study

16 Coarse-Grained Scheduling vs. Fine-Grained Scheduling
[Figure, coarse-grained: nodes A, B, C, D scheduled as Tasks A to D, with a suspension for GPU execution]
[Figure, fine-grained: nodes A, C, D, E, F, G scheduled as Tasks A, E, F, G, C, D, with the GPU execution shown in the schedule]

18 Deriving Response-Time Bounds for a DAG*
Step 1: Schedule the nodes as sporadic tasks
Step 2: Compute bounds for every node
Step 3: Sum the bounds of nodes on the critical path
* C. Liu and J. Anderson, "Supporting Soft Real-Time DAG-based Systems on Multiprocessors with No Utilization Loss," in RTSS, 2013.

19-20 Deriving Response-Time Bounds for a DAG
[Figure: example DAG with nodes A, B, C, D, E, F]
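Step 3 of the procedure can be sketched as a longest-path computation over the DAG, where each node is weighted by its own response-time bound. This is a minimal illustration only: the graph shape, the node bounds, and the function name are hypothetical, and the per-node bounds from Step 2 are taken as given inputs rather than derived.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def dag_response_time_bound(node_bound, predecessors):
    """Largest sum of per-node bounds along any path through the DAG.

    node_bound:   dict mapping node -> response-time bound of that node
    predecessors: dict mapping node -> list of nodes it depends on
    """
    finish = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(predecessors).static_order():
        ready = max((finish[p] for p in predecessors.get(node, ())), default=0.0)
        finish[node] = ready + node_bound[node]
    return max(finish.values())


# Hypothetical four-node graph: A feeds B and C, which both feed D.
bounds = {"A": 5, "B": 7, "C": 4, "D": 6}
preds = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(dag_response_time_bound(bounds, preds))  # critical path is A -> B -> D
```

In this made-up graph the critical path is A, B, D, so the end-to-end bound is the sum of those three node bounds.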

21 Deriving Response-Time Bounds for a DAG
[Figure: nodes A, B, C, D, F run on the CPU; node E runs on the GPU]
Need a response-time bound analysis for GPU tasks

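22 A system model of GPU tasks
$\tau_i = (C_i, T_i, B_i, H_i)$
$C_i$: per-block worst-case workload
$T_i$: period
$B_i$: number of blocks
$H_i$: number of threads per block (or block size)
Example: $\tau_1 = (3076, 6, 2, 1024)$, so $H_1 = 1024$
[Figure: timeline of the blocks of $\tau_1$ executing on two SMs, annotated with $C_1$, $T_1$, $B_1$, and $H_1$]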
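The four-parameter task model maps naturally onto a small record type. The sketch below is illustrative: the class and method names are hypothetical, the per-job workload is simply the number of blocks times the per-block workload, and the utilization formula (per-job workload divided by period) is an assumption rather than a definition taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuTask:
    per_block_workload: float  # C_i: per-block worst-case workload
    period: float              # T_i: period
    num_blocks: int            # B_i: number of blocks
    threads_per_block: int     # H_i: threads per block (block size)

    def job_workload(self):
        """Worst-case workload of one job: all blocks together."""
        return self.num_blocks * self.per_block_workload

    def utilization(self):
        """Assumed definition: per-job workload divided by the period."""
        return self.job_workload() / self.period


# The example task from the slide, with values copied as transcribed.
tau_1 = GpuTask(per_block_workload=3076, period=6, num_blocks=2,
                threads_per_block=1024)
print(tau_1.job_workload(), tau_1.utilization())
```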

23-25 Response-Time Bounds Proof Sketch
1. We first show the necessity of a total utilization bound and intra-task parallelism via counterexamples.
[Figure: job releases and the resulting schedules without and with intra-task parallelism]
2. We then bound the unfinished workload from jobs released at or before $r_{k,j}$.
3. We prove that job $\tau_{k,j}$ finishes before $r_{k,j} + R_k$.
[Figure: schedule on SM0 and SM1 showing job $\tau_{k,j}$, its release time $r_{k,j}$, and the bound $R_k$]
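One way to see why intra-task parallelism matters is a toy simulation, shown below. It is not the counterexample from the talk: it assumes a single task whose jobs each take longer than the period and assumes enough free processing capacity that a job never waits on other tasks. All names and numbers are hypothetical.

```python
def response_times(period, exec_time, num_jobs, intra_task_parallelism):
    """Response time of each job of one periodically released task.

    Without intra-task parallelism, a job cannot start until the previous
    job of the same task has finished. With it, jobs of the same task may
    overlap, so each job starts at its release (capacity assumed available).
    """
    responses = []
    prev_finish = 0
    for j in range(num_jobs):
        release = j * period
        start = release if intra_task_parallelism else max(release, prev_finish)
        finish = start + exec_time
        responses.append(finish - release)
        prev_finish = finish
    return responses


# Hypothetical task: released every 2 time units, each job runs for 3.
print(response_times(2, 3, 4, intra_task_parallelism=False))  # grows per job
print(response_times(2, 3, 4, intra_task_parallelism=True))   # stays constant
```

In the serialized case each job inherits the backlog of its predecessor, so no finite response-time bound exists for this task; allowing consecutive jobs to overlap removes the backlog.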

27 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
Application: Histogram of Oriented Gradients (HOG)
[Figure: the HOG pipeline built from vxHOGCellsNode and vxHOGFeaturesNode. Stages: Resize Image, Compute Gradients, Compute Orientation Histograms, Normalize Orientation Histograms. Shown with CPU+GPU execution (coarse-grained) and GPU execution (fine-grained).]

28 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
Application: Histogram of Oriented Gradients (HOG)
6 instances, 33 ms period, 30,000 samples
Platform: NVIDIA Titan V GPU + two eight-core Intel CPUs
Schedulers: G-EDF, G-FL (fair-lateness)

29 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
[Figure: cumulative distribution of response times; x-axis: time, y-axis: % of samples. Left is better. Reading example: 50% of samples have a response time of less than 60 ms.]
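A point on such a cumulative-distribution plot is the fraction of measured response times that fall below a given value. The sketch below shows that computation; the function name and the sample values are hypothetical and are not measurements from the case study.

```python
def fraction_below(samples_ms, threshold_ms):
    """Fraction of response-time samples strictly below the threshold."""
    return sum(1 for s in samples_ms if s < threshold_ms) / len(samples_ms)


# Hypothetical response times in milliseconds.
samples = [40, 55, 58, 62, 70, 90]
print(fraction_below(samples, 60))
```

A curve that lies further to the left reaches a given fraction at a smaller response time, which is why the slide notes that left is better.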

30-39 Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
[Figure: response-time distributions for the three configurations]
[1] Fine-grained (G-FL)
[2] Coarse-grained (G-EDF)
[3] Monolithic (G-EDF)
FL: fair-lateness
[Table: Average Response Time (ms), Maximum Response Time (ms), and Analytical Bound (ms) for [1], [2], and [3]; the Analytical Bound is N/A for two of the configurations]
Comparing [1] with [2]: half the average response time and one-third the maximum response time.
An alert driver takes 700 ms to react.
The fair-lateness-based scheduler is beneficial: it reduced node response times by up to 9.9%.
The overhead of supporting fine-grained scheduling was 14.15%.

40 Conclusions
1. Fine-grained scheduling
2. Response-time bounds analysis for GPU tasks
3. Case study

41 Future Work
1. Cycles in the graph
2. Other resource constraints
3. Schedulability studies

42 Thanks!

Scaling Up: The Validation of Empirically Derived Scheduling Rules on NVIDIA GPUs

Scaling Up: The Validation of Empirically Derived Scheduling Rules on NVIDIA GPUs Scaling Up: The Validation of Empirically Derived Scheduling Rules on NVIDIA GPUs Joshua Bakita Department of Computer Science, University of North Carolina at Chapel Hill 14th Annual Workshop on Operating

More information

Empirical Approximation and Impact on Schedulability

Empirical Approximation and Impact on Schedulability Cache-Related Preemption and Migration Delays: Empirical Approximation and Impact on Schedulability OSPERT 2010, Brussels July 6, 2010 Andrea Bastoni University of Rome Tor Vergata Björn B. Brandenburg

More information

Nested Multiprocessor Real-Time Locking with Improved Blocking

Nested Multiprocessor Real-Time Locking with Improved Blocking Nested Multiprocessor Real-Time Locking with Improved Blocking Bryan C. Ward and James H. Anderson Department of Computer Science, University of North Carolina at Chapel Hill Abstract Existing multiprocessor

More information

A Server-based Approach for Predictable GPU Access Control

A Server-based Approach for Predictable GPU Access Control A Server-based Approach for Predictable GPU Access Control Hyoseung Kim * Pratyush Patel Shige Wang Raj Rajkumar * University of California, Riverside Carnegie Mellon University General Motors R&D Benefits

More information

Inferring Scheduling Policies of an Embedded CUDA GPU

Inferring Scheduling Policies of an Embedded CUDA GPU Inferring Scheduling Policies of an Embedded CUDA GPU Nathan Otterness, Ming Yang, Tanya Amert, James H. Anderson, F. Donelson Smith 1 Department of Computer Science, University of North Carolina at Chapel

More information

Multiprocessor Real-Time Locking Protocols: from Homogeneous to Heterogeneous

Multiprocessor Real-Time Locking Protocols: from Homogeneous to Heterogeneous Multiprocessor Real-Time Locking Protocols: from Homogeneous to Heterogeneous Kecheng Yang Department of Computer Science, University of North Carolina at Chapel Hill Abstract In this project, we focus

More information

Partitioned Fixed-Priority Scheduling of Parallel Tasks Without Preemptions

Partitioned Fixed-Priority Scheduling of Parallel Tasks Without Preemptions Partitioned Fixed-Priority Scheduling of Parallel Tasks Without Preemptions *, Alessandro Biondi *, Geoffrey Nelissen, and Giorgio Buttazzo * * ReTiS Lab, Scuola Superiore Sant Anna, Pisa, Italy CISTER,

More information

Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems

Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems Ming Yang,, Tanya Amert, Joshua Bakita, James H. Anderson, F. Donelson Smith All image sources and references are provided

More information

Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces. Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S.

Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces. Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S. Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S. Vazhkudai Instance of Large-Scale HPC Systems ORNL s TITAN (World

More information

The case for limited-preemptive scheduling in GPUs for real-time systems

The case for limited-preemptive scheduling in GPUs for real-time systems The case for limited-preemptive scheduling in GPUs for real-time systems Roy Spliet Robert Mullins (first.last@cst.cam.ac.uk) Department of Computer Science and Technology University of Cambridge GPUs

More information

Blocking Analysis of FIFO, Unordered, and Priority-Ordered Spin Locks

Blocking Analysis of FIFO, Unordered, and Priority-Ordered Spin Locks On Spin Locks in AUTOSAR: Blocking Analysis of FIFO, Unordered, and Priority-Ordered Spin Locks Alexander Wieder and Björn Brandenburg MPI-SWS RTSS 2013 12/04/2013 Vancouver, Canada Motivation: AUTOSAR:

More information

The OpenVX Computer Vision and Neural Network Inference

The OpenVX Computer Vision and Neural Network Inference The OpenVX Computer and Neural Network Inference Standard for Portable, Efficient Code Radhakrishna Giduthuri Editor, OpenVX Khronos Group radha.giduthuri@amd.com @RadhaGiduthuri Copyright 2018 Khronos

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Motivation And Intro Programming Model Spark Data Transformation Model Construction Model Training Model Inference Execution Model Data Parallel Training

More information

Scheduling of Parallel Real-time DAG Tasks on Multiprocessor Systems

Scheduling of Parallel Real-time DAG Tasks on Multiprocessor Systems Scheduling of Parallel Real-time DAG Tasks on Multiprocessor Systems Laurent George ESIEE Paris Journée du groupe de travail OVSTR - 23 mai 2016 Université Paris-Est, LRT Team at LIGM 1/53 CONTEXT: REAL-TIME

More information

Standards for Vision Processing and Neural Networks

Standards for Vision Processing and Neural Networks Copyright Khronos Group 2017 - Page 1 Standards for Vision Processing and Neural Networks Radhakrishna Giduthuri, AMD radha.giduthuri@ieee.org Agenda Why we need a standard? Khronos NNEF Khronos OpenVX

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

Fast Tridiagonal Solvers on GPU

Fast Tridiagonal Solvers on GPU Fast Tridiagonal Solvers on GPU Yao Zhang John Owens UC Davis Jonathan Cohen NVIDIA GPU Technology Conference 2009 Outline Introduction Algorithms Design algorithms for GPU architecture Performance Bottleneck-based

More information

Fixed-Priority Multiprocessor Scheduling

Fixed-Priority Multiprocessor Scheduling Fixed-Priority Multiprocessor Scheduling Real-time Systems N periodic tasks (of different rates/periods) r i T i C i T i C C J i Ji i ij i r i r i r i Utilization/workload: How to schedule the jobs to

More information

Fast BVH Construction on GPUs

Fast BVH Construction on GPUs Fast BVH Construction on GPUs Published in EUROGRAGHICS, (2009) C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, D. Manocha University of North Carolina at Chapel Hill NVIDIA University of California

More information

Fixed-Priority Multiprocessor Scheduling. Real-time Systems. N periodic tasks (of different rates/periods) i Ji C J. 2 i. ij 3

Fixed-Priority Multiprocessor Scheduling. Real-time Systems. N periodic tasks (of different rates/periods) i Ji C J. 2 i. ij 3 0//0 Fixed-Priority Multiprocessor Scheduling Real-time Systems N periodic tasks (of different rates/periods) r i T i C i T i C C J i Ji i ij i r i r i r i Utilization/workload: How to schedule the jobs

More information

On Latency Management in Time-Shared Operating Systems *

On Latency Management in Time-Shared Operating Systems * On Latency Management in Time-Shared Operating Systems * Kevin Jeffay University of North Carolina at Chapel Hill Department of Computer Science Chapel Hill, NC 27599-3175 jeffay@cs.unc.edu Abstract: The

More information

Reducing Response-Time Bounds for DAG-Based Task Systems on Heterogeneous Multicore Platforms

Reducing Response-Time Bounds for DAG-Based Task Systems on Heterogeneous Multicore Platforms Reducing Response-Time Bounds for DAG-Based Task Systems on Heterogeneous Multicore Platforms Kecheng Yang, Ming Yang, and James H. Anderson Department of Computer Science, University of North Carolina

More information

Parallel Scheduling for Cyber-Physical Systems: Analysis and Case Study on a Self-Driving Car

Parallel Scheduling for Cyber-Physical Systems: Analysis and Case Study on a Self-Driving Car Parallel Scheduling for Cyber-Physical Systems: Analysis and Case Study on a Self-Driving Car Junsung Kim, Hyoseung Kim, Karthik Lakshmanan and Raj Rajkumar Carnegie Mellon University Google 2 CMU s Autonomous

More information

GPU 101. Mike Bailey. Oregon State University. Oregon State University. Computer Graphics gpu101.pptx. mjb April 23, 2017

GPU 101. Mike Bailey. Oregon State University. Oregon State University. Computer Graphics gpu101.pptx. mjb April 23, 2017 1 GPU 101 Mike Bailey mjb@cs.oregonstate.edu gpu101.pptx Why do we care about GPU Programming? A History of GPU Performance vs. CPU Performance 2 Source: NVIDIA How Can You Gain Access to GPU Power? 3

More information

GPU 101. Mike Bailey. Oregon State University

GPU 101. Mike Bailey. Oregon State University 1 GPU 101 Mike Bailey mjb@cs.oregonstate.edu gpu101.pptx Why do we care about GPU Programming? A History of GPU Performance vs. CPU Performance 2 Source: NVIDIA 1 How Can You Gain Access to GPU Power?

More information

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Accelerating String Matching Using Multi-threaded Algorithm

Accelerating String Matching Using Multi-threaded Algorithm Accelerating String Matching Using Multi-threaded Algorithm on GPU Cheng-Hung Lin*, Sheng-Yu Tsai**, Chen-Hsiung Liu**, Shih-Chieh Chang**, Jyuo-Min Shyu** *National Taiwan Normal University, Taiwan **National

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems

Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems Ming Yang 1, Nathan Otterness 2, Tanya Amert 3, Joshua Bakita 4, James H. Anderson 5, and F. Donelson Smith 6 1 Department

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

arxiv: v2 [cs.ds] 22 Jun 2016

arxiv: v2 [cs.ds] 22 Jun 2016 Federated Scheduling Admits No Constant Speedup Factors for Constrained-Deadline DAG Task Systems Jian-Jia Chen Department of Informatics, TU Dortmund University, Germany arxiv:1510.07254v2 [cs.ds] 22

More information

Applying OpenCL. IWOCL, May Andrew Richards

Applying OpenCL. IWOCL, May Andrew Richards Applying OpenCL IWOCL, May 2017 Andrew Richards The next generation of software will not be built on CPUs 2 On a 100 millimetre-squared chip, Google needs something like 50 teraflops of performance - Daniel

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Ehsan Totoni Babak Behzad Swapnil Ghike Josep Torrellas

Ehsan Totoni Babak Behzad Swapnil Ghike Josep Torrellas Ehsan Totoni Babak Behzad Swapnil Ghike Josep Torrellas 2 Increasing number of transistors on chip Power and energy limited Single- thread performance limited => parallelism Many opeons: heavy mulecore,

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Replica-Request Priority Donation: A Real-Time Progress Mechanism for Global Locking Protocols

Replica-Request Priority Donation: A Real-Time Progress Mechanism for Global Locking Protocols Replica-Request Priority Donation: A Real-Time Progress Mechanism for Global Locking Protocols Bryan C. Ward, Glenn A. Elliott, and James H. Anderson Department of Computer Science, University of North

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

TVA: A DoS-limiting Network Architecture L

TVA: A DoS-limiting Network Architecture L DoS is not even close to be solved : A DoS-limiting Network Architecture L Xiaowei Yang (UC Irvine) David Wetherall (Univ. of Washington) Thomas Anderson (Univ. of Washington) 1 n Address validation is

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this

More information

Hybrid EDF Packet Scheduling for Real-Time Distributed Systems

Hybrid EDF Packet Scheduling for Real-Time Distributed Systems Hybrid EDF Packet Scheduling for Real-Time Distributed Systems Tao Qian 1, Frank Mueller 1, Yufeng Xin 2 1 North Carolina State University, USA 2 RENCI, University of North Carolina at Chapel Hill This

More information

A MEMORY UTILIZATION AND ENERGY SAVING MODEL FOR HPC APPLICATIONS

A MEMORY UTILIZATION AND ENERGY SAVING MODEL FOR HPC APPLICATIONS A MEMORY UTILIZATION AND ENERGY SAVING MODEL FOR HPC APPLICATIONS 1 Santosh Devi, 2 Radhika, 3 Parminder Singh 1,2 Student M.Tech (CSE), 3 Assistant Professor Lovely Professional University, Phagwara,

More information

A Fully Preemptive Multiprocessor Semaphore Protocol for Latency-Sensitive Real-Time Applications

A Fully Preemptive Multiprocessor Semaphore Protocol for Latency-Sensitive Real-Time Applications A Fully Preemptive Multiprocessor Semaphore Protocol for Latency-Sensitive Real-Time Applications ECRTS 13 July 12, 2013 Björn B. bbb@mpi-sws.org A Rhetorical Question On uniprocessors, why do we use the

More information

A Server-based Approach for Predictable GPU Access Control

A Server-based Approach for Predictable GPU Access Control A Server-based Approach for Predictable GPU Access Control Hyoseung Kim 1, Pratyush Patel 2, Shige Wang 3, and Ragunathan (Raj) Rajkumar 2 1 University of California, Riverside 2 Carnegie Mellon University

More information

Supporting Nested Locking in Multiprocessor Real-Time Systems

Supporting Nested Locking in Multiprocessor Real-Time Systems Supporting Nested Locking in Multiprocessor Real-Time Systems Bryan C. Ward and James H. Anderson Department of Computer Science, University of North Carolina at Chapel Hill Abstract This paper presents

More information

Real-Time Architectures 2004/2005

Real-Time Architectures 2004/2005 Real-Time Architectures 2004/2005 Scheduling Analysis I Introduction & Basic scheduling analysis Reinder J. Bril 08-04-2005 1 Overview Algorithm and problem classes Simple, periodic taskset problem statement

More information

Multimedia-Systems. Operating Systems. Prof. Dr.-Ing. Ralf Steinmetz Prof. Dr. rer. nat. Max Mühlhäuser Prof. Dr.-Ing. Wolfgang Effelsberg

Multimedia-Systems. Operating Systems. Prof. Dr.-Ing. Ralf Steinmetz Prof. Dr. rer. nat. Max Mühlhäuser Prof. Dr.-Ing. Wolfgang Effelsberg Multimedia-Systems Operating Systems Prof. Dr.-Ing. Ralf Steinmetz Prof. Dr. rer. nat. Max Mühlhäuser Prof. Dr.-Ing. Wolfgang Effelsberg WE: University of Mannheim, Dept. of Computer Science Praktische

More information

Scheduling Multi-Periodic Mixed-Criticality DAGs on Multi-Core Architectures

Scheduling Multi-Periodic Mixed-Criticality DAGs on Multi-Core Architectures Scheduling Multi-Periodic Mixed-Criticality DAGs on Multi-Core Architectures Roberto MEDINA Etienne BORDE Laurent PAUTET December 13, 2018 1/28 Outline Research Context Problem Statement Scheduling MC-DAGs

More information

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters *Argonne National Lab +BU & USTC Presented by Martin Herbordt Work by Ahmed

More information

UNIT -3 PROCESS AND OPERATING SYSTEMS 2marks 1. Define Process? Process is a computational unit that processes on a CPU under the control of a scheduling kernel of an OS. It has a process structure, called

More information

Faculty of Computer Science Institute of Systems Architecture, Operating Systems Group. vs. REAL-TIME SYSTEMS MICHAEL ROITZSCH

Faculty of Computer Science Institute of Systems Architecture, Operating Systems Group. vs. REAL-TIME SYSTEMS MICHAEL ROITZSCH Faculty of Computer Science Institute of Systems Architecture, Operating Systems Group REAL-TIME vs. SYSTEMS MICHAEL ROITZSCH DEFINITION system whose quality depends on the functional correctness of computations

More information

Reference Model and Scheduling Policies for Real-Time Systems

Reference Model and Scheduling Policies for Real-Time Systems ESG Seminar p.1/42 Reference Model and Scheduling Policies for Real-Time Systems Mayank Agarwal and Ankit Mathur Dept. of Computer Science and Engineering, Indian Institute of Technology Delhi ESG Seminar

More information

An Optimal k-exclusion Real-Time Locking Protocol Motivated by Multi-GPU Systems

An Optimal k-exclusion Real-Time Locking Protocol Motivated by Multi-GPU Systems An Optimal k-exclusion Real-Time Locking Protocol Motivated by Multi-GPU Systems Glenn A. Elliott and James H. Anderson Department of Computer Science, University of North Carolina at Chapel Hill Abstract

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs. Shai Bergman Tanya Brokhman Tzachi Cohen Mark Silberstein

SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs. Shai Bergman Tanya Brokhman Tzachi Cohen Mark Silberstein : Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and s Shai Bergman Tanya Brokhman Tzachi Cohen Mark Silberstein What do we do? Enable efficient file I/O for s Why? Support diverse

More information

Parallel Architectures

Parallel Architectures Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36

More information

Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi

Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

More information

Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and

Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and David A. Bader Motivation Real world graphs are challenging

More information
