Making OpenVX Really Real Time
1 Making OpenVX Really Real Time. Ming Yang¹, Tanya Amert¹, Kecheng Yang¹,², Nathan Otterness¹, James H. Anderson¹, F. Donelson Smith¹, and Shige Wang³. ¹The University of North Carolina at Chapel Hill, ²Texas State University, ³General Motors Research.
2 700 ms
4 A new approach for graph scheduling
5 Shorter response time + Less capacity loss
6 1. State of the art; 2. Our approach; 3. Future work
7 Example OpenVX graph (figure): native camera control, OpenVX nodes, downstream application processing. Graph-based architecture. Portability to diverse hardware: GPU, FPGA, DSP. Does OpenVX really target real-time processing?
8 Does OpenVX really target real-time processing? 1. It lacks real-time concepts. 2. Entire graphs = monolithic schedulable entities.
9 Monolithic scheduling (figure): a four-node graph (A, B, C, D) is scheduled as a single entity; the timeline shows A, B, C, D, then A again.
11 Prior work: coarse-grained scheduling. OpenVX nodes = schedulable entities [23, 51]. Figure: each node of the graph (A, B, C, D) becomes its own task (Task A, Task B, Task C, Task D), shown on a timeline.
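The difference between treating a whole graph as one schedulable entity and treating each node as its own task can be illustrated with a toy model. This is only a sketch under assumptions that are not in the slides: the graph is a simple chain, every node has a fixed cost, invocations are released periodically, and there are always enough processors. The function names and the numbers are hypothetical.

```python
def monolithic_response_times(costs, period, num_jobs):
    """Whole graph is one sequential task: invocation j cannot start
    until invocation j-1 has finished every node."""
    total = sum(costs)
    finish_prev = 0
    out = []
    for j in range(num_jobs):
        release = j * period
        start = max(release, finish_prev)
        finish_prev = start + total
        out.append(finish_prev - release)
    return out


def per_node_response_times(costs, period, num_jobs):
    """Each node is its own sequential task (assumed chain of nodes,
    assumed enough processors that a ready node never waits for a core)."""
    prev_job_finish = [0] * len(costs)  # finish time of each node's previous job
    out = []
    for j in range(num_jobs):
        release = j * period
        ready = release
        for i, cost in enumerate(costs):
            # A node waits for its predecessor in this invocation and for
            # its own previous job to finish.
            start = max(ready, prev_job_finish[i])
            prev_job_finish[i] = start + cost
            ready = prev_job_finish[i]
        out.append(ready - release)
    return out


# Hypothetical chain A -> B -> C -> D with unit costs, released every 2 time units.
print(monolithic_response_times([1, 1, 1, 1], 2, 3))  # [4, 6, 8]
print(per_node_response_times([1, 1, 1, 1], 2, 3))    # [4, 4, 4]
```

In this toy setting the monolithic response times grow without bound because one invocation takes longer than the period, while node-level tasks let successive invocations overlap.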
12 Prior work: coarse-grained scheduling. OpenVX nodes = schedulable entities [23, 51]. Remaining problems: 1. More parallelism remains to be exploited. 2. Suspension-oblivious analysis was applied, which causes capacity loss.
13 This work: fine-grained scheduling
14 1. Coarse-grained vs. fine-grained; 2. Response-time bounds analysis; 3. Case study
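16 Figure, coarse-grained scheduling: graph with nodes A, B, C, D scheduled as Task A, Task B, Task C, Task D over time, with a suspension for GPU execution. Figure, fine-grained scheduling: graph with nodes A, C, E, F, G, D scheduled as Task A, Task E, Task F, Task G, Task C, Task D over time, with the GPU execution shown in the schedule.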
18 Deriving response-time bounds for a DAG*. Step 1: Schedule the nodes as sporadic tasks. Step 2: Compute bounds for every node. Step 3: Sum the bounds of nodes on the critical path. * C. Liu and J. Anderson, "Supporting Soft Real-Time DAG-based Systems on Multiprocessors with No Utilization Loss," in RTSS, 2013.
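Step 3 can be sketched in a few lines. This is a minimal illustration, not the analysis from the cited paper: it assumes the per-node bounds from Step 2 are already known, and it takes "critical path" to mean the source-to-sink path with the largest sum of bounds. The graph, the bound values, and the function name are hypothetical.

```python
from functools import lru_cache


def end_to_end_bound(node_bounds, edges):
    """Largest sum of per-node bounds over any path through the DAG."""
    successors = {node: [] for node in node_bounds}
    for src, dst in edges:
        successors[src].append(dst)

    @lru_cache(maxsize=None)
    def longest_from(node):
        # Bound of this node plus the worst continuation (0 at a sink).
        tail = max((longest_from(s) for s in successors[node]), default=0)
        return node_bounds[node] + tail

    return max(longest_from(node) for node in node_bounds)


# Hypothetical per-node response-time bounds (ms) and a diamond-shaped DAG.
bounds = {"A": 5, "B": 7, "C": 3, "D": 4}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(end_to_end_bound(bounds, edges))  # 16, along A -> B -> D
```

The memoized recursion visits each node once, so the cost is linear in the number of nodes and edges.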
19 Deriving response-time bounds for a DAG (figure: example DAG with nodes A, B, C, D, E, F). Nodes A, B, C, F, and D run on the CPU; node E runs on the GPU. Need a response-time bound analysis for GPU tasks.
22 A system model of GPU tasks: τ_i = (C_i, T_i, B_i, H_i), where C_i is the per-block worst-case workload, T_i is the period, B_i is the number of blocks, and H_i is the number of threads per block (the block size). Example (figure: blocks executing on SMs over time): τ_1 = (3076, 6, 2, 1024), so H_1 = 1024.
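The four task parameters map naturally onto a small record type. The sketch below only restates the tuple from the slide; the class name, the field comments, and the derived `threads_per_job` property are illustrative assumptions rather than part of the authors' model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuTask:
    C: int  # per-block worst-case workload
    T: int  # period
    B: int  # number of blocks
    H: int  # number of threads per block (block size)

    @property
    def threads_per_job(self):
        # Illustrative derived quantity: every job launches B blocks of H threads.
        return self.B * self.H


# The example task from the slide, in the order (C, T, B, H).
tau1 = GpuTask(C=3076, T=6, B=2, H=1024)
print(tau1.threads_per_job)  # 2048
```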
23 Response-time bounds proof sketch. 1. We first show, via counterexamples, the necessity of a total utilization bound and of intra-task parallelism (figure: schedules for the same releases without and with intra-task parallelism). 2. We then bound the unfinished workload from jobs released at or before r_{k,j} (figure: job τ_{k,j} on SM0 and SM1). 3. We prove the job finishes before r_{k,j} + R_k.
27 Case study: comparing fine-grained, coarse-grained, and monolithic scheduling. Application: Histogram of Oriented Gradients (HOG). Figure: three pipelines, each with a vxHOGCellsNode and a vxHOGFeaturesNode; stages shown: Resize Image, Compute Gradients, Compute Orientation Histograms, Normalize Orientation Histograms. Legend: CPU+GPU execution (coarse-grained), GPU execution (fine-grained).
28 Case study: comparing fine-grained, coarse-grained, and monolithic scheduling. Application: Histogram of Oriented Gradients (HOG); 6 instances; 33 ms period; 30,000 samples. Platform: NVIDIA Titan V GPU + two eight-core Intel CPUs. Schedulers: G-EDF, G-FL (fair-lateness).
29 Case study: comparing fine-grained, coarse-grained, and monolithic scheduling. Figure: cumulative distribution of response times (x-axis: time; y-axis: % of samples). Annotation: 50% of samples have a response time less than 60 ms. Left is better.
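Each point on such a distribution plot answers the question "what fraction of samples finished within this time?". A minimal sketch of that computation follows; the sample values are made up for illustration and are not measurements from the case study.

```python
def fraction_below(samples_ms, threshold_ms):
    """Fraction of response-time samples strictly below the threshold."""
    return sum(1 for s in samples_ms if s < threshold_ms) / len(samples_ms)


# Hypothetical response-time samples in milliseconds.
samples = [40, 50, 55, 58, 62, 70, 80, 90]
print(fraction_below(samples, 60))  # 0.5, i.e. 50% of samples are below 60 ms
```

A curve that lies further left reaches any given fraction at a smaller response time, which is why left is better.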
30 Case study: comparing fine-grained, coarse-grained, and monolithic scheduling (FL: fair-lateness). Configurations: [1] fine-grained (G-FL), [2] coarse-grained (G-EDF), [3] monolithic (G-EDF). Table columns: average response time (ms), maximum response time (ms), analytical bound (ms); two analytical-bound entries are N/A. Annotations on [1] and [2]: half the average response time; one-third the maximum response time. An alert driver takes 700 ms to react. The fair-lateness-based scheduler is beneficial, as it reduced node response times by up to 9.9%. The overhead of supporting fine-grained scheduling was 14.15%.
40 Conclusions: 1. Fine-grained scheduling; 2. Response-time bounds analysis for GPU tasks; 3. Case study
41 Future work: 1. Cycles in the graph; 2. Other resource constraints; 3. Schedulability studies
42 Thanks!
Supporting Nested Locking in Multiprocessor Real-Time Systems Bryan C. Ward and James H. Anderson Department of Computer Science, University of North Carolina at Chapel Hill Abstract This paper presents
More informationPerformance potential for simulating spin models on GPU
Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational
More informationDeep Learning Requirements for Autonomous Vehicles
Deep Learning Requirements for Autonomous Vehicles Pierre Paulin, Director of R&D Synopsys Inc. Chipex, 1 May 2018 1 Agenda Deep Learning and Convolutional Neural Networks for Embedded Vision Automotive
More informationFlexible Architecture Research Machine (FARM)
Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense
More informationMulti-Resource Real-Time Reader/Writer Locks for Multiprocessors
Multi-Resource Real-Time Reader/Writer Locks for Multiprocessors Bryan C. Ward and James H. Anderson Dept. of Computer Science, The University of North Carolina at Chapel Hill Abstract A fine-grained locking
More informationFCUDA: Enabling Efficient Compilation of CUDA Kernels onto
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:
More informationThe Limitations of Fixed-Priority Interrupt Handling in PREEMPT RT and Alternative Approaches
The Limitations of Fixed-Priority Interrupt Handling in PREEMPT RT and Alternative Approaches Glenn A. Elliott and James H. Anderson Department of Computer Science, University of North Carolina at Chapel
More informationCoordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin
Coordinating More Than 3 Million CUDA Threads for Social Network Analysis Adam McLaughlin Applications of interest Computational biology Social network analysis Urban planning Epidemiology Hardware verification
More informationReal Time Operating Systems and Middleware
Real Time Operating Systems and Middleware Introduction to Real-Time Systems Luca Abeni abeni@disi.unitn.it Credits: Luigi Palopoli, Giuseppe Lipari, Marco Di Natale, and Giorgio Buttazzo Scuola Superiore
More informationFCUDA: Enabling Efficient Compilation of CUDA Kernels onto
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:
More informationTraffic Sign Localization and Classification Methods: An Overview
Traffic Sign Localization and Classification Methods: An Overview Ivan Filković University of Zagreb Faculty of Electrical Engineering and Computing Department of Electronics, Microelectronics, Computer
More information