EXPLICIT SYNCHRONIZATION


Lauri Peltonen, XDC, 8 October 2014

WHAT IS EXPLICIT SYNCHRONIZATION?

A fence is an abstract primitive that marks completion of an operation.

Implicit synchronization:
- Fences are attached to buffers
- The kernel manages fences automatically based on buffer read/write access
- Currently used by DRM (dma-buf fences)

Explicit synchronization:
- Fences are passed around independently of buffers
- The kernel takes and emits fences to/from user space when submitting work
- Currently used on Android (sync fence fd's)

ADVANTAGES

- Improved performance of bindless graphics APIs
- Better alignment with user space graphics APIs
- Allows parallel processing of user space suballocations
- Fits in nicely with explicit buffer handoffs

BINDLESS GRAPHICS PERF IMPROVEMENTS

- Bindless graphics and compute APIs allow building very large working sets
  that any given command buffer can reference
  - References can be by runtime-generated virtual address rather than
    slots or enums
- These working sets can be shared across multiple contexts or command queues
- Implicit sync may force serialization in these cases
- Locking and updating fences for every active buffer is costly
  - Working set sizes can be thousands of buffers

ALIGNS WITH USER SPACE GRAPHICS APIS

- Developers are demanding explicit control of driver behavior and hardware
  whenever possible
- Current-generation OpenGL is defined in terms of explicit synchronization
  - EGLSync, GLSync
- Hidden ordering dependencies and stalls caused by implicit sync are at odds
  with these design philosophies

USER SPACE SUBALLOCATION

- User space drivers and applications use suballocation for performance reasons
- By definition, the kernel has no visibility into this process
- Operations on separate portions of a buffer should be allowed to proceed in
  parallel, even if they reside in one kernel-visible buffer

EXPLICIT INTEROP HANDOFFS

- Modern processors have many specialized engines:
  - Video processing
  - 3D/2D graphics
  - CPU cores
- Each of these may have its own caches, memory compression engines, or other
  specialized memory access quirks
- When buffers are shared between them, engine-specific state transitions may
  be needed
  - These may be costly operations, and difficult to perform just-in-time
- The simplest solution is for user space to request them explicitly
  - Might as well do explicit synchronization in the same code path

IMPLICIT SYNC EXAMPLE

[Diagram: three command channels, Channel 1, Channel 2, Channel 3]

IMPLICIT SYNC EXAMPLE

[Diagram: work submitted on Channel 1]

    nouveau_pushbuf_kick(push, chan);

What the kernel does on each submit:

    for (each buffer in working set)
        acquire ww mutex
    for (each buffer in working set)
        program wait fence cmd
    submit work
    for (each buffer in working set) {
        store fence
        release ww mutex
    }

Signature:

    nouveau_pushbuf_kick(
        struct nouveau_pushbuf *push,
        struct nouveau_object *chan)

IMPLICIT SYNC EXAMPLE

[Diagram: Channel 2 stalls, waiting on Channel 1's fence]

    nouveau_pushbuf_kick(push, chan);

    // push2 has no dependencies, but the kernel enforces a wait
    nouveau_pushbuf_kick(push2, chan2);

IMPLICIT SYNC EXAMPLE

[Diagram: Channels 2 and 3 both stall, waiting on Channel 1's fence]

    nouveau_pushbuf_kick(push, chan);

    // push2 has no dependencies, but the kernel enforces a wait
    nouveau_pushbuf_kick(push2, chan2);

    // push3 depends on push only, but user space cannot
    // express that to the kernel
    nouveau_pushbuf_kick(push3, chan3);

EXPLICIT SYNC EXAMPLE

[Diagram: three command channels, Channel 1, Channel 2, Channel 3]

EXPLICIT SYNC EXAMPLE

[Diagram: Channel 1 emits fence 1]

    int fence = -1;
    nouveau_pushbuf_kick_fence(push, chan, -1, &fence);
    // now fence == 1

Signature:

    nouveau_pushbuf_kick_fence(
        struct nouveau_pushbuf *push,
        struct nouveau_object *chan,
        int waitfencefd,
        int *emitfencefd)

EXPLICIT SYNC EXAMPLE

[Diagram: Channel 1 emits fence 1, Channel 2 emits fence 2; neither waits]

    int fence = -1;
    nouveau_pushbuf_kick_fence(push, chan, -1, &fence);
    // now fence == 1

    int fence2 = -1;
    nouveau_pushbuf_kick_fence(push2, chan2, -1, &fence2);
    // now fence2 == 2

EXPLICIT SYNC EXAMPLE

[Diagram: Channel 3 waits on fence 1 only; Channel 2 runs unimpeded]

    int fence = -1;
    nouveau_pushbuf_kick_fence(push, chan, -1, &fence);
    // now fence == 1

    int fence2 = -1;
    nouveau_pushbuf_kick_fence(push2, chan2, -1, &fence2);
    // now fence2 == 2

    // the last operation depends on 1 only
    nouveau_pushbuf_kick_fence(push3, chan3, fence, NULL);

EXPLICIT SYNC EXAMPLE

[Diagram: Channel 3 waits on the merged fence 1 + 2]

    int fence = -1;
    nouveau_pushbuf_kick_fence(push, chan, -1, &fence);
    // now fence == 1

    int fence2 = -1;
    nouveau_pushbuf_kick_fence(push2, chan2, -1, &fence2);
    // now fence2 == 2

    // the last operation depends on 1 and 2
    int merged = sync_merge(fence, fence2);
    nouveau_pushbuf_kick_fence(push3, chan3, merged, NULL);

RESIDENCY AND PINNING

- When we need to swap out or unmap a buffer, we need to wait until it is no
  longer accessed by hardware
- This is not the perf-critical case, so we can be conservative in order to
  optimize the critical path
- For example, on Nouveau:
  - Store one fence to the channel VM at each submit
  - Use that fence when evicting or unmapping buffers
  - No need to lock / update fences for every buffer individually at submit?
- All of this is driver-specific logic, not common DRM

PATH FROM IMPLICIT SYNC TO EXPLICIT SYNC

- No need to disrupt the existing model
  - If a particular device is happy with implicit sync, it can keep using it
- Allow kernel and user space drivers that prefer explicit sync to opt in:
  - Allow user space to handle intra-driver synchronization explicitly
  - Allow user space to associate synchronization primitives with buffers for
    backwards compatibility with current APIs and drivers
- Move towards tracking working sets rather than individual buffers for
  object lifetime / work completion / paging purposes

THANKS!

- drivers/staging/android/sync.c
- [RFC] Explicit synchronization for Nouveau (+ RFC patches)
- dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org

Let's discuss more over lunch/dinner!

BACKUP

DEADLOCKS?

- Circular dependencies can be avoided if fences are only generated in the
  kernel when work is submitted
  - This guarantees that user space cannot ask the kernel to wait for a fence
    whose work will be submitted later
- Deadlocks can be avoided if, additionally, all submitted work completes in
  finite time
  - This assumption might fail for implicit fences also
  - Timeout mechanisms

EXPLICIT SYNC VS. ANDROID SYNC FD'S

- Could also be a process-local handle?
- But it should support conversion to and from Android sync fd's