Integrating DMA Into the Generic Device Model

James Bottomley, SteelEye Technology
26 July 2003

In the Beginning
There was Programmed I/O (PIO). The processor coped with all the quirky device timings and spoon-fed the device a byte/word/quad at a time at the correct rate. This was very, very slow and wasted an awful lot of CPU cycles doing timing. But if you have old IDE hardware, you already know this. Then came... DMA.

What Is DMA?
Direct Memory Access (DMA) is simply the ability of an I/O device to read or write memory directly without the intervention of the processor. Fundamentally, all devices that transfer reasonable amounts of data need to use DMA. Because the processor isn't required, it can perform other tasks while the DMA is going on (well, as long as the DMA doesn't lock it off the memory bus). The principal problem is that the kernel thinks in terms of virtual addresses, while DMA must be performed to physical addresses.

DMA Just Works, Doesn't It?
In 2.2 and early 2.4, DMA was controlled simply by two APIs, virt_to_bus() and bus_to_virt(), and it just worked. Everyone was happy. In later 2.4 an entirely new (and complex) DMA Mapping API was introduced, necessitating the conversion of every driver in the kernel.
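
As a rough, minimal sketch of what that 2.2-era pattern looked like (the mydev register block and function are invented purely for illustration, not taken from any real driver):

    #include <linux/types.h>
    #include <asm/io.h>                   /* virt_to_bus(), writel() */

    struct mydev_regs {                   /* hypothetical register block */
            u32 dma_addr;
            u32 dma_len;
            u32 dma_ctrl;
    };

    static void mydev_start_dma(struct mydev_regs __iomem *regs,
                                void *buf, size_t len)
    {
            /* The kernel works in virtual addresses; the device needs a
             * bus address, so convert before programming the engine. */
            writel(virt_to_bus(buf), &regs->dma_addr);
            writel(len, &regs->dma_len);
            writel(1, &regs->dma_ctrl);   /* start the transfer */
    }

No mapping, no cache management, no mask: exactly the simplicity (and the hidden assumptions) described above.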

What is Wrong With the Original Approach?
The basic problem is that the simple approach really only works on x86. Worse, it only really works on x86 with up to 4GB of memory. It only apparently worked for non-x86: a lot of work went on under the covers to give this appearance. In 2.4, the decision was finally made to tackle this properly, hence the DMA Mapping API.

What are the Critical Problems?
- Cache coherency: the coherency of the processor cache may need to be managed by the driver (on most non-x86 architectures).
- IOMMUs: the whole virtual memory problem occurs because the CPU has a Memory Management Unit (MMU) to do address translation. One of these MMUs can also be placed across the I/O bus (as well as on the CPU), which means that I/O drivers now need to manage the mappings for this IOMMU.

Cache Layout
[Diagram: the CPU with its L1 and L2 caches, connected to main memory, which is also connected to the I/O bus.]

Cache Coherency Problems
Any write from I/O to memory may change data that is stored in one of the CPU's caches. If the CPU doesn't watch for this and manage its caches accordingly, it must explicitly be made aware of the changes using cache management instructions:
- invalidate: simply evicts a cache line from the CPU cache.
- writeback: causes the CPU to flush a dirty cache line into memory.
- writeback/invalidate: does both.

Cache Width
The minimum number of bytes the CPU can read/write between memory and the cache is called the cache width. Sometimes the width is different for reads and writes (or even for L1/L2 cache fills and evictions); however, the largest number is the one called the cache width. To PCI people, this is also called the cache line size. This quantity is very CPU (and sometimes bus) specific.

Cache Line Interference
[Timeline diagram: DMA data and unrelated CPU data share a cache line (cache line width 0x10, addresses 0x00-0x20 shown). The CPU writes its data at 0x18, dirtying the cache line; the device then DMAs new data into main memory; finally the CPU flushes the dirty cache line, overwriting part of the newly DMA'd data.]

Dealing with Cache Line Interference
The simplest way to avoid this problem is always to follow the rules:
1. Never, never, ever do DMA onto the stack.
2. kmalloc() is aware of the cache line constraints and, by and large, never allocates memory which violates them.
3. Don't mix DMA and non-DMA data in a structure.
4. If you have to break rule 3, keep the DMA and non-DMA areas separate (and make sure they're both cache-line aligned), as in the sketch below.
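
A minimal sketch of rule 4, assuming a hypothetical command structure (____cacheline_aligned comes from <linux/cache.h>):

    #include <linux/cache.h>
    #include <linux/list.h>
    #include <linux/types.h>

    struct mydev_cmd {                    /* hypothetical */
            /* CPU-only bookkeeping, never touched by the device. */
            struct list_head list;
            unsigned long flags;

            /* Device-visible area: starts on its own cache line, so a
             * writeback of the fields above can never overwrite data
             * the device has just DMA'd into the buffer. */
            u8 dma_buf[512] ____cacheline_aligned;
    };

Allocating the whole structure with kmalloc() keeps the alignment guarantees of rule 2.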

Inducing Coherency
On some architectures, certain areas of memory can be made coherent. Most commonly, the IOMMU can be made to participate in the CPU coherency model, so coherency can sometimes be programmed into it per mapping. Even if it cannot, you may be able to fake it by turning off the CPU's cache per page (slower, but it works). Some CPUs have absolutely no way at all to induce coherency.
Note for PCI people: coherent == consistent, non-coherent == streaming.

IOMMUs
[Diagram: the IOMMU sits between the I/O bus (bus physical addresses) and main memory (memory physical addresses), just as the CPU's MMU sits between CPU virtual addresses and memory physical addresses.]
The IOMMU is a memory mapping unit sitting between the I/O bus and physical memory.

IOMMUs Continued
In order to begin a DMA transaction, the translations between the memory and bus physical addresses must be programmed into the IOMMU. One useful feature of this is that any attempt to DMA to memory ranges outside the programmed translation can be caught (even before it corrupts memory!). At the end of the DMA, the mappings must be removed from the IOMMU (otherwise they'll build up until you run out of mappings).

GARTs
These are Graphical Aperture Remapping Tables, originally designed to give video cards on the AGP bus apparently large contiguous areas of physical memory. A GART acts like a simple IOMMU:
- The aperture may be fixed by the BIOS, or it may be variable.
- It usually occupies a window in memory.
- Any bus use of a physical address in this window may be remapped into a different physical memory location.

Memory Layout for a GART
[Diagram: the graphical aperture is a window within main memory's physical address space; any bus access to an address within the aperture is remapped by the GART to another physical memory location, while the CPU's MMU translates CPU virtual addresses to memory physical addresses as usual.]

The DMA Mask
Every device has a specific range of physical memory it can address. This range may not correspond exactly to the amount of memory available in the system (rendering some memory non-addressable). Old ISA devices can only address up to the first 16MB of memory. Even some modern PCI devices may only be able to address up to the first 4GB of physical memory. This range is encoded in a mask, called the DMA mask.

Using the DMA Mask
The DMA mask has two distinct uses depending on whether an IOMMU is present in the system or not. For non-IOMMU systems, it represents the maximum physical address that may be DMA'd to. Above this address, the data must be copied to/from a lower address to perform DMA (this process is called bouncing). If an IOMMU is present, the device may still DMA to all of physical memory, and the DMA mask is used by the IOMMU to determine what bus physical addresses it should use for the device.
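
A minimal sketch of declaring the mask for a PCI device (hypothetical driver, error handling trimmed); on a non-IOMMU system this sets the limit above which bouncing is needed, while on an IOMMU system it bounds the bus addresses the platform hands out for the device:

    #include <linux/errno.h>
    #include <linux/pci.h>

    static int mydev_set_mask(struct pci_dev *pdev)
    {
            /* This device can only generate 32-bit bus addresses. */
            if (pci_set_dma_mask(pdev, 0xffffffffULL))
                    return -EIO;  /* platform cannot honour the mask */
            return 0;
    }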

Bouncing
Bouncing refers to the task of moving data from device-inaccessible memory to memory the device can DMA to. It is only necessary in non-IOMMU systems. For block devices, the bouncing is done per page as the bio is processed. The network layer has its own bouncing system too. And for good measure, IA-64 has an additional bouncing system coded to look like a fake memory management unit.
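
For block drivers of this era, the bounce limit is typically declared to the block layer rather than handled by hand; a sketch, assuming the 2.4/2.6-style blk_queue_bounce_limit() interface (exact names vary between kernel versions):

    #include <linux/blkdev.h>

    static void mydev_init_queue(struct request_queue *q)   /* hypothetical */
    {
            /* Pages the device cannot reach (here, anything above the
             * ISA 16MB limit) will be copied into low memory before the
             * device is asked to DMA them. */
            blk_queue_bounce_limit(q, BLK_BOUNCE_ISA);
    }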

The PCI DMA API
Introduced as an API to encapsulate the solution to all of the coherency/IOMMU problems. It has:
- a mapping and unmapping API
- a cache synchronisation API
- a coherent (consistent) memory allocation API
To emphasise the PCI nature, the API begins pci_ and always takes a pointer to a struct pci_dev. Sparc also has a completely equivalent sbus_ API which takes a struct sbus_dev.
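
A condensed sketch of the streaming-mapping and consistent-allocation parts of this API (hypothetical driver; real code must check the allocations for failure):

    #include <linux/pci.h>

    #define MYDEV_RING_BYTES 4096         /* hypothetical ring size */

    static void mydev_pci_dma_example(struct pci_dev *pdev,
                                      void *buf, size_t len)
    {
            dma_addr_t bus, ring_bus;
            void *ring;

            /* Coherent ("consistent") memory for long-lived control data. */
            ring = pci_alloc_consistent(pdev, MYDEV_RING_BYTES, &ring_bus);

            /* Streaming mapping wrapped around a single transfer. */
            bus = pci_map_single(pdev, buf, len, PCI_DMA_FROMDEVICE);
            /* ... hand 'bus' to the device, wait for the DMA to finish ... */
            pci_unmap_single(pdev, bus, len, PCI_DMA_FROMDEVICE);

            pci_free_consistent(pdev, MYDEV_RING_BYTES, ring, ring_bus);
    }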

The PCI DMA API: Problems
It only applies to PCI (and SBUS). We have many more bus types that need it (EISA, GSC, MCA, USB, ...). Even if we introduced more <bus> variants, some drivers don't know the underlying bus type (or must drive a large number of buses). This can be corrected by making fake PCI devices for these buses (the callbacks are architecture specific); if you only need to use the IOMMU, NULL works as the struct pci_dev pointer. That doesn't work at all for fully incoherent architectures. There are, actually, several problems with the API itself.

Illustration of a Bus-Rich Architecture
[Diagram of a PA-RISC style system: the CPU and memory sit on the Runway bus, with two U2/Uturn IOMMUs bridging to GSC/10 and GSC/8 buses; below these, the Cujo, Dino, Wax and Lasi bridges lead to PCI64, PCI32, EISA and LASI buses respectively.]

Enter the Generic Device Model
The generic device model requires that a struct device be embedded in every bus-specific device type. The entire bus tree may be built from the parent/child relationships of the generic devices. This provides the ideal framework for building a generic API which simply takes a pointer to the generic device (and leaves any bus-specific issues to the underlying architecture). This approach greatly simplifies the job of driver writers, because the API is the same regardless of the bus.
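
The same streaming operation through the generic-device API, as a sketch: the driver only needs a struct device pointer, so the code is identical on any bus whose architecture support implements the hooks:

    #include <linux/device.h>
    #include <linux/dma-mapping.h>

    static void mydev_generic_dma_example(struct device *dev,
                                          void *buf, size_t len)
    {
            dma_addr_t bus;

            bus = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
            /* ... start the device on 'bus', wait for completion ... */
            dma_unmap_single(dev, bus, len, DMA_FROM_DEVICE);
    }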

Illustration: EISA
EISA is an old bus type, but it is present in several architectures (x86, Alpha, parisc). It is a difficult one, because neither a NULL PCI device nor a fake one provides the answer. However, once the EISA bus type was added (with minimal architecture-specific support), it just worked. Well, OK, the drivers had to be converted as well. A good proof of concept.

Non-Coherent Architectures
The problem is that pci_alloc_consistent() can fail in two ways:
1. Because you're out of coherent memory, so you should retry/abort the operation.
2. Because you're on a non-coherent architecture (the call can never succeed).
Ideally, you'd like to treat consistent memory allocation failure as always fatal and provide a special API to drivers that know they may be used on non-coherent platforms, without having to do if (coherent) switches in the driver.
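
A sketch of the split this leads to, assuming the non-coherent allocation call added by the new API (its signature here follows the documentation of the time and may differ in other kernel versions): an ordinary driver treats dma_alloc_coherent() failure as fatal, while a driver written to cope with non-coherent platforms asks for non-coherent memory explicitly and always uses the cache sync calls, with no if (coherent) switches:

    #include <linux/dma-mapping.h>
    #include <linux/gfp.h>

    #define MYDEV_RING_BYTES 4096         /* hypothetical ring size */

    static void *mydev_get_ring(struct device *dev, dma_addr_t *handle)
    {
            /* May return coherent memory where the platform has it, or
             * plain cached memory elsewhere; either way the driver must
             * bracket accesses with the sync calls. */
            return dma_alloc_noncoherent(dev, MYDEV_RING_BYTES, handle,
                                         GFP_KERNEL);
    }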

Other Extensions in the New DMA API
In addition to the incoherent memory API (which is dangerous), there are two other API additions which are classified as extremely dangerous. And I mean extremely dangerous:
- Cache width API: dma_get_cache_alignment() returns the correct cache width dynamically. (The #define is usually the largest possible value for this.)
- Partial sync API: this allows the partial synchronisation of the memory area (instead of fully synchronising it).
Again, you should never be using either of these unless you really, really know what you're doing.
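
A sketch of both extensions together, assuming the dma_sync_single_range() form of the partial sync call (its exact name and signature have varied); the helper that synchronises only a packet header before the CPU peeks at it is hypothetical:

    #include <linux/kernel.h>             /* ALIGN() */
    #include <linux/dma-mapping.h>

    static size_t mydev_sync_header(struct device *dev, dma_addr_t bus,
                                    size_t hdr_len)
    {
            /* Runtime cache width, rather than the worst-case #define. */
            size_t align = dma_get_cache_alignment();

            /* Round the header up to whole cache lines and synchronise
             * only that much of the mapping. */
            hdr_len = ALIGN(hdr_len, align);
            dma_sync_single_range(dev, bus, 0, hdr_len, DMA_FROM_DEVICE);
            return hdr_len;
    }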

Current Problems with Both APIs
There are really three glaring issues with both the generic device and the PCI-specific memory mapping APIs:
1. The mapping APIs have no failure returns.
2. There's no way to tell what an appropriate DMA mask for the system is.
3. The cache coherency API isn't as clearly usable and implementable as it should be.

Mapping Failures
Originally, in the IOMMU model, usually up to 4GB of memory may be mapped at any one time, so you can never run out of mappings, right? Wrong:
- In super high-end machines, we're already getting into situations where 4GB of data may be in flight at once.
- GARTs may have really small apertures (around 128k). If you're stuck with a GART as your IOMMU, you are almost always going to hit a failure even under normal load.
Adding error returns is now required.

Appropriate DMA mask sizing
Even on a 64-bit machine, you may have less than 4GB of memory. Most adapters that can DMA directly into all 64 bits have more than one descriptor mode; e.g. aic79xx has three descriptor types: 64-bit, 39-bit and 32-bit. Often, the I/O can go slightly faster if you use smaller descriptors, so even on a 64-bit machine with less than 4GB of memory you want to use 32-bit descriptors. Therefore, we need to expand the return of dma_set_mask() so that zero is an error, but otherwise it returns the mask required by the platform (often just the mask required to reach available memory). A sketch of the current probing idiom follows.
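
A sketch of the probing idiom this forces on drivers today (a hypothetical driver with two descriptor formats); note that the 64-bit probe succeeds even on a machine with less than 4GB of memory, which is precisely the information the proposed dma_set_mask() return value would supply directly:

    #include <linux/errno.h>
    #include <linux/pci.h>

    static int mydev_pick_descriptor_format(struct pci_dev *pdev)
    {
            if (!pci_set_dma_mask(pdev, 0xffffffffffffffffULL))
                    return 64;    /* full 64-bit descriptors */
            if (!pci_set_dma_mask(pdev, 0xffffffffULL))
                    return 32;    /* smaller, often faster descriptors */
            return -EIO;          /* no mask the platform will accept */
    }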

Cache Coherency API
The current API has three hint flags: bidirectional, from-device and to-device. However, knowing when to use these in the driver requires knowledge of how cache coherency works. An ownership model would have been much more appropriate: the data is either owned by the CPU or owned by the device (you still flag it with the same directions to give cache hints).

Old Cache API
1. dma_map
2. device writes to memory
3. dma_sync(DMA_FROM_DEVICE)
4. CPU snoops the data and asks for more
5. device writes to memory
6. dma_unmap

Proposed New Cache API
1. dma_map (device owns the memory)
2. device writes to memory
3. dma_sync_to_cpu(DMA_FROM_DEVICE) (CPU owns the memory)
4. CPU snoops the data and asks for more
5. dma_sync_to_device(DMA_FROM_DEVICE) (device owns the memory again)
6. device writes to memory
7. dma_unmap (CPU owns the memory again)
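
A sketch of how a driver would use this ownership model, written with the proposed call names from the slide above; their signatures (device pointer, mapping, length, direction) are assumed purely for illustration:

    #include <linux/dma-mapping.h>

    static void mydev_rx_poll(struct device *dev, dma_addr_t bus, size_t len)
    {
            /* After dma_map the device owns the buffer and writes into it. */

            /* Hand ownership to the CPU before looking at the data. */
            dma_sync_to_cpu(dev, bus, len, DMA_FROM_DEVICE);
            /* ... CPU snoops the data and decides it wants more ... */

            /* Hand ownership back to the device before it writes again. */
            dma_sync_to_device(dev, bus, len, DMA_FROM_DEVICE);
    }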

Conclusions
With the new API, life is definitely much easier for some architectures and buses. The new API definitely does not solve all the problems. However, it does solve some of the corner cases of the previous API. The API will still change to accommodate the three issues listed above.