Utilizing the IOMMU Scalably
Omer Peleg, Adam Morrison, Benjamin Serebrin, and Dan Tsafrir
USENIX ATC '15
Presented by Shin Seok Ha (2017711456)
1. Introduction
What is an IOMMU? It translates I/O virtual addresses (IOVAs) issued by devices into physical addresses, just as the MMU does for the CPU.
2. The Importance of the IOMMU
[Diagram: devices reach system memory only through IOVA mappings installed in the IOMMU.]
- Protection: a malicious or buggy device cannot interfere with memory it should not access.
- Reach: a device limited to a small address range (e.g., 4GB) can access all of memory (e.g., 16GB) through translation.
3. Introduction: The Non-Scalable IOMMU Design
Two bottlenecks:
- IO virtual address (IOVA) allocation
- IOTLB invalidation
4. IO Virtual Address Allocator
Linux uses a centralized IOVA allocator: a red-black tree of free IOVA ranges protected by a single lock. EiovaR reduces the time spent traversing the tree, but every task requesting an IOVA still contends for the same lock.
[Diagram: Task1, Task2, and Task3 all attempting to acquire the allocator lock.]
5. IOTLB Invalidation
[Diagram: the IOTLB caches IOVA-to-HPA translations (e.g., 0x2000 -> 0x00FF, 0x3000 -> 0x00FE) loaded from the IO page table.]
- An invalidation must occur on every unmap operation.
- The invalidation queue requires a global lock.
- The task must wait for the invalidation to complete.
6. Deferred Invalidation
Linux batches up to 250 invalidation commands and flushes them together, or flushes every 10ms.
- Accessing the global queue still produces large overhead.
- Security issue: the stale mapping remains usable after unmap, until the flush.
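The deferred-invalidation policy above can be sketched as follows. This is an illustrative model, not the kernel's actual code: the structure and names (`flush_queue`, `fq_queue_unmap`) are invented, and the 10ms timer path is only indicated in a comment.

```c
/* Sketch of the deferred-invalidation policy: unmap requests are queued
 * and the IOTLB is flushed only when the queue reaches 250 entries
 * (or, not shown here, when a 10ms timer fires). */
#include <stddef.h>
#include <stdint.h>

#define FLUSH_THRESHOLD 250

struct flush_queue {
    uint64_t iova[FLUSH_THRESHOLD]; /* pending IOVA ranges to invalidate */
    size_t   count;
    unsigned flushes;               /* how many batched flushes occurred */
};

/* In a real driver this would issue one IOTLB flush covering all
 * queued entries, then release the IOVA ranges. */
static void fq_flush(struct flush_queue *fq)
{
    fq->count = 0;
    fq->flushes++;
}

/* Called on every unmap: defer the invalidation instead of waiting. */
void fq_queue_unmap(struct flush_queue *fq, uint64_t iova)
{
    fq->iova[fq->count++] = iova;
    if (fq->count == FLUSH_THRESHOLD)
        fq_flush(fq);   /* batch is full: flush now */
    /* A 10ms timer (not shown) would also call fq_flush() to bound the
     * window in which stale mappings remain usable after unmap. */
}
```

The batching amortizes the cost of the global queue, at the price of the security window noted above.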
7. Invalidation vs. Allocation Overhead
[Figure: cycle breakdown and throughput of a 16-core parallel netperf RR workload (270 instances).]
Invalidation overhead masks allocation overhead: the long invalidation critical section naturally throttles map/unmap operations. Both kinds of overhead must therefore be treated together.
8. Suggested Solutions
Scalable IOVA allocation:
- Dynamic identity mapping
- IOVA-kmalloc
- Magazines
Scalable invalidation:
- Per-core flush queues
9. Dynamic Identity Mapping
[Diagram: the IOMMU maps each IOVA to the identical physical address.]
For a physically contiguous buffer, use its physical address as the IOVA (a 1-to-1 relation between HPA and IOVA). No rb-tree access is needed, eliminating the work and locks used to manage a distinct IOVA space.
10. Dynamic Identity Mapping: Shared Page Table Entries
In the traditional scheme, each mapping gets a distinct IOVA, so several mappings in the same physical page are linked to different page table entries. With identity mapping they share a single entry, so a reference count must be recorded to decide when the entry can be removed; maintaining it requires an atomic operation.
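The shared-entry lifetime rule can be sketched with a C11 atomic reference count. This is a simplification under assumed names (`idm_pte`, `idm_map`, `idm_unmap`); in the actual design the count lives in the PTE's spare bits rather than a separate field.

```c
/* Sketch of the per-PTE reference count used by dynamic identity
 * mapping: several buffers in the same physical page share one page
 * table entry, which may only be torn down when the last mapping of
 * that page is removed. */
#include <stdatomic.h>
#include <stdbool.h>

struct idm_pte {
    atomic_uint refcount;   /* kept in the PTE's spare bits in practice */
};

/* Map: install the translation on first use, otherwise just bump the count. */
void idm_map(struct idm_pte *pte)
{
    if (atomic_fetch_add(&pte->refcount, 1) == 0) {
        /* first mapping of this page: install the translation here */
    }
}

/* Unmap: returns true when this was the last mapping and the
 * entry may actually be removed (and its invalidation queued). */
bool idm_unmap(struct idm_pte *pte)
{
    return atomic_fetch_sub(&pte->refcount, 1) == 1;
}
```

The atomic operations are the extra per-map/unmap cost that identity mapping pays in exchange for skipping the IOVA allocator entirely.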
11. Dynamic Identity Mapping: Conflicting Access Permissions
IOVAs are 48 bits while physical addresses need only 46, leaving spare high bits. These spare bits encode the access permission (read, write, read&write, etc.), so buffers with different permissions get distinct IOVAs. When a mapping cannot be created with this method, fall back to the traditional allocator.
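The spare-bit trick above amounts to one OR and one shift. A minimal sketch, assuming 46-bit physical addresses and invented constant names (`PA_BITS`, `PERM_READ`, `PERM_WRITE`):

```c
/* Sketch: encode the access permission in the spare high bits of the
 * IOVA. Physical addresses fit in 46 bits while IOVAs have 48, so the
 * two spare bits distinguish read, write, and read-write mappings of
 * the same physical page. */
#include <stdint.h>

#define PA_BITS    46
#define PERM_READ  1ULL
#define PERM_WRITE 2ULL

/* Build the identity-mapped IOVA for a physical address + permission. */
static inline uint64_t idm_iova(uint64_t pa, uint64_t perm)
{
    return pa | (perm << PA_BITS);
}
```

Because the permission is part of the IOVA, a read-only and a writable buffer in the same page land on different page table entries, sidestepping the conflict.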
12. Dynamic Identity Mapping: When to Fall Back
- Non-contiguous buffers
- PTE reference count overflow (only 10 bits are available for the count)
13. IOVA-kmalloc
Obtain the IOVA range from the kmalloc allocator and use the address of the allocated block as the IOVA. This is easy to implement, and kmalloc is already well optimized. Every allocation has a unique address, so there is no concern about IOMMU pages conflicting.
14. IOVA-kmalloc: Address Layout
[Diagram: to map a buffer of R bytes, kmalloc R/4096 bytes (one byte per page); the block's 36-bit physical address serves as the IOVA page frame number, with the low bits as the page offset.]
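The one-byte-per-page trick can be sketched in a few lines. This is a user-space model with `malloc` standing in for `kmalloc`, an assumed 4KB page size, and an invented helper name (`iova_kmalloc`):

```c
/* Sketch of IOVA-kmalloc: to map a buffer of R bytes, allocate a
 * "handle" block of R/4096 bytes, one byte per page. The address of
 * each byte in the block becomes the IOVA page frame number of the
 * corresponding page, so distinct allocations can never collide. */
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12   /* 4KB pages */

/* IOVA of the i-th page of a mapping whose handle block is blk. */
static inline uint64_t iova_kmalloc(const char *blk, size_t page_idx)
{
    return ((uint64_t)(uintptr_t)(blk + page_idx)) << PAGE_SHIFT;
}
```

Uniqueness comes for free: two live handle blocks never overlap in memory, so the IOVA page ranges they generate never overlap either.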
15. IOVA-kmalloc: Obstacles
- Pages are not packed as densely as with the old allocator, so more IOMMU page tables are required; since Linux does not reclaim pages used as page tables, this can cause memory blowup.
- Collisions might occur between two different IOVAs; allocating with the GFP_DMA flag limits the address area used.
16. Magazine Allocator
A per-core cache of previously deallocated IOVA ranges. The global allocator is not invoked until a core's cache exhausts its capacity; full and empty magazines are exchanged through a shared depot. With magazines of M elements, a core's miss rate is bounded by 1/M.
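The per-core caching idea can be sketched as a small stack in front of the global allocator. All names are illustrative, the depot exchange is only indicated in a comment, and the global allocator is a stand-in counter rather than the real rb-tree:

```c
/* Sketch of per-core magazine caching for IOVA allocation: each core
 * keeps a small stack (magazine) of recently freed IOVA ranges and
 * falls through to the locked global allocator only on a miss. */
#include <stddef.h>
#include <stdint.h>

#define M 4   /* magazine capacity; miss rate is bounded by 1/M */

struct magazine {
    uint64_t iova[M];
    size_t   count;
    unsigned global_hits;   /* how often the global allocator was used */
};

/* Stand-in for the contended global rb-tree allocator. */
static uint64_t global_alloc(struct magazine *m)
{
    static uint64_t next = 0x100000;
    m->global_hits++;
    return next += 0x1000;
}

uint64_t iova_alloc(struct magazine *m)
{
    if (m->count > 0)
        return m->iova[--m->count];   /* per-core hit: no lock taken */
    return global_alloc(m);           /* miss: take the global lock */
}

void iova_free(struct magazine *m, uint64_t iova)
{
    if (m->count < M)
        m->iova[m->count++] = iova;   /* cache locally for reuse */
    /* else: a full magazine would be exchanged with the depot (not shown) */
}
```

In steady state, allocate/free pairs are absorbed entirely by the per-core magazine, which is why contention on the global lock nearly disappears.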
17. Scalable IOTLB Invalidation
A global flush queue, with its associated lock contention, is overkill: there is no dependency between the invalidation of distinct IOVA ranges. Use a per-core flush queue instead.
Requirements: until an entire IOVA range's mapping is invalidated in the IOTLB,
1. the IOVA range must not be released back to the IOVA allocator, and
2. the page tables that mapped the IOVA range must not be reclaimed.
18. Scalable IOTLB Invalidation: Per-Core Queues
[Diagram: each core owns its own cyclic invalidation (flush) queue.]
The owning core almost always acquires its queue's lock uncontended (the exception is a global invalidation).
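The two safety requirements translate into a simple rule: resources queued on a core's flush queue become recyclable only when the flush completes. A minimal sketch with invented names (`percore_fq`, `fq_unmap`, `fq_flush_done`), where one instance would exist per core:

```c
/* Sketch of a per-core flush queue honoring the two requirements:
 * unmapped IOVA ranges (and their page tables) are recycled only
 * after the IOTLB flush for this core's batch completes. */
#include <stddef.h>
#include <stdint.h>

#define FQ_SIZE 32

struct percore_fq {
    uint64_t pending[FQ_SIZE];  /* unmapped but not yet invalidated */
    size_t   count;
    unsigned released;          /* ranges returned to the IOVA allocator */
};

void fq_unmap(struct percore_fq *fq, uint64_t iova)
{
    /* Requirements 1 and 2: do NOT release the IOVA range or reclaim
     * its page tables yet; just record it on this core's queue. */
    fq->pending[fq->count++] = iova;
}

void fq_flush_done(struct percore_fq *fq)
{
    /* The IOTLB flush covering this batch has completed: it is now
     * safe to recycle every queued range and its page tables. */
    fq->released += (unsigned)fq->count;
    fq->count = 0;
}
```

Since each core touches only its own queue, the lock protecting it is effectively uncontended, matching the claim above.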
19. Evaluation Setup
                 Server                                   Client
Rack server      Dell PowerEdge R430
Processors       Dual 2.4GHz Intel Xeon E5-2630v3 (8-core)
Hyperthreading   Disabled
Memory           32GB 2133MHz
NIC              Broadcom NetXtreme II BCM57810 10Gb/s    Intel 82599 10Gb/s
Base OS          Ubuntu, Linux 3.17.2                     Ubuntu, Linux 3.13.0-45
20. Evaluation Methodology
- 15 rings configured on the server NIC.
- Benchmarks are executed in a round-robin fashion: each benchmark runs once for 10 seconds for each possible number of cores, and the whole cycle is run 5 times.
21. Benchmarks
- netperf (a network measurement tool):
  - 270 instances of the TCP RR test on the client, measuring throughput.
  - As many instances as allowed server cores, measuring parallel latency.
- memcached (a key-value store used by web applications):
  - 16 threads, 256 concurrent requests, measuring throughput.
22. Results: Throughput
[Figures: netperf and memcached throughput vs. number of cores.]
The suggested designs achieve 90%-95% of the no-IOMMU throughput, and the IOMMU overhead is essentially constant: it does not get worse with more concurrency.
23. Results: Latency
At 16 cores, EiovaR shows a latency gap due to lock contention, while the suggested designs are on par with no-IOMMU.
24. Results: Memory Consumption
- IOVA-kmalloc: a tradeoff between memory and performance.
- Dynamic identity mapping: makes no effort to pack pages densely.
- Magazines: freed IOVAs are not returned to the global allocator.
25. Conclusion
IOVA allocation and IOTLB invalidation are the two main scalability bottlenecks.
- Dynamic identity mapping is efficient but adds cost to manage the IOMMU page table.
- IOVA-kmalloc is simple to implement and performs well, but suffers from unbounded page table blowup.
- Magazines also perform well, but deferring the return of freed IOVAs can be a drawback.
26. Backup: Dynamic Identity Mapping, Conflicting Access Permissions
[Diagram: a page table entry for page 0x3000 with reference count 2, requested with both read and write.]
The case where buffers with different permissions reside in the same page must be handled; promoting the permission (e.g., upgrading a read-only mapping to writable) is prohibited.