Utilizing the IOMMU scalably

Utilizing the IOMMU Scalably. Omer Peleg, Adam Morrison, Benjamin Serebrin, and Dan Tsafrir. USENIX ATC '15. 2017711456 Shin Seok Ha

1 Introduction
What is an IOMMU? It translates I/O addresses to physical addresses for devices, just as the MMU does for the CPU.

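2 The importance of the IOMMU
[Figure: a device accesses memory through the IOMMU; IO buf 1 and IO buf 2 are reachable, system data is blocked (X).]
[Figure: Device1 and Device2 against memory, with 4GB and 16GB marks.]
- A malicious or buggy device may interfere with memory areas that it should not access.
- With an IOMMU, a device can reach a larger memory range than it can address by itself.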

3 Introduction
Non-scalable IOMMU management
- I/O virtual address (IOVA) allocator
- IOTLB invalidation

4 I/O Virtual Address Allocator
- Centralized IOVA allocator: an rb tree and a list of free IOVA ranges.
- EiovaR: reduces the time spent traversing the tree.
[Figure: Task1, Task2, and Task3 each request an IOVA and attempt to acquire the same lock.]

5 IOTLB Invalidation
[Figure: device, IOMMU, IOTLB, memory, invalidation queue, and an I/O page table mapping IOVA 0x2000 -> HPA 0x00FF and IOVA 0x3000 -> HPA 0x00FE.]
- An invalidation must occur on every unmap operation.
- The invalidation queue requires a global lock.
- The task has to wait for the invalidation to complete.

6 Deferred Invalidation
- Invalidation commands are batched and flushed once 250 have accumulated, or every 10 ms.
- Accessing the global queue still produces a large overhead.
- Security issue: the mapping still exists for a while after unmap.
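The batching rule on this slide can be sketched as a small single-threaded model. This is only an illustration, not kernel code: the class name `DeferredFlushQueue`, its methods, and the injectable `clock` parameter are all assumptions made for the example. Only the limits of 250 entries and 10 ms come from the slide.

```python
import time


class DeferredFlushQueue:
    """Toy model of deferred IOTLB invalidation (hypothetical names, not kernel code)."""

    def __init__(self, batch_limit=250, max_delay_s=0.010, clock=time.monotonic):
        self.batch_limit = batch_limit   # flush once this many unmaps are pending
        self.max_delay_s = max_delay_s   # ... or once the oldest one is this old
        self.clock = clock
        self.pending = []                # unmapped IOVA ranges awaiting invalidation
        self.oldest = None               # time at which the oldest entry was queued
        self.flushes = 0

    def unmap(self, iova_range):
        """Queue an unmapped range; return the ranges released by a flush, if any."""
        if not self.pending:
            self.oldest = self.clock()
        self.pending.append(iova_range)
        if len(self.pending) >= self.batch_limit:
            return self.flush()
        return []

    def tick(self):
        """Periodic timer: flush if the oldest pending entry has waited too long."""
        if self.pending and self.clock() - self.oldest >= self.max_delay_s:
            return self.flush()
        return []

    def flush(self):
        """Stand-in for one batched invalidation; only now may the ranges be reused."""
        released, self.pending = self.pending, []
        self.oldest = None
        self.flushes += 1
        return released
```

Between `unmap` and the next flush the stale translation is still usable by the device, which is the security issue the slide mentions.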

7 [Figure: cycle breakdown and throughput of a 16-core parallel netperf RR workload (270 instances).]
- Invalidation overhead masks allocation overhead.
- The long invalidation period naturally throttles map/unmap operations.
- Both kinds of overhead must be addressed together.

8 Suggested solutions
Scalable IOVA allocation
- Dynamic Identity Mapping
- IOVA-kmalloc
- Magazines
Per-core flush queue

9 Dynamic Identity Mapping
[Figure: device and IOMMU; table with IOVA 0x2000, 0x3000 and HPA 0x0020, 0x0030; physical address.]
- 1-to-1 relation between HPA and IOVA.
- No need to access the rb tree.
- Applies to physically contiguous buffers.
- Eliminates the work and the locks used to manage a distinct space of IOVAs.

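10 (Dynamic Identity Mapping, continued)
- When each mapping is associated with a distinct IOVA, several mappings in the same page can be linked to different page table entries.
- With identity mapping, mappings in the same page share one entry.
[Figure: page 0x3000 and its page table entry 0x3000 with reference count 3.]
- A reference count must be recorded to decide when to remove the entry -> atomic operations required.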

11 Dynamic Identity Mapping - Conflicting Access Permissions
[Figure: a 48-bit IOVA holds a 46-bit PA plus spare bits; the spare bits encode the access rights (read, write, read and write, etc.).]
- When a mapping fails with this method, go back to the traditional method.
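A minimal sketch of the spare-bit idea, assuming the 46-bit physical address sits in the low bits of the 48-bit IOVA and the two spare high bits carry the access rights. The constant names and the numeric values chosen for each permission are hypothetical; the slide gives only the bit widths.

```python
PA_BITS = 46      # physical address width, from the slide
IOVA_BITS = 48    # IOVA width, from the slide

# Hypothetical encoding of the access rights in the two spare bits.
PERM_READ, PERM_WRITE, PERM_RW = 1, 2, 3


def identity_iova(phys_addr, perm):
    """Build an IOVA whose low 46 bits are the physical address itself."""
    if not 0 <= phys_addr < (1 << PA_BITS):
        raise ValueError("physical address does not fit in 46 bits")
    if perm not in (PERM_READ, PERM_WRITE, PERM_RW):
        raise ValueError("unknown permission")
    return (perm << PA_BITS) | phys_addr


def split_iova(iova):
    """Recover (physical address, permission) from an identity IOVA."""
    return iova & ((1 << PA_BITS) - 1), iova >> PA_BITS
```

Under this assumption a read mapping and a write mapping of the same physical page get different IOVAs, so they land in different page table entries and no entry ever needs its permission changed.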

12 Dynamic Identity Mapping - When to go back to the old way?
- Non-contiguous buffers
- PTE reference count overflow (10 reference bits)
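The reference counting and the two fallback conditions can be illustrated with a single-threaded model. The class `IdentityMapper` and its methods are hypothetical names, and a dictionary stands in for the count stored in each page table entry. A real implementation would update the count with atomic operations, which this sketch does not attempt, and access permissions are left out.

```python
PAGE_SHIFT = 12                   # 4096-byte pages
REF_BITS = 10                     # reference bits per PTE, from the slide
REF_MAX = (1 << REF_BITS) - 1


class IdentityMapper:
    """Toy model of dynamic identity mapping with per-page reference counts."""

    def __init__(self):
        self.refcount = {}        # page frame number -> number of live mappings

    @staticmethod
    def _pfns(addr, length):
        first = addr >> PAGE_SHIFT
        last = (addr + length - 1) >> PAGE_SHIFT
        return range(first, last + 1)

    def map(self, phys_addr, length, contiguous=True):
        """Return the IOVA (equal to phys_addr), or None if the caller must fall back."""
        if not contiguous:
            return None           # non-contiguous buffer: use the old allocator
        pfns = self._pfns(phys_addr, length)
        if any(self.refcount.get(p, 0) >= REF_MAX for p in pfns):
            return None           # the count would overflow: use the old allocator
        for p in pfns:
            self.refcount[p] = self.refcount.get(p, 0) + 1
        return phys_addr

    def unmap(self, iova, length):
        """Drop one reference per page; return the PFNs whose entry can be removed."""
        removable = []
        for p in self._pfns(iova, length):
            self.refcount[p] -= 1
            if self.refcount[p] == 0:
                del self.refcount[p]
                removable.append(p)
        return removable
```

Two buffers inside page 0x3000 share one entry here, and the entry becomes removable only when the second buffer is unmapped.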

13 IOVA-kmalloc
- Obtain an IOVA range from the kmalloc allocator, and use the address of the allocated block as the IOVA.
- Easy to implement, and kmalloc is already well optimized.
- Unique address per allocation -> no need to worry about conflicts between IOMMU pages.

14 [Figure: kmalloc of R/4096 bytes of memory; the 36-bit PA of the block is used as the 36-bit PFN of the IOVA, followed by 12 bits of page offset.]
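The address arithmetic on this slide can be written out as follows, assuming 4096-byte pages and rounding R/4096 up to a whole number of bytes. `BumpAllocator` is a hypothetical stand-in for kmalloc, used only because it returns unique, non-overlapping blocks; the function names are likewise invented for the example.

```python
PAGE_SIZE = 4096
PAGE_SHIFT = 12


def kmalloc_size_for(range_bytes):
    """Bytes to request so that each byte of the block stands for one IOVA page."""
    return -(-range_bytes // PAGE_SIZE)      # ceil(R / 4096)


def iova_from_block(block_addr):
    """Use the block address as the page frame number of the IOVA."""
    return block_addr << PAGE_SHIFT


class BumpAllocator:
    """Hypothetical stand-in for kmalloc: unique, non-overlapping blocks."""

    def __init__(self, base=0x1000):
        self.next = base

    def alloc(self, size):
        addr, self.next = self.next, self.next + size
        return addr
```

Because two live blocks never overlap in bytes, the IOVA ranges derived from them never overlap in pages, which is why the allocator needs no IOVA bookkeeping of its own.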

15 IOVA-kmalloc - Obstacles
- Pages are not packed as densely as in the old way, so more page tables are required.
- Linux does not reclaim pages used as page tables -> possible memory blowup.
- Collisions might occur between two different IOVAs -> use the GFP_DMA flag to limit the area.

16 Magazine allocator
- Per-core cache of previously deallocated IOVA ranges.
- The global allocator is not invoked until the per-core cache has used up its capacity.
[Figure: Core1 and Core2 each hold IOVAs; a shared depot holds magazines of M elements.]
- A core's miss rate is bounded by 1/M.
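A single-threaded sketch of a magazine layer, assuming the classic arrangement of two magazines per core (a loaded one and a previous one) in front of a shared depot. The class and attribute names are hypothetical, and the depot lock that a real implementation needs is omitted.

```python
class MagazineAllocator:
    """Toy per-core magazine cache in front of a global IOVA allocator."""

    def __init__(self, num_cores, magazine_size, global_alloc):
        self.M = magazine_size
        self.loaded = [[] for _ in range(num_cores)]   # magazine in use, per core
        self.prev = [[] for _ in range(num_cores)]     # always either full or empty
        self.depot = []                                # full magazines, shared
        self.global_alloc = global_alloc               # slow path
        self.global_calls = 0

    def alloc(self, core):
        # Loaded magazine empty but previous one full: exchange them.
        if not self.loaded[core] and len(self.prev[core]) == self.M:
            self.loaded[core], self.prev[core] = self.prev[core], self.loaded[core]
        # Both empty: take a full magazine from the depot, if there is one.
        if not self.loaded[core] and self.depot:
            self.prev[core] = self.loaded[core]
            self.loaded[core] = self.depot.pop()
        if self.loaded[core]:
            return self.loaded[core].pop()
        self.global_calls += 1
        return self.global_alloc()

    def free(self, core, iova):
        # Loaded magazine full but previous one empty: exchange them.
        if len(self.loaded[core]) == self.M and not self.prev[core]:
            self.loaded[core], self.prev[core] = self.prev[core], self.loaded[core]
        # Both full: hand the previous magazine to the depot and start a new one.
        if len(self.loaded[core]) == self.M:
            self.depot.append(self.prev[core])
            self.prev[core] = self.loaded[core]
            self.loaded[core] = []
        self.loaded[core].append(iova)
```

In this sketch freed IOVAs stay in magazines and never go back to the global allocator, and a core touches the depot only after its two local magazines have both been filled or both been drained.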

17 Scalable IOTLB Invalidation
A global flush queue, with its associated lock contention, is overkill.
- There is no dependency between the invalidations of distinct IOVA ranges.
Per-core flush queue requirements: until the mapping of an entire IOVA range is invalidated in the IOTLB,
1. The IOVA range will not be released back to the IOVA allocator.
2. The page tables that were mapping the IOVA range will not be reclaimed.

18 Scalable IOTLB Invalidation
- The lock on each queue is almost always acquired by its owning core only (the exception is a global invalidation).
[Figure: Core1 and Core2 each have a flush queue; both feed the cyclic invalidation queue.]
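The per-core arrangement can be sketched as one queue and one lock per core. All names here (`PerCoreFlushQueues`, `defer`, `flush_all`, the `release` callback) are hypothetical, and the actual IOTLB invalidation is reduced to a comment. The point of the sketch is the ordering: a range is handed back to the allocator only after its batch has been flushed.

```python
import threading


class PerCoreFlushQueues:
    """Toy model of per-core flush queues (not kernel code)."""

    def __init__(self, num_cores, batch_limit, release):
        self.batch_limit = batch_limit
        self.release = release     # callback returning an IOVA range to the allocator
        self.queues = [[] for _ in range(num_cores)]
        self.locks = [threading.Lock() for _ in range(num_cores)]

    def defer(self, core, iova_range):
        """Called on unmap by the owning core; its lock is normally uncontended."""
        with self.locks[core]:
            self.queues[core].append(iova_range)
            if len(self.queues[core]) >= self.batch_limit:
                self._flush_locked(core)

    def _flush_locked(self, core):
        batch, self.queues[core] = self.queues[core], []
        # A real driver would invalidate the IOTLB here and wait for completion.
        for iova_range in batch:
            self.release(iova_range)   # only now is the range reusable

    def flush_all(self):
        """Global invalidation: the one case in which another core takes the locks."""
        for core in range(len(self.queues)):
            with self.locks[core]:
                self._flush_locked(core)
```

A periodic timer calling `flush_all` would bound how long an unmapped range can stay queued on an idle core.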

19 Evaluation
- Rack server: Dell PowerEdge R430 (server and client)
- Processors: Dual 2.4GHz Intel Xeon E5-2630v3 8-core
- Hyperthreading: Disabled
- Memory: 32GB 2133MHz
- NIC: Broadcom NetXtreme II BCM57810 10Gb/s (server), Intel 82599 10Gb/s (client)
- Base OS: Ubuntu 3.17.2 (server), Ubuntu 3.13.0-45 (client)

20 Evaluation
- 15 rings on the server NIC.
- Benchmarks are executed in a round-robin fashion.
- Each benchmark runs once for 10 seconds for each possible number of cores.
- The cycle is run 5 times.

21 Benchmarks
Netperf: a network analysis tool
- 270 instances of the TCP RR test on the client: measures throughput
- As many instances as the number of server cores allowed: measures parallel latency
Memcached: a key-value store service used by web applications
- 16 threads, 256 concurrent requests: measures throughput

22 Results - Throughput
[Figure: throughput plots for netperf and memcached.]
- The suggested designs achieve 90%-95% of the throughput of the No-IOMMU design.
- IOMMU overhead is essentially constant and does not get worse with more concurrency.

23 Results - Latency
- At 16 cores, EiovaR shows a gap due to the lock.
- The suggested designs are on par with No-IOMMU.

24 Results - Memory consumption
- kmalloc: tradeoff between memory and performance
- DIM: no effort to pack pages densely
- Magazines: freed IOVAs are not returned to the global allocator

25 Conclusion
- IOVA allocation and IOTLB invalidation are the two main bottlenecks for scalability.
- Dynamic Identity Mapping is efficient but incurs additional cost to manage the IOMMU page table.
- IOVA-kmalloc is simple to implement and performs well, but suffers from unbounded page table blowup.
- Magazines also show high performance; however, deferring the return of freed IOVAs could be an obstacle.

26 Dynamic Identity Mapping - Conflicting Access Permissions
[Figure: page table entry 0x3000 with reference count 2; Read and Write.]
- The case in which buffers with different permissions reside in the same page must be handled.
- Promoting the permission is prohibited.