Improving Performance using the Linux IO Scheduler
Shaun de Witt, STFC, ISGC 2016


Role of the Scheduler
- Optimise access to storage: CPU operations take a few processor cycles (each cycle < 1 ns), while a seek operation takes roughly 8 ms or more, about 25 million times longer.
- Reorder IO requests to minimise seek time.
- Two main operations, sorting and merging (sketched below): sorting arranges block reads sequentially; merging combines two adjacent block reads into one.
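To make the two operations concrete, here is a minimal Python sketch (not kernel code) of sorting and merging a batch of pending block requests, each modelled as a (start sector, length) pair; the request values are invented for illustration.

```python
# Minimal sketch (not kernel code) of an IO scheduler's two core operations.
# A request is modelled as (start_sector, n_sectors).

def sort_requests(requests):
    """Sorting: arrange requests by start sector so the head sweeps one way."""
    return sorted(requests, key=lambda r: r[0])

def merge_adjacent(requests):
    """Merging: coalesce requests whose sectors are contiguous into one."""
    merged = []
    for start, length in sort_requests(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_len = merged[-1]
            merged[-1] = (prev_start, prev_len + length)   # extend previous request
        else:
            merged.append((start, length))
    return merged

if __name__ == "__main__":
    pending = [(900, 8), (100, 8), (108, 8), (500, 8)]
    print(merge_adjacent(pending))   # [(100, 16), (500, 8), (900, 8)]
```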

Where the Scheduler Sits
[Linux storage stack diagram by Werner Fischer & Georg Schonberger: https://www.thomas-krenn.com/en/wiki/linux_storage_stack_diagram]

Available Schedulers
The Linux 2.6 kernel provides four IO schedulers (the offered and active schedulers can be listed per device, as sketched below):
- CFQ (default)
- Deadline
- Anticipatory
- Noop
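On a Linux host the schedulers offered for a device can be read from sysfs; the active one is shown in brackets, e.g. "noop deadline [cfq]". A small sketch, assuming a device named sda:

```python
# Sketch: list the IO schedulers the kernel offers for a block device.
# Assumes a Linux host; the device name "sda" is an example.
from pathlib import Path

def available_schedulers(device="sda"):
    words = Path(f"/sys/block/{device}/queue/scheduler").read_text().split()
    active = next(w.strip("[]") for w in words if w.startswith("["))
    return [w.strip("[]") for w in words], active

if __name__ == "__main__":
    offered, active = available_schedulers("sda")
    print("offered:", offered, "active:", active)
```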

Completely Fair Queueing (CFQ)
- Each process issuing IO gets its own queue, and each queue gets its own timeslice.
- The scheduler services the queues round-robin, reading from each queue until its timeslice ends (a conceptual sketch follows).
- Reads are prioritised over writes.
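A minimal conceptual sketch of that round-robin service, not the kernel's CFQ implementation: a "timeslice" is simplified to a fixed number of requests per turn, and the process and request names are invented.

```python
# Conceptual sketch only: round-robin dispatch over per-process queues,
# each granted a fixed "timeslice" (here: a number of requests) per turn.
from collections import deque

def cfq_like_dispatch(queues, timeslice=2):
    """queues: dict mapping process name -> deque of request labels."""
    order = deque(queues)              # round-robin order of processes
    dispatched = []
    while any(queues.values()):
        proc = order[0]
        order.rotate(-1)               # move this process to the back for the next round
        for _ in range(timeslice):     # serve up to one timeslice from this queue
            if queues[proc]:
                dispatched.append((proc, queues[proc].popleft()))
    return dispatched

if __name__ == "__main__":
    pending = {"p1": deque(["r1", "r2", "r3"]), "p2": deque(["w1"])}
    print(cfq_like_dispatch(pending))
    # [('p1', 'r1'), ('p1', 'r2'), ('p2', 'w1'), ('p1', 'r3')]
```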

Deadline
- Three queues: an elevator queue containing all requests sorted by sector, a read-FIFO queue containing reads in time order, and a write-FIFO queue containing writes in time order.
- Each request in a FIFO queue has an expiration time of 500 ms.
- Requests are normally served from the elevator queue, but as requests expire in a FIFO queue, that queue takes precedence (a conceptual sketch follows).
- Gives good average latency but can reduce global throughput.
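A minimal conceptual sketch of that dispatch rule, not the kernel's deadline scheduler (which keeps separate sorted and FIFO structures per direction, with different read and write expiries); the 500 ms expiry follows the talk.

```python
# Conceptual sketch: serve from the sector-sorted elevator unless the oldest
# FIFO request has exceeded its expiry, in which case serve that one first.
import time

class DeadlineLikeQueue:
    def __init__(self, expiry_s=0.5):          # 500 ms, as quoted above
        self.expiry_s = expiry_s
        self.elevator = []                     # (sector, request), kept sorted by sector
        self.fifo = []                         # (enqueue_time, request), arrival order

    def add(self, sector, request):
        self.elevator.append((sector, request))
        self.elevator.sort(key=lambda e: e[0])
        self.fifo.append((time.monotonic(), request))

    def dispatch(self):
        if not self.elevator:
            return None
        if self.fifo and time.monotonic() - self.fifo[0][0] > self.expiry_s:
            _, request = self.fifo.pop(0)      # expired: serve it first to bound latency
            self.elevator = [e for e in self.elevator if e[1] is not request]
        else:
            _, request = self.elevator.pop(0)  # normal case: sweep in sector order
            self.fifo = [f for f in self.fifo if f[1] is not request]
        return request
```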

Anticipatory
- Similar to the deadline scheduler, but with some anticipatory knowledge: an expectation of a subsequent read.
- Read requests are serviced within the deadline, but the scheduler then pauses for 6 ms, waiting for a subsequent read of the next block.
- In general this works well if most reads are dependent (e.g. reading sequential blocks in a streaming operation), but it may be poor for vector reads.

NOOP
- The simplest scheduler: it only does basic merging, NO SORTING.
- Recommended for non-seeking devices (e.g. SSDs).

Hardware Set-up at RAL
- Disk servers configured in RAID6 or RAID60 with 3 partitions; this allows for two independent disk failures without losing data.
- Uses hardware RAID controllers, from various vendors depending on procurement.
- Unusual: horizontal striping, to minimise the RAID storage overhead.

RAID Configuration
[Diagram: a typical RAID6 configuration compared with the RAL RAID6 configuration, laid out across filesystems FS1, FS2 and FS3]

Impact of Using Hardware RAID
- The RAID controller has a fairly big cache, so it is able to write quickly for long periods of time.
- But CFQ now sees only one device (or one per filesystem): the scheduler sorts IO operations assuming a well defined disk layout, while the RAID controller has a different layout and will re-sort them.
- This is not normally a significant overhead, until the IO operations become numerous and effectively random.
- Double sorting and different cache sizes between the kernel and the RAID controller then lead to massive inefficiencies: almost all application time is spent in WIO (waiting on IO).

What made RAL look at this

CMS Problems
- Accessing disk-only storage under LHC Run 2 load, all disk servers (~20) were running at 100% WIO.
- Only certain workflows (pileup) were affected; job efficiencies were running at <30%.
- Other Tier 1 sites also had problems, but fared much better (~60% CPU efficiency); normal CPU efficiencies run at >95%.

Solutions Considered
- Limit the number of transfers to each disk server: jobs time out waiting for storage system scheduling, so a few jobs run well but most fail.
- Limit the number of pileup jobs on the processing farm: there is no reliable mechanism for identifying them.
- Add more disk servers: would work, but has cost implications.

Finally: Try Switching the Scheduler
- It can be done and reverted on the fly (as sketched below) and does not require a reboot.
- It can be applied on a single server to investigate the impact.
- Pre-emptive CAVEAT: don't try this at home! Software RAID and different RAID controllers will behave differently.
- Only do this if either your storage system is already overloaded, or you have tested the impact with your hardware configuration.
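A small sketch of the switch itself, assuming a Linux host, root privileges, and a device named sda: the scheduler is changed by writing its name to the same sysfs file used to list the schedulers, and it takes effect immediately (reverting is just writing the old name back).

```python
# Sketch: switch the IO scheduler for one block device on the fly via sysfs.
# Assumes a Linux host and root privileges; "sda" and "noop" are examples.
from pathlib import Path

def set_scheduler(device, scheduler):
    path = Path(f"/sys/block/{device}/queue/scheduler")
    offered = [w.strip("[]") for w in path.read_text().split()]
    if scheduler not in offered:
        raise ValueError(f"{scheduler!r} not offered for {device}: {offered}")
    path.write_text(scheduler)            # takes effect immediately, no reboot
    return path.read_text().strip()       # e.g. "[noop] deadline cfq"

if __name__ == "__main__":
    print(set_scheduler("sda", "noop"))
```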

Anecdotal Impact
- Switched one server to the NOOP scheduler under production workload.
- WIO reduced from 100% to 65%; throughput increased from 50 Mb/s to 200 Mb/s.

[Monitoring graphs: Anecdotal Impact]

Attempt to Quantify
- Heavy WIO was reproduced by multiple jobs reading/writing whole files (1 GB of random data), with different read/write mixes, using the xrootd protocol.
- 200 different files were used for reading (to try to remove any effect of read caching).
- A simplified local sketch of this style of workload follows.
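The sketch below is a rough local stand-in for that style of test, not the harness actually used: it drives plain file IO on a local path rather than the xrootd protocol, and the read-file pool, scratch directory, and job counts are placeholders to be supplied.

```python
# Rough sketch of a mixed read/write load generator (local file IO, not xrootd).
# read_pool: paths of pre-made 1 GB random files; scratch_dir: where writes go.
import os, random, time
from concurrent.futures import ThreadPoolExecutor

SIZE = 1 << 30          # 1 GiB per written file
CHUNK = 1 << 20         # 1 MiB per IO

def write_job(path):
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(SIZE // CHUNK):
            f.write(os.urandom(CHUNK))
        f.flush()
        os.fsync(f.fileno())              # make sure the data really hits the device
    return ("write", time.monotonic() - start)

def read_job(path):
    start = time.monotonic()
    with open(path, "rb") as f:
        while f.read(CHUNK):
            pass
    return ("read", time.monotonic() - start)

def run_mix(read_pool, scratch_dir, n_jobs=100, read_fraction=0.5):
    """Run n_jobs concurrent jobs; read_fraction controls the R/W mix."""
    def job(i):
        if random.random() < read_fraction:
            return read_job(random.choice(read_pool))
        return write_job(os.path.join(scratch_dir, f"out_{i}.dat"))
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(job, range(n_jobs)))
```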

[Charts: 50-50 Job Mix. READ rates (read time, sec) and WRITE rates (write time, sec) against number of jobs (10 to 200), comparing CFQ and NOOP]

[Charts: NOOP vs. CFQ, 50-50 Job Mix (100 jobs)]

[Charts: Read-Dominated (80-20) Job Mix. READ rates (secs) and WRITE rates (MB/s) against number of jobs (10 to 200), comparing CFQ and NOOP]

[Charts: NOOP vs. CFQ, Read-Dominated Job Mix]

[Charts: Write-Dominated (20-80) Job Mix. READ rates (secs) and WRITE rates (secs) against number of jobs (10 to 200), comparing CFQ and NOOP]

[Charts: NOOP vs. CFQ, Write-Dominated Job Mix]

Summary & Conclusion
- Attempted controlled tests showed no significant benefit of NOOP over CFQ; the ATLAS problems bear this out. Or, to spin the results positively: no detriment was observed from using NOOP over CFQ.
- We are left with the question of what it is about the pileup jobs that has such a bad impact on the disk servers, and why NOOP gives such a big gain there.
- Questions? Because I still have many!