Boost Linux Performance with Enhancements from Oracle


Boost Linux Performance with Enhancements from Oracle
Chris Mason, Director of Linux Kernel Engineering

Linux Performance on Large Systems
- Exadata hardware
- How large systems are different
- Finding bottlenecks
- Optimizations in Oracle's Unbreakable Enterprise Kernel

Exadata Hardware: X2-8
- 8 sockets, Intel X7560
- 8 cores per socket, 2 threads per core
- 1TB of RAM
- 8 InfiniBand QDR ports (40Gb/sec each)
- Other assorted slots, ports, and cards

X2-8 NUMA (Non-Uniform Memory Access)
- The X2-8 consists of four blades
  - Each blade has two CPU sockets
  - Each blade has 256GB of RAM
  - Each blade has one or more IB cards
  - Fast interconnect to the other blades
- CPUs access resources on the same blade much faster than resources on remote blades
- NUMA lowers hardware costs but increases the work software must do to optimize the system
- Linux already includes extensive optimizations and frameworks to run well on NUMA systems

Finding Bottlenecks
- Are my CPUs idle?
- Am I waiting on the disk or the network?
- Am I bottlenecked on a single CPU?
- Where is my CPU spending all its time?
  - Application
  - System time (kernel overhead)
  - Softirq processing (kernel overhead)
- mpstat -P ALL 1
  - Gives a per-CPU report of time spent waiting for IO, busy in application or kernel code, handling interrupts, etc.
- Large systems often have a small number of CPUs pegged at 100% while others are mostly idle
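As a quick sketch of where mpstat's numbers come from: mpstat (from the sysstat package) formats the raw per-CPU counters the kernel exports in /proc/stat. The snippet below is an assumption-free fallback that prints a crude busy-vs-idle snapshot from those counters directly:

```shell
# mpstat -P ALL 1 gives a rolling per-CPU breakdown (sysstat package).
# The same raw jiffie counters are in /proc/stat; a one-shot snapshot:
grep '^cpu[0-9]' /proc/stat | while read cpu user nice system idle rest; do
  echo "$cpu busy=$((user + nice + system)) idle=$idle"
done
```

A CPU whose busy counter grows much faster than its peers is the kind of single pegged CPU the slide describes.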

Finding Bottlenecks: latencytop
- Tracks why each process waits in the kernel
- Can quickly determine whether you're waiting on disk, network, kernel locks, or anything else that sleeps
- GUI mode to select a specific process
- latencytop -c mode collects information on each process over a long period of time

Finding Bottlenecks: perf
- When the system is CPU bound, perf can tell us why
- Profiling can be limited to a single CPU
  - Very useful when only one CPU is saturated
- Profiles can include full backtraces
  - Explains the full call chain that leads to lock contention
- Example usage:
  - perf record -g -C 16 (record profiles on CPU 16 with call traces)
  - perf record -g -a (record profiles on all CPUs)
  - perf report -g (produce a call-graph report from the profile)
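A minimal workflow sketch for the perf commands above, assuming a saturated CPU has already been spotted with mpstat (the CPU number 16 and the sleep duration are placeholders; perf record usually needs root):

```shell
# Profile only the saturated CPU, with call graphs, for 10 seconds:
# perf record -g -C 16 -- sleep 10
# Then inspect the call-graph report; hot lock-contention paths show up
# as heavy branches under spinlock/mutex functions:
# perf report -g
```

The resulting perf.data is written to the current directory; perf report reads it from there by default.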

Optimizing Workloads
- Fast networking and storage IO rates add contention in new areas
- Spread interrupts over CPUs local to the cards
- Push softirq handling out over all the CPUs
- Reduce lock contention in both the kernel and the application
  - Lock contention is much more expensive on NUMA systems
- Use cpusets to dedicate CPUs to specific workloads

Interrupt Processing
- Interrupts process events from the hardware
  - Receiving network packets
  - Disk IO completion
- The Linux irqbalance daemon spreads interrupt processing over CPUs based on load
- irqbalance modifications:
  - Only process IRQs on CPUs local to the card
  - Usually hand-tuned on NUMA systems, but we added code to do this automatically
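The hand tuning mentioned above is done through the kernel's smp_affinity interface. A sketch, where the IRQ number 77 is a placeholder (look it up in /proc/interrupts) and CPUs 16-31 stand in for the blade local to the card:

```shell
# smp_affinity takes a hex CPU bitmap, one bit per CPU.
MASK=$(printf '%x' $(( ((1 << 16) - 1) << 16 )))   # CPUs 16-31 -> ffff0000
# Applying it requires root (and irqbalance left running may overwrite it):
# echo "$MASK" > /proc/irq/77/smp_affinity
```

Pinning an IRQ this way keeps the interrupt handler's cache and memory traffic on the blade that owns the card.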

Softirqs
- Softirqs handle portions of the interrupt processing
  - Waking up processes
  - Copying data from the kernel to application memory (networking receives)
  - Various kernel data structure updates
- Softirqs normally run on the same CPU that received the interrupt, but slightly later
- Spreading interrupt processing across CPUs also spreads the resulting softirq work across CPUs
- Interrupts must be handled on CPUs local to the card for performance, but softirqs can be spread farther away

Spreading Softirqs for Storage
- IO affinity records the CPU that issued an IO
  - When the IO completes, the softirq is sent to the issuing CPU
- Very effective for solid state storage on large systems
- Reduces contention on scheduler locks, because wakeups happen on the same CPU where the process last ran
- Enabled by default in Oracle's Unbreakable Enterprise Kernel
- >2x improvement in SSD IO/s in one OLTP-based test
  - Almost 5x faster after removing driver lock contention
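On mainline kernels a closely related knob is exposed per block device as rq_affinity; this is an assumption about how the feature above surfaces on a stock system, and the device name sda is a placeholder:

```shell
# rq_affinity = 1: complete the IO in the issuing CPU's group (default);
# rq_affinity = 2: force completion onto the exact issuing CPU.
# cat /sys/block/sda/queue/rq_affinity
# echo 2 > /sys/block/sda/queue/rq_affinity    # requires root
```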

Spreading Softirqs for Networking
- Receive Packet Steering (RPS)
  - Spreads softirqs for TCP/IP receives across a mask of CPUs selected by the admin
  - /sys/class/net/XX/queues/rx-N/rps_cpus
    - XX is the network interface
    - N is the queue number (some cards have many)
    - Contains a mask, in the taskset format, of CPUs to use
- Shotgun-style spreading
  - A hash of the network headers picks the CPU
  - Fairly random CPU selection for the softirq
  - Not optimal on the X2-8 due to poor locality
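A sketch of setting the rps_cpus mask described above, with eth0 and queue 0 as placeholder names and CPUs 0-7 as the chosen set:

```shell
# The mask is a hex bitmap, one bit per CPU, same format taskset uses.
MASK=$(printf '%x' $(( (1 << 8) - 1 )))   # CPUs 0-7 -> ff
# Writing it requires root:
# echo "$MASK" > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

On multi-queue cards each rx-N queue gets its own mask, so the receive load can be partitioned across CPU sets per queue.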

Receive Flow Steering
- Second stage of Receive Packet Steering
- /sys/class/net/XX/queues/rx-N/rps_flow_cnt
  - Size of the hash table for recording flows (e.g. 8192)
- As processes wait for packets, the kernel remembers which sockets they are waiting on and which CPU they last used
- When packets come in, the softirq is directed to the CPU where the process last slept
- More directed than Receive Packet Steering alone
- Together with Receive Packet Steering:
  - 50% faster IPoIB results on a two-socket system
  - 100-200% faster on the X2-8
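A configuration sketch for enabling RFS alongside the per-queue flow table above (interface, queue, and sizes are placeholder choices; the per-queue rps_flow_cnt values should sum to roughly the global table size):

```shell
# Global socket-flow table, shared by all interfaces (requires root):
# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
# Per-queue flow hash table, as on the slide:
# echo 8192 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
```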

RDS Improvements
- RDS is one of the main network transports used in Exadata systems
  - Reliable Datagram Sockets, optimized for Oracle use
  - Enables network RDMA operations when used with InfiniBand
- Original X2-8 target: 4x faster than a two-socket system
- Original X2-8 numbers: slightly slower than a two-socket system
- Final X2-8 numbers: 8x faster than the original two-socket numbers

RDS Improvements
- RDS was heavily saturating one or two cores while leaving the rest of the X2-8 idle
- Allocate two MSI IRQs for each RDS connection instead of two for the whole system
  - Spreads interrupts across multiple CPUs
- Reduce lock contention in the RDS code
- Optimize RDMA key management for NUMA
- Reduce wakeups on remote CPUs
- Switch a number of data structures over to RCU (read, copy, update)
  - http://lwn.net/articles/262464/

IPC Semaphores
- Heavily used by Oracle to wake up processes as database transactions commit
- Problematic for years due to high spinlock contention inside the kernel
  - Problematic in almost every Unix as well
- Accounted for 90% of system time during X2-8 database runs
- The new code doesn't register in system profiles (<1% of system time)

Cpusets
- Create simple containers associated with a set of CPUs and memory
- Can break up large systems for a number of smaller workloads
- Example benchmark: high database lock contention on a single row
  - Spreading across all the X2-8 CPUs is much slower than a simple two-socket system
  - Containing the workload to 32 CPUs is slightly faster than a simple two-socket system (5-10%)
- http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
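A sketch of confining a workload to one blade with the cpuset filesystem, in the cgroup-v1 style current when these slides were written (mount point, CPU range, and memory node number are assumptions that depend on the machine's topology; all commands require root):

```shell
# mount -t cpuset none /dev/cpuset
# mkdir /dev/cpuset/blade0
# echo 0-15 > /dev/cpuset/blade0/cpus    # CPUs on blade 0
# echo 0    > /dev/cpuset/blade0/mems    # memory node local to those CPUs
# echo $$   > /dev/cpuset/blade0/tasks   # move this shell (and its children) in
```

Tasks started from that shell inherit the cpuset, so the whole workload stays on one blade's CPUs and memory.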

Optimization Summary
- A long series of optimizations between the 2.6.18 and 2.6.32 kernels
- Many NUMA-targeted improvements
- Focused optimizations for the IO, networking, and IPC stacks
- Extensive profiling with Exadata workloads
- Work is spread effectively across all the CPUs, with less lock contention and system time overhead

Resources
- Linux home page: oracle.com/linux
- Follow us on Twitter: @ORCL_Linux
- Free download, Oracle Linux: edelivery.oracle.com/linux
- Read the Oracle Linux blog: blogs.oracle.com/linux
- Shop online, Oracle Unbreakable Linux support: oracle.com/store

2010 Oracle Corporation