Boost Linux Performance with Enhancements from Oracle


Boost Linux Performance with Enhancements from Oracle
Chris Mason, Director of Linux Kernel Engineering

Linux Performance on Large Systems
- Exadata hardware
- How large systems are different
- Finding bottlenecks
- Optimizations in Oracle's Unbreakable Enterprise Kernel

Exadata Hardware: X2-8
- 8 sockets, Intel X7560
- 8 cores per socket, 2 threads per core
- 1TB of RAM
- 8 InfiniBand QDR ports (40Gb/sec each)
- Other assorted slots, ports, and cards

X2-8 NUMA (Non-Uniform Memory Access)
- The X2-8 consists of four blades
  - Each blade has two CPU sockets
  - Each blade has 256GB of RAM
  - Each blade has one or more IB cards
  - Fast interconnect to the other blades
- CPUs access resources on the same blade much faster than resources on remote blades
- NUMA lowers hardware costs but increases the work software must do to optimize the system
- Linux already includes extensive optimizations and frameworks to run well on NUMA systems

Finding Bottlenecks
- Are my CPUs idle?
- Am I waiting on the disk or the network?
- Am I bottlenecked on a single CPU?
- Where is my CPU spending all its time?
  - Application
  - System time (kernel overhead)
  - Softirq processing (kernel overhead)
- mpstat -P ALL 1
  - Gives a per-CPU report of time spent waiting for IO, busy in application or kernel code, handling interrupts, etc.
- Large systems often have a small number of CPUs pegged at 100% while others are mostly idle
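As a quick sketch of where mpstat's numbers come from: mpstat (from the sysstat package) formats the raw per-CPU counters the kernel exports in /proc/stat. The snippet below is an assumption-free fallback that prints a crude busy-vs-idle snapshot from those counters directly:

```shell
# mpstat -P ALL 1 gives a rolling per-CPU breakdown (sysstat package).
# The same raw jiffie counters are in /proc/stat; a one-shot snapshot:
grep '^cpu[0-9]' /proc/stat | while read cpu user nice system idle rest; do
  echo "$cpu busy=$((user + nice + system)) idle=$idle"
done
```

A CPU whose busy counter grows much faster than its peers is the kind of single pegged CPU the slide describes.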

Finding Bottlenecks: latencytop
- Tracks why each process waits in the kernel
- Can quickly determine whether you're waiting on disk, network, kernel locks, or anything else that sleeps
- GUI mode to select a specific process
- latencytop -c mode collects information on each process over a long period of time

Finding Bottlenecks: perf
- When the system is CPU bound, perf can tell us why
- Profiling can be limited to a single CPU
  - Very useful when only one CPU is saturated
- Profiles can include full backtraces
  - Explains the full call chain that leads to lock contention
- Example usage:
  - perf record -g -C 16 (record profiles on CPU 16 with call traces)
  - perf record -g -a (record profiles on all CPUs)
  - perf report -g (produce a call-graph report from the profile)
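A minimal workflow sketch for the perf commands above, assuming a saturated CPU has already been spotted with mpstat (the CPU number 16 and the sleep duration are placeholders; perf record usually needs root):

```shell
# Profile only the saturated CPU, with call graphs, for 10 seconds:
# perf record -g -C 16 -- sleep 10
# Then inspect the call-graph report; hot lock-contention paths show up
# as heavy branches under spinlock/mutex functions:
# perf report -g
```

The resulting perf.data is written to the current directory; perf report reads it from there by default.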

Optimizing Workloads
- Fast networking and storage IO rates add contention in new areas
- Spread interrupts over CPUs local to the cards
- Push softirq handling out over all the CPUs
- Reduce lock contention in both the kernel and the application
  - Lock contention is much more expensive on NUMA systems
- Use cpusets to dedicate CPUs to specific workloads

Interrupt Processing
- Interrupts process events from the hardware
  - Receiving network packets
  - Disk IO completion
- The Linux irqbalance daemon spreads interrupt processing over CPUs based on load
- irqbalance modifications:
  - Only process IRQs on CPUs local to the card
  - Usually hand-tuned on NUMA systems, but we added code to do this automatically
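The hand tuning mentioned above is done through the kernel's smp_affinity interface. A sketch, where the IRQ number 77 is a placeholder (look it up in /proc/interrupts) and CPUs 16-31 stand in for the blade local to the card:

```shell
# smp_affinity takes a hex CPU bitmap, one bit per CPU.
MASK=$(printf '%x' $(( ((1 << 16) - 1) << 16 )))   # CPUs 16-31 -> ffff0000
# Applying it requires root (and irqbalance left running may overwrite it):
# echo "$MASK" > /proc/irq/77/smp_affinity
```

Pinning an IRQ this way keeps the interrupt handler's cache and memory traffic on the blade that owns the card.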

Softirqs
- Softirqs handle portions of the interrupt processing
  - Waking up processes
  - Copying data from the kernel to application memory (networking receives)
  - Various kernel data structure updates
- Softirqs normally run on the same CPU that received the interrupt, but slightly later
- Spreading interrupt processing across CPUs also spreads the resulting softirq work across CPUs
- Interrupts must be handled on CPUs local to the card for performance, but softirqs can be spread farther away

Spreading Softirqs for Storage
- IO affinity records the CPU that issued an IO
  - When the IO completes, the softirq is sent to the issuing CPU
- Very effective for solid state storage on large systems
- Reduces contention on scheduler locks, because wakeups happen on the same CPU where the process last ran
- Enabled by default in Oracle's Unbreakable Enterprise Kernel
- >2x improvement in SSD IO/s in one OLTP-based test
  - Almost 5x faster after removing driver lock contention
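On mainline kernels a closely related knob is exposed per block device as rq_affinity; this is an assumption about how the feature above surfaces on a stock system, and the device name sda is a placeholder:

```shell
# rq_affinity = 1: complete the IO in the issuing CPU's group (default);
# rq_affinity = 2: force completion onto the exact issuing CPU.
# cat /sys/block/sda/queue/rq_affinity
# echo 2 > /sys/block/sda/queue/rq_affinity    # requires root
```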

Spreading Softirqs for Networking
- Receive Packet Steering (RPS)
  - Spreads softirqs for TCP/IP receives across a mask of CPUs selected by the admin
  - /sys/class/net/XX/queues/rx-N/rps_cpus
    - XX is the network interface
    - N is the queue number (some cards have many)
    - Contains a mask, in the taskset format, of CPUs to use
- Shotgun-style spreading
  - A hash of the network headers picks the CPU
  - Fairly random CPU selection for the softirq
  - Not optimal on the X2-8 due to poor locality
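A sketch of setting the rps_cpus mask described above, with eth0 and queue 0 as placeholder names and CPUs 0-7 as the chosen set:

```shell
# The mask is a hex bitmap, one bit per CPU, same format taskset uses.
MASK=$(printf '%x' $(( (1 << 8) - 1 )))   # CPUs 0-7 -> ff
# Writing it requires root:
# echo "$MASK" > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

On multi-queue cards each rx-N queue gets its own mask, so the receive load can be partitioned across CPU sets per queue.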

Receive Flow Steering
- Second stage of Receive Packet Steering
- /sys/class/net/XX/queues/rx-N/rps_flow_cnt
  - Size of the hash table for recording flows (e.g. 8192)
- As processes wait for packets, the kernel remembers which sockets they are waiting on and which CPU they last used
- When packets come in, the softirq is directed to the CPU where the process last slept
- More directed than Receive Packet Steering alone
- Together with Receive Packet Steering:
  - 50% faster IPoIB results on a two-socket system
  - 100-200% faster on the X2-8
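A configuration sketch for enabling RFS alongside the per-queue flow table above (interface, queue, and sizes are placeholder choices; the per-queue rps_flow_cnt values should sum to roughly the global table size):

```shell
# Global socket-flow table, shared by all interfaces (requires root):
# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
# Per-queue flow hash table, as on the slide:
# echo 8192 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
```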

RDS Improvements
- RDS is one of the main network transports used in Exadata systems
  - Reliable Datagram Sockets, optimized for Oracle use
  - Enables network RDMA operations when used with InfiniBand
- Original X2-8 target: 4x faster than a two-socket system
- Original X2-8 numbers: slightly slower than a two-socket system
- Final X2-8 numbers: 8x faster than the original two-socket numbers

RDS Improvements
- RDS was heavily saturating one or two cores while leaving the rest of the X2-8 idle
- Allocate two MSI IRQs for each RDS connection instead of two for the whole system
  - Spreads interrupts across multiple CPUs
- Reduce lock contention in the RDS code
- Optimize RDMA key management for NUMA
- Reduce wakeups on remote CPUs
- Switch a number of data structures over to RCU (read, copy, update)
  - http://lwn.net/articles/262464/

IPC Semaphores
- Heavily used by Oracle to wake up processes as database transactions commit
- Problematic for years due to high spinlock contention inside the kernel
  - Problematic in almost every Unix as well
- Accounted for 90% of system time during X2-8 database runs
- The new code doesn't register in system profiles (<1% of system time)

Cpusets
- Create simple containers associated with a set of CPUs and memory
- Can break up large systems for a number of smaller workloads
- Example benchmark: high database lock contention on a single row
  - Spreading across all the X2-8 CPUs is much slower than a simple two-socket system
  - Containing the workload to 32 CPUs is slightly faster than a simple two-socket system (5-10%)
- http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
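A sketch of confining a workload to one blade with the cpuset filesystem, in the cgroup-v1 style current when these slides were written (mount point, CPU range, and memory node number are assumptions that depend on the machine's topology; all commands require root):

```shell
# mount -t cpuset none /dev/cpuset
# mkdir /dev/cpuset/blade0
# echo 0-15 > /dev/cpuset/blade0/cpus    # CPUs on blade 0
# echo 0    > /dev/cpuset/blade0/mems    # memory node local to those CPUs
# echo $$   > /dev/cpuset/blade0/tasks   # move this shell (and its children) in
```

Tasks started from that shell inherit the cpuset, so the whole workload stays on one blade's CPUs and memory.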

Optimization Summary
- A long series of optimizations between the 2.6.18 and 2.6.32 kernels
- Many NUMA-targeted improvements
- Focused optimizations for the IO, networking, and IPC stacks
- Extensive profiling with Exadata workloads
- Work is spread effectively across all the CPUs, with less lock contention and system time overhead

Resources
- Linux home page: oracle.com/linux
- Follow us on Twitter: @ORCL_Linux
- Free download, Oracle Linux: edelivery.oracle.com/linux
- Read the Oracle Linux blog: blogs.oracle.com/linux
- Shop online, Oracle Unbreakable Linux support: oracle.com/store

2010 Oracle Corporation