TSM Performance Tuning
Exploiting the full power of modern industry-standard Linux systems with TSM
Stephan Peinkofer, peinkofer@lrz.de
Agenda
- Network Performance
- Disk-Cache Performance
- Tape Performance
- Server Performance
- Lessons Learned
- Additional Resources
Network
The Problem with High-Speed Networks
- Current Ethernet technology (10 Gbit) can transfer up to 1.25 GB/s
- With default settings we cannot even saturate a single Gigabit link
Tuning Network Settings for Gigabit and Beyond
Utilizing (multi-)Gigabit links requires tuning of:
- TCP window size: how much can be sent/received before waiting for an ACK
- Maximum Transfer Unit (MTU): how much can be sent/received per Ethernet frame
TCP Window Size
$> cat /etc/sysctl.conf
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 87380 4194304
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
- Sets a limit of 4 MB for the receive and send window
- The TSM option TCPWINDOWSIZE has to be set to 2 MB on server and client
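A minimal sketch of the matching TSM configuration: TCPWINDOWSIZE takes its value in kilobytes, and the option-file paths shown are standard install defaults, not taken from this talk:

$> grep -i tcpwindowsize /opt/tivoli/tsm/server/bin/dsmserv.opt
TCPWINDOWSIZE 2048
$> grep -i tcpwindowsize /opt/tivoli/tsm/client/ba/bin/dsm.sys
TCPWINDOWSIZE 2048

(2048 KB = 2 MB, matching the 4 MB kernel ceiling set above with room to spare.)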
Maximum Transfer Unit
$> ifconfig ethX mtu XXXX
$> cat /etc/sysctl.conf
net.ipv4.ip_no_pmtu_disc = 0
- Set the MTU to the maximum supported size
- Enable path MTU discovery for communication with non-jumbo-frame hosts
- Only useful if every intermediate system supports jumbo frames
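For example (the interface name and the 9000-byte jumbo-frame MTU are assumptions; check what your NICs and switches actually support):

$> ifconfig eth0 mtu 9000
$> sysctl -w net.ipv4.ip_no_pmtu_disc=0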
Measuring the Success
- iperf was used to benchmark the network performance
- http://dast.nlanr.net/projects/iperf
Measuring the Success
Server:
$> iperf -s -w 1M -f M
Client:
$> iperf -c <server> -t 20 -w 1M -f M
------------------------------------------------------------
Client connecting to <server>, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 3] local <IP> port 36484 connected with <IP> port 5001
[ 3] 0.0-20.0 sec 10665 MBytes 533 MBytes/sec
Measuring the Success
[Chart: influence of TCP window size on a 10 Gbit Ethernet link]
Some Thoughts on Bonding/Trunking
- Great for high availability
- Mostly not suitable for increasing performance
- A single client can utilize a single link only
- Multiple clients balance across the available links only if:
  - clients and server are in the same subnet, or
  - the balancing algorithm uses IP addresses (unlikely)
- We have to keep in mind that:
  - the switch is responsible for balancing incoming traffic
  - the server is responsible for balancing outgoing traffic
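A sketch of how an IP/port-based balancing policy can be requested from the Linux bonding driver; mode=802.3ad and xmit_hash_policy are standard bonding module parameters, but whether your switch and kernel version support them is an assumption to verify:

$> cat /etc/modprobe.conf
alias bond0 bonding
options bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4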
Alternatives to Bonding
- Use the next Ethernet generation
- Balance manually by using multiple IPs
Disk-Storage
Main Factors for Good Disk-Cache Performance
- Stripe size
- Locality of disk accesses
- IO subsystem of the OS
- Number of FC links utilized in parallel
Stripe Size
Rule of thumb:
- Random IO => small stripe size
- Sequential IO => large stripe size
- The TSM disk cache is rather a sequential-IO workload: use a stripe size of 512 KB or larger
- The TSM database is rather a random-IO workload: IBM recommends a stripe size of 256 KB
Locality of Disk Accesses
- How TSM uses disk-cache volumes cannot be influenced
- How the OS lays out the volumes can be influenced
Locality of Disk Accesses
TSM can allocate multiple disk volumes in parallel:
tsm> DEFINE VOLUME /stg/vol1.dsm FORMATSIZE=16G
ANR0984I PROCESS XX for DEFINE VOLUME started...
...
tsm> DEFINE VOLUME /stg/vol4.dsm FORMATSIZE=16G
ANR0984I PROCESS XY for DEFINE VOLUME started...
How the volumes are placed on disk depends on the file system
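A sketch of scripting this from the shell via the TSM admin CLI; dsmadmc is the standard admin client, but the storage-pool name BACKUPPOOL and the credentials are hypothetical placeholders:

$> for i in 1 2 3 4; do
     dsmadmc -id=admin -password=secret \
       "DEFINE VOLUME BACKUPPOOL /stg/vol$i.dsm FORMATSIZE=16G"
   done

Each DEFINE VOLUME starts an asynchronous server process (WAIT=NO is the default), so the four formats run concurrently even though the shell loop is sequential.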
XFS
Allocates disk blocks when the file system buffer is flushed
[Diagram: four parallel writes land in the filesystem cache first; on flush, each volume's blocks are allocated contiguously on disk]
EXT3
Allocates disk blocks when data hits the file system buffer
[Diagram: four parallel writes are allocated in arrival order, so the volumes' blocks end up interleaved on disk]
Comparing EXT3 and XFS
- XFS has no problems with parallel allocation of disk volumes
- XFS has a slight weakness with re-write workloads
- On EXT3, volumes have to be defined one after another
Linux IO-Subsystem
- Linux's IO subsystem is rapidly evolving
- More and more knobs to turn
- More and more complex to tune
Linux IO-Subsystem
Current observation:
- Write performance is OK with default settings
- Read performance must be tuned by setting the read-ahead of the block device:
$> blockdev --setra <sectors> <device>
(the value is given in 512-byte sectors, not bytes)
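For example, to set a 4 MB read-ahead (8192 sectors x 512 bytes; the device name is an assumption, and --getra verifies the setting):

$> blockdev --setra 8192 /dev/sdb
$> blockdev --getra /dev/sdb
8192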
IO Multipathing
- Typically more than one FC link is used for connecting servers to storage, for HA reasons
- The available FC links can be used in parallel to gain optimal performance
- The IO-balancing algorithm depends on the IO-failover driver
- The configuration for exploiting the performance benefit depends on the algorithm
IOMP with QLogic Drivers
- The QLogic driver supports assignment of individual LUNs to a specific FC link
- Performance per LUN is not increased
- Resulting configuration:
  - Use at least 2 LUNs per TSM instance and stripe them with software RAID 0
  - Use multiple TSM instances per server, with dedicated LUNs per instance
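A sketch of the software-RAID-0 stripe over two such LUNs; the device names are assumptions, and the 512 KB chunk size follows the stripe-size rule of thumb from above:

$> mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=512 /dev/sdb /dev/sdc
$> mkfs.xfs /dev/md0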
Measuring the Success
- IOzone was used to benchmark the disk performance
- http://www.iozone.org
Measuring the Success
Write file sequentially:
$> iozone -s 10g -r 512k -t 1 -i 0 -w
Read file sequentially:
$> iozone -s 10g -r 512k -t 1 -i 1 -w
-s 10g : amount to write/read is 10 GB
-r 512k: record size to write/read is 512 KB
-t 1   : write/read 1 file in parallel
-i 0|1 : perform write / perform read
-w     : don't delete files after the benchmark
Comparison of Stripe Size
- IBM FAStT900 with 6 SATA disks in a RAID5 volume
- Workload: single-file sequential read/write
[Chart: benchmark results]
EXT3 Block Allocation
- IBM FAStT900 with 6 SATA disks in a RAID5 volume
- Workload: 12 parallel sequential reads
[Chart: benchmark results]
Comparison of Read-Ahead
- STK FlexLine 380 with 7 FC disks in a RAID5 volume
[Chart: benchmark results]
Tape-Storage
TSM Tape Performance
- No real influence on tape performance
- We barely saw 125 MB/s for more than a few seconds with Titanium drives
- TSM v5.3 on Linux does not seem to be ready for current high-end tapes yet
- Assumption: some buffers are too small
Server
Main Factors of Server Performance
- PCI bus throughput
- Memory bandwidth
- Number of CPU cores
- Performance of a CPU core
PCI Bus Throughput
- Data travels 4 times over the PCI bus (NIC -> memory, memory -> disk cache, disk cache -> memory, memory -> tape) => the PCI bus is the main bottleneck
- PCI-X barely achieves half of its theoretical throughput in typical TSM workloads
- PCI Express performs much better because of its switched topology
- General rule: don't try to save money on the peripheral interconnect
Memory Bandwidth
- As long as DIRECT IO is not used, data travels 4 times through memory
- Database operations rely on memory performance, too
Number of CPU Cores
- TSM is a multi-threaded application
- The more CPU cores are available, the more work can be done in parallel
Lessons Learned
Tuning
- Network:
  - TCP window size: always
  - MTU: if applicable
- Disk:
  - Read-ahead
  - Define cache/DB/log volumes sequentially
Criteria for Next Servers
- Have the fastest peripheral interconnect available
- Have 10 Gbit Ethernet
- Have at least 4 Gbit FC-HBAs
- Have at least 4 CPU cores
- Have upper-class CPU-core performance
Additional Resources
- IBM Tivoli Storage Manager Performance Tuning Guide v5.3
- IBM DS4000 Best Practices and Performance Tuning Guide
Thank You for Your Attention
Any questions?
Contact: peinkofer@lrz.de