Host Solutions Group Technical Bulletin August 30, 2007


iSCSI PERFORMANCE CONSIDERATIONS

Summary

Meeting throughput and response-time requirements in iSCSI SANs requires attention to both component-level and system-level configuration issues. This paper addresses how to optimize iSCSI performance when using QLogic 4000 Series iSCSI HBAs and ISP40xx ASICs in target applications. Three areas are covered: general network design considerations that apply to any Ethernet network; configurable TCP/IP parameters that can be used to address congestion and packet loss; and specific QLogic HBA considerations and configurable parameters that can be used to address performance issues.

Contents

Introduction
Ethernet Network Design
  Network Infrastructure Bandwidth
  Buffering in the Network
  Large MSS
  Ethernet Pause
  Advanced Network Configuration
TCP Protocol Options
  TCP Window Size / Window Scale Option
  TCP Timestamp
  TCP Reno and New-Reno
  TCP SACK
Performance Optimization
  Maximizing Throughput
  Minimizing Response Time
  Surviving the Loss of a Packet
Firmware Configuration Options
TCP Configuration Options
iSCSI Configuration Options
Timer Configuration
HBA (ASIC) Specific Considerations
  QLA4010 HBA (ISP4010 ASIC)
  QLA405x HBA (ISP4022 ASIC)
  QLE406x HBA (ISP4032 ASIC)

TBU07008 - iSCSI Performance Considerations

Introduction

Maximizing iSCSI SAN performance is a complex endeavor. High throughput is expected, as is a short response time for each I/O. Maximum throughput is ultimately a function of the physical media. The 1-Gigabit Ethernet implemented by the QLogic 4000 Series HBAs (and ISP40xx ASICs) typically tops out at a little more than 117 Mbytes per second. Some of the bandwidth is consumed by protocol overhead, and the maximum iSCSI data throughput measures at about 110 Mbytes per second.

Response time is a function of packet loss and network latency. Packet loss necessitates recovery by the TCP protocol. The recovery process is not fast, and by design it can sacrifice performance on the connection in recovery in deference to the health of the network in general. Network latency can be introduced by traversing WAN environments or through network components or end nodes that are overwhelmed with traffic. Network latency is an issue on its own, in addition to complicating recovery from packet loss.

The QLogic 4000 Series HBAs can transmit data at rates that push the limits of even well-designed networks. An environment with multiple iSCSI initiators and multiple iSCSI targets will likely encounter some network congestion and/or packet loss. It is the responsibility of the TCP protocol to respond to network congestion and to recover from packet loss. Recovery from packet loss reduces throughput and lengthens response time for an I/O. Haphazard or poor network design and/or network components can cause packet loss and increase network latency. End nodes that are poorly configured for the conditions expected on the network cannot maintain high throughput or meet response-time requirements.

Ethernet Network Design

This section discusses some options available to the network designer to deal with the problems of packet loss and latency that occur in a congested and/or oversubscribed network.
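The ~117 Mbytes/s figure cited in the introduction can be reproduced from per-frame overheads. The following sketch is back-of-envelope arithmetic, not from the bulletin; the overhead values (standard 1500-byte MTU, IPv4 + TCP headers with the timestamp option, and Ethernet preamble, header, FCS, and inter-frame gap) are common textbook assumptions.

```python
# Back-of-envelope check of the ~117 MB/s figure for 1-Gigabit Ethernet.
# Assumptions (mine, not from the bulletin): 1500-byte MTU, 40 bytes of
# IPv4 + TCP headers plus 12 bytes of TCP options (timestamps), and
# 38 bytes of per-frame Ethernet overhead (8 preamble + 14 header +
# 4 FCS + 12 inter-frame gap).

LINE_RATE = 1_000_000_000 / 8   # 1 Gbit/s expressed in bytes/sec
MTU = 1500
IP_TCP_HEADERS = 40 + 12        # IPv4 + TCP + timestamp option
ETH_OVERHEAD = 38               # preamble + header + FCS + IFG

payload_per_frame = MTU - IP_TCP_HEADERS       # 1448 bytes of TCP payload
wire_bytes_per_frame = MTU + ETH_OVERHEAD      # 1538 bytes on the wire

tcp_goodput = LINE_RATE * payload_per_frame / wire_bytes_per_frame
print(f"{tcp_goodput / 1e6:.1f} MB/s")         # roughly 117.7 MB/s
```

iSCSI PDU headers consume a further slice of that TCP payload, which is consistent with the ~110 Mbytes/s iSCSI data rate the bulletin reports.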
Network Infrastructure Bandwidth

Network designs must take into account the bandwidth requirements of the applications that run over the network. For networks that include iSCSI storage, the infrastructure must account for the bandwidth requirements of all iSCSI initiators and targets. Switches should be evaluated for proper backplane bandwidth. Inter-Switch Links (ISLs) may require multi-gigabit bandwidth, which can be provided either by links with greater than 1-Gigabit capacity or by trunk protocols that aggregate 1-Gigabit links.

Buffering in the Network

Regardless of the bandwidth available in a network infrastructure, a many-to-one problem may exist at the final egress port of the network. For example, a single iSCSI initiator may issue large READ operations to multiple targets, and those targets may simultaneously issue large iSCSI Data-In PDUs in response. If the network can deliver both data streams, those TCP segments may arrive at the initiator port simultaneously. Since it is impossible to deliver two 1-Gigabit streams down a single 1-Gigabit link to the initiator, the packets will either be buffered or dropped. Even without multiple iSCSI streams, packet loss due to oversubscription can occur for any of the following reasons:

- Links that include VLAN tags have lower TCP data throughput than links that don't carry such data. A continuous 1-Gigabit data stream without VLAN tags cannot traverse a 1-Gigabit link that includes VLAN tags without eventually oversubscribing the tagged links.
- Switch and router protocols that issue broadcast or multicast packets (STP, RIP, OSPF, switch notification protocols, and so on) can cause a fully subscribed link to drop traffic.
- Other protocols on the end node share the same 1-Gigabit Ethernet port (ARP, DNS, Microsoft browser protocols; the list is endless).

Some Ethernet switches implement better buffering schemes than others. While it isn't usually possible or practical to configure buffering in a switch, it can help to know a switch's buffer capacity when designing a network for iSCSI storage. Buffers in the switch help the network tolerate short bursts of traffic that could temporarily oversubscribe a link and drop packets. However, if a traffic stream continuously oversubscribes a link, the switch will eventually have to drop traffic.

Large MSS

Large MSS values can improve TCP throughput. If the networking equipment supports jumbo frames, configuring a large Ethernet MTU can improve throughput and improve recovery of lost data, because recovering one dropped frame is faster than recovering a series of dropped frames. This is another case where the network administrator must understand the behavior of the network equipment. For instance, a network switch may drop jumbo frames more readily than 1500-byte frames, resulting in more recovery events. Also, if the switch is a store-and-forward device, enabling jumbo frames will increase network latency. Lastly, enabling jumbo frames may require TCP window scaling to account for the increased round-trip time they cause.

Ethernet Pause

Enabling Ethernet Pause is a big-hammer approach to solving the problem of network congestion and packet loss.
When implemented throughout a network infrastructure, it does solve the packet loss problem. Before enabling Ethernet Pause, the network administrator must understand its effect on the network, know whether all the network components support Ethernet Pause, and know how the various components will react to it.

Advanced Network Configuration

Many Ethernet switch and router manufacturers implement advanced capabilities for analyzing and prioritizing traffic and for allocating and reserving bandwidth in their network devices. Consult your network equipment provider for specific information regarding advanced features.

TCP Protocol Options

The TCP protocol implements many features to avoid congestion and to recover from packet loss. This section highlights the options that are pertinent to these issues. For additional information, the TCP protocols and options are well documented in the TCP/IP Illustrated books, the RFCs, and elsewhere.

TCP Window Size / Window Scale Option

The TCP receive window size of a connection endpoint is advertised in each TCP packet sent on the connection. RFC 1323 defines the TCP Window Scale option, which is negotiated during TCP connection establishment and is used to increase the possible TCP window size beyond the limit imposed by the 16-bit field in the TCP packet header.
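The window size needed to keep a link full is the bandwidth-delay product of the path. As a quick illustrative sketch (my arithmetic, rounded up to whole kilobytes; the helper name is hypothetical):

```python
import math

def min_window_kb(link_mbps, rtt_msec):
    """Bandwidth-delay product in KB, rounded up.

    Mbit/s * ms = kbits in flight; divide by 8 to get kilobytes.
    """
    return math.ceil(link_mbps * rtt_msec / 8)

print(min_window_kb(100, 1))     # 100-Mbit link, 1 ms RTT   -> 13 KB
print(min_window_kb(1000, 1))    # 1-Gigabit link, 1 ms RTT  -> 125 KB
print(min_window_kb(10000, 100)) # 10-Gigabit link, 100 ms   -> 125000 KB
```

These values match the window-size guidance tabulated below.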

Most often, TCP window scaling is used to increase the amount of unacknowledged data allowed outstanding on a connection when latency is introduced into a network. Allowing more data in the pipe utilizes the available bandwidth more effectively. Given a fixed TCP window size, increasing the latency in a network can reduce the throughput possible on a TCP connection. The following table gives a rough guide to the minimum TCP window size required to maintain full throughput on various links as network latency increases.

  RTT (msec)    100-MBit    1-Gigabit    10-Gigabit
  0.5                  7           63           625
  1                   13          125         1,250
  2                   25          250         2,500
  5                   63          625         6,250
  10                 125        1,250        12,500
  20                 250        2,500        25,000
  40                 500        5,000        50,000
  80               1,000       10,000       100,000
  100              1,250       12,500       125,000

Figure 1: TCP Window Size Configuration Guidelines (window sizes in K Bytes)

Larger TCP window values increase the amount of data outstanding, which also increases the amount of data that can be dropped. For this reason, both latency and packet loss should be taken into consideration when determining the TCP window values to use.

TCP Timestamp

The TCP retransmission algorithms work best when the round-trip time estimate is accurate. The TCP Timestamp option enables the TCP protocol to calculate round-trip times more accurately and more frequently, which can help the TCP engine recover from packet loss more efficiently. The TCP Timestamp option is negotiated between both ends of the TCP connection during connection establishment: both sides must agree to use it, or neither side can. Environments that do not enable TCP Timestamp should not expect optimal recovery from packet loss.

TCP Reno and New-Reno

The QLA4010 (ISP4010) implements the Reno algorithms defined in RFC 2581. The QLA405x (ISP4022) and QLE406x (ISP4032) implement the New-Reno algorithm. The TCP New-Reno algorithm, defined in RFC 3782, allows quicker recovery from some packet loss situations.
Either end of the TCP connection can use the New-Reno algorithm independently of the other; it is not negotiated.

TCP SACK

The TCP Selective Acknowledgement (SACK) algorithm, defined in RFC 2018, adds an option to the TCP protocol for reporting multi-segment gaps in a TCP data stream. This allows TCP to fill the gap created by a burst packet loss more quickly than it could without SACK. The TCP SACK option

is negotiated between both ends of the TCP connection. None of the QLogic ISP40xx ASICs implement the TCP SACK option.

Performance Optimization

Maximizing Throughput

It is not always possible to achieve maximum throughput with small I/O operations, nor without multiple simultaneous I/O operations. The following is a checklist for ensuring the maximum possible throughput. Note that this list may conflict with the packet-loss-minimizing checklist.

- Drive packet loss out of your network.
- Ensure the TCP window is large enough to fill the complete round-trip time (RTT).
- Ensure that there are multiple I/O operations outstanding on the connection.
- Ensure that the total size of all outstanding I/O operations is large enough to fill the TCP window.
- If write throughput is important, ensure that immediate data is enabled.
- Try to use large PDU and burst sizes. Remember that sometimes the data outstanding for an I/O is limited by the maximum burst size rather than the total I/O size.
- Ensure that the driver uses the fewest scatter/gather elements to define the I/O.

Minimizing Response Time

The following is a checklist for ensuring the minimum response time. Note that this list may conflict with the packet-loss-minimizing checklist.

- Follow the checklist for ensuring maximum throughput.
- Try to use PDU and burst sizes that are at least as large as the I/O.

Surviving the Loss of a Packet

The first step in working on a packet loss problem is to determine where the packet was dropped. If the packet is dropped within the network infrastructure, you should look for network design solutions to the bandwidth problems in your infrastructure. Sometimes, despite the best efforts of a network infrastructure, packets are still lost. After designing and tuning your network, you may still experience connection drops due to packet loss.
The following checklist may help relieve the problems:

- Ensure the TCP Timestamp option is enabled on iSCSI connections.
- Ensure that the endpoints of the connection are not dropping packets due to a low-performance NIC, PCI bandwidth limitations, or processor speed and capabilities.
- Configure the ISP40xx timers to allow for maximum recovery time.
- Allow the maximum time for each I/O.

If these measures are not successful in limiting connection failures due to packet loss, you may have to resort to measures that could limit the effective bandwidth in networks with non-negligible latency. The following items follow from the theory that the TCP recovery process for small

amounts of lost data will be faster than the recovery process for large amounts of lost data. You may have to resort to the following:

- Try to limit the amount of data that might be dropped by using a smaller TCP window.
- Force the ISP40xx to disable TCP window scaling to limit the TCP send window to (64K - 1) bytes.

If all else fails, these steps should be considered:

- Enable Ethernet Pause to limit packet loss in the network.
- Try to limit drops in the network by using switches with large buffers.
- Consult a network design expert to design a network that limits packet loss.

Firmware Configuration Options

The following configuration options are available to help in tuning for effective performance. The following firmware initialization parameters are configurable through the SANsurfer GUI:

- Immediate Data Enable (default setting = Enabled)
- Device Timeout Enable (controls all I/O timeout processing)
- Max Burst Size
- First Burst Size
- Delayed ACK Enable
- Connection Keep-alive Timeout (NOP Timer)
- TCP Timestamp Option
- Default Timeout (NOP Response Timeout and others)
- Delayed ACK Enable (4010, 406x)
- TCP Window Size
- TCP Configuration (406x only)
- HBA Default Fragment Reassembly Timeout (406x only)

The following DDB configuration parameters are configurable through the SANsurfer GUI:

- Immediate Data Enable (default setting = Enabled)
- Max Burst Size
- First Burst Size
- Delayed ACK Enable (405x only)
- Connection Keep-alive Timeout (NOP Timer)
- TCP Timestamp Option
- Default Timeout (NOP Response Timeout and others)
- TCP Window Scale Factor
- Disable TCP Window Scale Option
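As background for the window-scale parameters above: RFC 1323 scaling shifts the 16-bit advertised window left by the negotiated scale factor, and the shift is capped at 14. The following sketch (a hypothetical helper of mine, for illustration only) picks the smallest factor that covers a target window:

```python
# Hypothetical helper: smallest RFC 1323 window-scale shift that lets the
# 16-bit advertised window (max 0xFFFF) cover a target window size.

def window_scale_for(target_bytes):
    scale = 0
    while (0xFFFF << scale) < target_bytes and scale < 14:  # RFC 1323 caps the shift at 14
        scale += 1
    return scale

print(window_scale_for(64 * 1024 - 1))  # fits unscaled        -> 0
print(window_scale_for(1_250_000))      # ~1.25 MB (1 Gb/s at 10 ms RTT) -> 5
```

A window of (64K - 1) bytes or less needs no scaling, which is why disabling window scaling (as suggested in the packet-loss checklist above) caps the send window at that value.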

TCP Configuration Options

TCP Window Size & Window Scale Option

All 4000 Series HBAs (ISP40xx ASICs): The TCP window size is configured globally on the ISP40xx products. The value configured during the initialization process affects all TCP connections on the ISP40xx. The TCP window scale factor is configured per connection on iSCSI initiator connections. The default receive window scale is 0. By default, the 4000 Series HBA firmware always responds to a TCP SYN containing the Window Scale option with a TCP SYN-ACK containing a Window Scale option, even if the window scale value is 0. This always allows the connection peer to scale its receive window.

QLA4010 (ISP4010): The default receive window size is 16K.

QLA405x (ISP4022) and QLE406x (ISP4032): The default receive window size is 32K.

TCP Timestamp

The 4000 Series HBAs implement the TCP Timestamp option. The firmware enables TCP Timestamp on all connections by default.

Delayed TCP Acknowledgment

The QLA40xx HBAs implement an option to disable delayed TCP acknowledgements. In some cases, disabling delayed TCP acknowledgements can speed TCP recovery times or decrease I/O response time. In many cases, however, it can adversely affect overall network performance, because it floods the network with additional TCP ACK packets.

QLA4010 (ISP4010): The QLA4010 HBA's option to disable delayed TCP acknowledgements is a firmware parameter that disables them for all TCP communication through the HBA.

QLA405x (ISP4022): The QLA405x HBA's option to disable delayed TCP acknowledgements is controlled per target connection. A Delayed TCP Acknowledgement DDB configuration parameter disables delayed TCP acknowledgements for all TCP communication on that target connection.
There is also a Delayed TCP Acknowledgement firmware configuration parameter for the QLA405x HBA; in this case, the firmware parameter defines the default Delayed TCP Acknowledgement value for all new DDBs.

QLE406x (ISP4032): The QLE406x HBA's option to disable delayed TCP acknowledgements is a firmware parameter that disables them for all TCP communication through the HBA.
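The ACK-flooding trade-off mentioned above can be made concrete with a toy model (mine, not firmware behavior): delayed ACK typically acknowledges every second segment, so disabling it roughly doubles the ACK traffic on the reverse path.

```python
# Toy model (my illustration, not ISP40xx behavior): ACKs generated for a
# burst of received segments, with and without delayed acknowledgement.

def ack_packets(segments, delayed_ack=True):
    """ACK count under a simple every-segment vs every-other-segment model."""
    per_ack = 2 if delayed_ack else 1
    return -(-segments // per_ack)   # ceiling division

burst = 1000                         # segments in a transfer
print(ack_packets(burst, True))      # delayed ACK enabled  -> 500 ACKs
print(ack_packets(burst, False))     # delayed ACK disabled -> 1000 ACKs
```

Real stacks also ACK immediately on out-of-order data or timer expiry, so this only sketches the order of magnitude of the extra reverse-path traffic.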

iSCSI Configuration Options

iSCSI PDU and Burst Sizes

The iSCSI PDU and burst sizes don't have much effect on throughput, as long as the PDU size is 64K or larger and the burst size is large enough to contain an I/O.

Timer Configuration

I/O Timers

The 4000 Series HBA firmware times I/O operations. The system driver specifies a timeout value, in seconds, for each I/O operation. Most operating systems pass a timeout value to the driver; some provide a method to specify the I/O timeout period via a system parameter. The driver should consider this value carefully, allowing as much time as possible for recovery from packet loss while still satisfying the requirements of the system. Initiator drivers may have timeout values forced upon them by the system.

NOP Timer

The ISP40xx Session-Mode firmware implements a configurable timer for issuing iSCSI NOP PDUs. The iSCSI NOP timer is a field of the Device Database Entry. This timer should be configured to be at least as long as the I/O timeout value. When using Connection-Mode, the NOP timer is handled in host software. The NOP timer is reset at the completion of an iSCSI command or burst. If an active I/O is in the middle of recovery from a packet loss situation, the NOP timer can expire and cause a NOP to be issued. Since the connection is in the middle of recovery, the NOP or the response from the peer will be queued behind the data to be recovered. In this situation, it is not desirable to time the NOP for a shorter period than the active I/O.

NOP Response Timeout

In addition to the NOP timer determining when to send a NOP, there is a separate timer for timing the response to the NOP. If a response to a NOP is not received within the timeout period, the firmware resets the TCP connection. The NOP Response Timeout for iSCSI Session-Mode initiator devices is configured for each DDB using the Default Timeout field.
We recommend setting the NOP Response Timeout value to be at least as long as the longest I/O timer.

Abort Timer

The ISP40xx firmware implements an Abort Timer for timing Task Management commands issued by the host driver. The default value for the Abort Timer is currently 3 seconds; if this timer is active and expires, the ISP40xx firmware resets the TCP connection. Three seconds is too short in environments that include packet loss. The driver should carefully consider the system requirements for Task Management commands and configure the Abort Timer appropriately.

HBA (ASIC) Specific Considerations

QLA4010 HBA (ISP4010 ASIC)

The ISP4010 may not be able to recover quickly in environments that include frequent packet loss. It is best to avoid losing any amount of data from an ISP4010. QLogic Corporation specifies that all connections to an ISP4010 use the TCP Timestamp option. In addition, Ethernet Pause will not be effective with the QLA4010.

QLA405x HBA (ISP4022 ASIC)

The New-Reno algorithm in the ISP4022 makes it possible to recover from packet loss faster than the Reno algorithm in the ISP4010. At most, the ISP4022 will recover one packet per round-trip time. The maximum allowed response time divided by the network round-trip time therefore determines the maximum number of packets that can be recovered while still meeting the response-time requirements. QLogic Corporation specifies that all connections to an ISP4022 use the TCP Timestamp option for proper recovery from lost packets.

QLE406x HBA (ISP4032 ASIC)

The ISP4032 implements the New-Reno algorithm and, in addition, includes a TCP configuration register that allows some tuning of the TCP recovery algorithm, making it possible for the QLE406x HBAs to recover faster than the QLA405x HBAs (ISP4022 ASIC). Designs based on the ISP4032 ASIC will have the best performance in networks where packet loss is occurring.

The default values for the register are as follows:

- At Reset: 0x0001_314C
- PCI-CRC (T10) Disabled: 0xBC01314C
- PCI-CRC (T10) Enabled: 0xC401314C

The register layout is:

  Bit 31: DEG            Bits 19:16: ACK Freq
  Bit 30: IR             Bits 15:12: DAR Count
  Bit 29: DPR            Bits 11:8:  Init Congestion Win
  Bit 28: MPR            Bits 7:4:   ReTx Warn
  Bit 27: ICW            Bits 3:0:   Max ReTx
  Bits 26:20: CSS, unused

Bit 31, DEG: Disable exponential growth of the retransmission timer if timestamps are disabled and forward progress is being made on the connection.

Bit 30, IR: Immediately Retransmit. If, following a timer-expiration retransmission, an ACK is received that acknowledges new data but doesn't get the receiver caught up, immediately retransmit the missing segment. This bit cannot be used if the MPR bit is set.

Bit 29, DPR: Don't send a previously retransmitted packet until the retransmission timer expires again. When set in conjunction with the MPR bit, retransmit an individual segment only once between retransmission timer expirations. Can only be used if the MPR bit is set.

Bit 28, MPR: Multiple Packet Retransmission.
Use the normal CWND algorithm on retransmissions to determine how many packets may be sent. Retransmissions can occur immediately after previously retransmitted data is ACKed. This bit cannot be used if the IR bit is set, and cannot be set if any connection is operating with PCI-CRC.

Bit 27, ICW: Use the Init Congestion Win value on retransmission timeouts. Can only be used if the MPR bit is set.

Bits 19:16, ACK Freq: Number of segments to receive before sending an ACK. The value programmed into this register is 1 less than the number of segments to receive. Current BSD value = 2; Min = 0, Max = 15.

Bits 15:12, DAR Count: Number of duplicate ACKs to receive before entering Fast Recovery. Current BSD value = 3; Min = 1, Max = 14.

Bits 11:8, Init Congestion Win: Multiples of MSS to set the congestion window to when entering slow start. Note: when a retransmission timeout occurs, the congestion window will be set to 1 MSS. Current BSD value = 1; Min = 1, Max = 15.

The disable-exponential-growth (DEG) bit provides a significant improvement in recovery time when timestamps are off and multiple packets within the send window are repeatedly lost, for example when a network is congested, multiple packets are lost, and the congestion persists during recovery. With this bit set, the ISP4032 still performs exponential back-off when retransmitting an individual packet multiple times, but resets the back-off whenever retransmitted data has been acknowledged. This behavior could be considered at odds with the TCP RFCs and is therefore configurable.

Multiple-packet retransmission (used in conjunction with the DPR bit) allows faster TCP recovery on burst losses. Without this bit set, the ISP4032 retransmits only one packet each time it enters recovery (from a timer pop, a duplicate-ACK retransmission, or a New-Reno retransmission); with it set, the ISP4032 sends up to a window's worth of data during recovery (a window being the minimum of the TCP window and the internally maintained congestion window). The MPR bit is not compatible with PCI-CRC (T10): if any TCP connection is going to send PCI-CRC-protected data, MPR and DPR must not be set; use IR instead.
Since the congestion window is reset to the value in Init Congestion Win whenever the retransmit timer pops, and many TCP stacks delay ACKs to single TCP packets, recovery performance can be improved by setting Init Congestion Win to two or more packets and setting the ICW bit. This can also improve I/O latency when data hasn't been sent for some time and the connection has re-entered TCP slow start, by inducing the peer TCP stack to ACK the initial transmission immediately instead of delaying the ACK.
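For illustration, the documented fields of the ISP4032 TCP configuration register can be unpacked with a short script. This is a sketch of mine, not QLogic software; the flag and field positions follow the bit descriptions above (the ReTx Warn and Max ReTx positions are inferred from the layout, and the CSS/unused region is not decoded).

```python
# Illustrative decoder for the ISP4032 TCP configuration register
# (field positions per the bulletin; not vendor software).

FLAG_BITS = {"DEG": 31, "IR": 30, "DPR": 29, "MPR": 28, "ICW": 27}

def decode_tcp_config(reg):
    fields = {name: (reg >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["ack_freq"] = (reg >> 16) & 0xF    # segments per ACK, minus 1
    fields["dar_count"] = (reg >> 12) & 0xF   # dup ACKs before Fast Recovery
    fields["init_cwnd"] = (reg >> 8) & 0xF    # MSS multiples at slow start
    fields["retx_warn"] = (reg >> 4) & 0xF    # position inferred from the layout
    fields["max_retx"] = reg & 0xF            # position inferred from the layout
    return fields

# Reset default: all flag bits clear; ACK Freq = 1 (ACK every 2 segments),
# DAR Count = 3, Init Congestion Win = 1 -- the BSD-like values listed above.
at_reset = decode_tcp_config(0x0001_314C)
print(at_reset)

# PCI-CRC disabled default: DEG, DPR, MPR and ICW are set (among other bits),
# and DPR is indeed paired with MPR as the bulletin requires.
no_crc = decode_tcp_config(0xBC01314C)
assert no_crc["MPR"] == 1 and no_crc["DPR"] == 1
```

Decoding the defaults this way is a convenient check that a tuned register value still respects the pairing rules (DPR and ICW require MPR; IR and MPR are mutually exclusive).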