Keeping up with the hardware


1 Keeping up with the hardware: Challenges in scaling I/O performance
Jonathan Davies, XenServer System Performance Lead, XenServer Engineering, Citrix, Cambridge, UK
18 Aug 2015

2 Outline
1. The virtualisation performance challenge
2. Networking performance
3. Storage performance

3 Outline: The virtualisation performance challenge
1. The virtualisation performance challenge
2. Networking performance
3. Storage performance

4 The virtualisation performance challenge: Recent hardware trends
[Chart: device speed over time on a log scale. NIC speeds climb from 1 Gb/s through 10, 40 and 100 Gb/s; disks move from HDD to SSD to NVMe; CPU speeds stay roughly flat.]

5 The virtualisation performance challenge: Virtualisation overhead is increasing
As I/O devices get faster while CPU speeds remain constant, the relative virtualisation overhead increases.
[Diagram: with old I/O devices, most of the time is spent on the physical device and little on virtualisation overhead; with modern I/O devices, the overhead dominates.]

6 Outline: Networking performance
1. The virtualisation performance challenge
2. Networking performance
3. Storage performance

7 Networking performance: Areas of weak networking performance
Metric: Xen's performance
- Intrahost VM-to-VM throughput: weak
- Intrahost aggregate throughput: weak
- Interhost from-VM transmit throughput: strong
- Interhost into-VM receive throughput: weak
- Interhost aggregate throughput: strong

8 Outline: Networking performance, Improving intrahost single-stream throughput
1. The virtualisation performance challenge
2. Networking performance: Improving intrahost single-stream throughput; Improving intrahost aggregate throughput; Summary
3. Storage performance

9 Networking performance, Improving intrahost single-stream throughput: Where do we stand?
Intrahost VM-to-VM single-stream throughput measurements (using CentOS 7).
[Chart: measured XenServer throughput in Gb/s against a 30 Gb/s target; more is better. Dell R720 (2x Xeon E v2).]

10 Networking performance, Improving intrahost single-stream throughput: It's even worse with an upstream guest kernel!
Intrahost VM-to-VM single-stream throughput measurements (using CentOS 7).
[Chart: XenServer throughput with 4.0-kernel guests, around 9 Gb/s, against the 30 Gb/s target; more is better. Dell R720 (2x Xeon E v2).]

11 Networking performance, Improving intrahost single-stream throughput: Datapath analysis with 4.0 kernel in guests
[Trace plot: TSC timeline of events through the TX and RX datapaths, from the sender's kernel and netfront through netback, the bridge, and the receiver's netback and netfront. Two CentOS 7.0 VMs (4.0.9 kernel) on Dell R720 (2x Xeon E v2).]

12 Networking performance, Improving intrahost single-stream throughput: Datapath analysis with 4.0 kernel in guests (continued)
[Same trace plot as the previous slide. Two CentOS 7.0 VMs (4.0.9 kernel) on Dell R720 (2x Xeon E v2).]

13 Networking performance, Improving intrahost single-stream throughput: Transmitter often stalls; only ever two packets in flight
[Trace plot as before; red boxes mark periods when netfront is not running. Two CentOS 7.0 VMs (4.0.9 kernel) on Dell R720 (2x Xeon E v2).]

14 Networking performance, Improving intrahost single-stream throughput: Principal bottleneck: high TX completion latency
High TX completion latency is a serious problem for guests using 4.x kernels, which aggressively limit the amount of uncompleted data.
Definition: TX completion latency is the time from when the guest generates the skb and puts the request in the TX ring until the response is received in the TX ring, after the request has been consumed by dom0.

15 Networking performance, Improving intrahost single-stream throughput: The transmitter waits for TX completion
[Trace plot as before; the yellow slice marks the point of TX completion. Two CentOS 7.0 VMs (4.0.9 kernel) on Dell R720 (2x Xeon E v2).]

16 Networking performance, Improving intrahost single-stream throughput: Principal bottleneck: high TX completion latency
Idea to reduce TX completion latency:
1. Pretend TX completion happens after netback consumes the request. This can be done using skb_orphan, which decouples freeing from skb accounting (see the sketch below).
Rationale: on physical NIC drivers, TX completion occurs when the packet has hit the wire, not when it has gone into the receiver's queue.
[Timeline: effective TX completion latency runs from the skb being generated by the guest and the request being put in the TX ring to the request being consumed by dom0, where the skb could be orphaned, rather than to the response being received in the TX ring.]
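
As a rough illustration of the idea, not the actual patch discussed in the talk, the snippet below shows how the stock kernel helper skb_orphan() could be used to "pretend" TX completion once the request has been consumed. The helper name and call site are hypothetical; in practice the orphaning would happen in the guest's netfront TX path.

```c
/* Sketch only: illustrates the "pretend TX completion happens early" idea.
 * xennet_pretend_tx_complete() and its call site are hypothetical; the real
 * netfront/netback code is organised differently.  skb_orphan() is the
 * standard kernel helper that runs skb->destructor and detaches the skb
 * from its socket, releasing the sender's socket write-space accounting.
 */
#include <linux/skbuff.h>

static void xennet_pretend_tx_complete(struct sk_buff *skb)
{
	/* Called once the TX request has been consumed by dom0.  Orphaning
	 * here means the guest's TCP stack stops charging this skb against
	 * its "uncompleted data" limits long before the TX response is
	 * written back, mimicking a physical NIC whose completion fires
	 * when the packet hits the wire.
	 */
	skb_orphan(skb);
}
```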

17 Networking performance, Improving intrahost single-stream throughput: Datapath analysis with 3.18 kernel in guests
[Trace plot: TSC timeline of the same datapath events as before. Two CentOS 7.0 VMs (3.18 kernel) on Dell R720 (2x Xeon E v2).]

18 Networking performance, Improving intrahost single-stream throughput: The main problem is still TX completion latency
[Trace plot as before; red boxes mark periods when netfront is not running. Two CentOS 7.0 VMs (3.18 kernel) on Dell R720 (2x Xeon E v2).]

19 Networking performance, Improving intrahost single-stream throughput: Next bottleneck: NAPI CPU utilisation
[Trace plot as before; red boxes mark periods when NAPI is not running. Two CentOS 7.0 VMs (3.18 kernel) on Dell R720 (2x Xeon E v2).]

20 Networking performance, Improving intrahost single-stream throughput: Next bottleneck: NAPI CPU utilisation
After TX completion latency, the next bottleneck is that netback's NAPI thread (softirq context) fully utilises a CPU.
Ideas to reduce NAPI CPU utilisation:
1. Avoid spilling over into a frag-list by copying more. Rationale: it is much more costly to handle an skb with a frag-list, so try to fit the data into a single skb. For intrahost VM-to-VM traffic, around 30% of skbs have a frag-list. (A sketch of this check follows this slide.)
2. Unbatch grant-map. Rationale: historically, batching was best because of the overheads of the hypercall, but recent improvements in grant-map locking mean it is no longer so expensive.
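
A minimal sketch of the single-skb test behind idea 1; the helper name is hypothetical and the real netback logic is considerably more involved.

```c
#include <linux/mm.h>
#include <linux/skbuff.h>

/* Sketch only (hypothetical helper, not the real netback code): a packet
 * of tot_len bytes can be built as a single skb when it fits into the
 * linear area plus at most MAX_SKB_FRAGS page fragments.  The "copy more"
 * idea is to prefer this path, and only chain a second skb onto the
 * frag-list when the payload genuinely cannot fit.
 */
static bool xenvif_fits_single_skb(unsigned int tot_len,
				   unsigned int linear_len)
{
	return tot_len <= linear_len + MAX_SKB_FRAGS * PAGE_SIZE;
}
```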

21 Networking performance, Improving intrahost single-stream throughput: Avoiding frag-lists and unbatching grant-map
[Trace plot as before; red boxes mark periods when NAPI is not running. Two CentOS 7.0 VMs (3.18 kernel) on Dell R720 (2x Xeon E v2).]

22 Networking performance, Improving intrahost single-stream throughput: NAPI CPU utilisation bottleneck
These ideas make the datapath look a lot cleaner, but don't reduce the CPU utilisation noticeably.
Conclusion: further work is required to increase the efficiency of the NAPI thread.

23 Outline: Networking performance, Improving intrahost aggregate throughput
1. The virtualisation performance challenge
2. Networking performance: Improving intrahost single-stream throughput; Improving intrahost aggregate throughput; Summary
3. Storage performance

24 Networking performance, Improving intrahost aggregate throughput: Intrahost aggregate throughput measurements
[Chart: measured XenServer aggregate throughput in Gb/s against a target; more is better. Dell R730 (2x Xeon E v3).]

25 Networking performance, Improving intrahost aggregate throughput: Intrahost aggregate throughput analysis
Intrahost aggregate throughput is typically limited by dom0 CPU utilisation.
Ideas to improve aggregate throughput:
1. Improve grant-map scalability: per-vCPU maptrack free lists (already in Xen 4.6); per-active-entry locking (already in Xen 4.6); avoid TLB flush on unmap (patches proposed by Malcolm Crossley).
2. Provide dom0 with more CPU power.

26 Networking performance, Improving intrahost aggregate throughput: Grant-map locking improvements have really helped
[Chart: aggregate intrahost throughput (Gb/s) for 40 VMs against the number of dom0 vCPUs, before and after the improvements. Dell R730 (2x Xeon E v3).]

27 Outline: Networking performance, Summary
1. The virtualisation performance challenge
2. Networking performance: Improving intrahost single-stream throughput; Improving intrahost aggregate throughput; Summary
3. Storage performance

28 Networking performance: Summary
Bottlenecks with intrahost VM-to-VM throughput (listed in order):
- TX completion latency: potential mitigation using skb_orphan
- NAPI CPU utilisation: prototype showed minimal improvement
Bottlenecks with aggregate intrahost throughput:
- dom0 CPU utilisation: already improved in Xen 4.6
Future work:
- Work to minimise TX completion latency is required to avoid a regression with recent kernels.
- Further optimisations need implementing.

29 Outline: Storage performance
1. The virtualisation performance challenge
2. Networking performance
3. Storage performance

30 Storage performance: Xen is weakest in single-VBD performance
Metric: Xen's performance
- Single-VBD throughput: weak
- Multiple-VBD aggregate throughput: strong
For example, consider 4 KB serial IOPS (a sketch of one way to measure this follows this slide):
[Chart: XenServer 6.5 IOPS against a target; more is better. Debian 6.0 VM on Dell R815 (Opteron 6272), Intel S3700 SSD.]
Deficiencies with single-VBD performance:
1. Latency is too high
2. Not enough data in-flight
3. Backend CPU utilisation is too high
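
For concreteness, here is a minimal user-space sketch of measuring "4 KB serial IOPS" on a block device: single-threaded, queue-depth 1, sequential 4 KB O_DIRECT reads. It is only illustrative; the talk does not say which tool was used, and the device path is a placeholder.

```c
/* Minimal 4 KB serial-read IOPS measurement (illustrative only).
 * Build: gcc -O2 iops.c -o iops
 * Usage: ./iops /dev/xvdb      (device path is an example)
 * Assumes the device is large enough for a 10-second sequential run.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE 4096
#define SECONDS    10

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <block device>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT requires an aligned buffer. */
	void *buf;
	if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE)) {
		perror("posix_memalign");
		return 1;
	}

	struct timespec start, now;
	clock_gettime(CLOCK_MONOTONIC, &start);
	long long ios = 0;
	off_t off = 0;

	do {
		/* Queue depth 1: issue the next read only after the
		 * previous one has completed. */
		if (pread(fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE) {
			perror("pread");
			return 1;
		}
		off += BLOCK_SIZE;
		ios++;
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while (now.tv_sec - start.tv_sec < SECONDS);

	printf("%lld IOPS\n", ios / SECONDS);
	free(buf);
	close(fd);
	return 0;
}
```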

31 Outline: Storage performance, Reduce latency
1. The virtualisation performance challenge
2. Networking performance
3. Storage performance: Reduce latency; Allow more data in-flight; Summary

32 Storage performance, Reduce latency
The problem: latency is too high. This especially impacts serial I/O with small block sizes. XenServer uses tapdisk3, a user-space backend using grant-copy via the gntdev.
Ideas to reduce latency:
1. Polling in the backend. Rationale: event-channel and backend-scheduling latency is too high. (A polling-loop sketch follows this slide.)
2. Use grant-map in the backend. Rationale: in principle, grant-copy should be slower than grant-map.
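
A minimal sketch of what "polling in the backend" could look like in a user-space backend such as tapdisk3: spin on the ring for a bounded window (around 1 ms, as on the next slide) before falling back to sleeping on the event channel. All names here (ring_has_requests, process_ring, wait_on_event_channel) are hypothetical placeholders, not the tapdisk3 API.

```c
/* Sketch of bounded polling in a user-space block backend (hypothetical
 * helpers; tapdisk3's real structure differs). */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define POLL_WINDOW_NS (1 * 1000 * 1000)   /* poll for ~1 ms */

/* Placeholder declarations, assumed to be provided elsewhere. */
bool ring_has_requests(void);
void process_ring(void);
void wait_on_event_channel(void);

static uint64_t now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

void backend_loop(void)
{
	for (;;) {
		uint64_t deadline = now_ns() + POLL_WINDOW_NS;

		/* Poll: pick up new requests with no event-channel or
		 * scheduling latency, at the cost of burning CPU. */
		while (now_ns() < deadline) {
			if (ring_has_requests()) {
				process_ring();
				deadline = now_ns() + POLL_WINDOW_NS;
			}
		}

		/* Nothing arrived within the window: block on the event
		 * channel so we stop eating CPU until the guest kicks us. */
		wait_on_event_channel();
		process_ring();
	}
}
```

The trade-off the slide warns about is visible here: the longer the poll window, the lower the per-request latency but the more CPU the backend burns while idle, which can hurt multi-VBD aggregate throughput.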

33 Storage performance, Reduce latency: Idea 1: Polling in the backend
[Chart: IOPS against block size (KB) for single-threaded sequential reads at queue-depth 1, with polling (1 ms) and without polling. Debian 6.0 VM on Dell R720 (2x Xeon E v2), Micron P320h SSD.]

34 Storage performance, Reduce latency: Idea 1: Polling in the backend
Polling for just 1 millisecond can yield a significant improvement (1). The faster the disk, the bigger the improvement (2).
Conclusion: XenServer will likely adopt polling in tapdisk3, but we need to be careful about eating too much CPU, which can hurt multi-VBD aggregate throughput.
(1) On blkback the improvement may be even larger.
(2) Until the tapdisk3 process fully utilises a CPU even when not polling, which is the next bottleneck.

35 Storage performance, Reduce latency: Idea 2: Grant-map in the backend
[Chart: IOPS against block size (KB) for single-threaded sequential reads at a fixed queue depth, comparing the grant-copy and grant-map backends. Debian 6.0 VM on Dell R720, Intel S3700 SSD.]

36 Storage performance, Reduce latency: Idea 2: Grant-map in the backend
So grant-copy is still faster in practice, despite recent improvements to grant-map locking. This suggests inefficiency issues with the gntdev...?
Conclusion: XenServer will likely retain grant-copy for now.

37 Outline: Storage performance, Allow more data in-flight
1. The virtualisation performance challenge
2. Networking performance
3. Storage performance: Reduce latency; Allow more data in-flight; Summary

38 Storage performance, Allow more data in-flight
The problem: each blkif ring supports 32 slots, each of which can address up to 44 KB, i.e. a total of roughly 1.4 MB (32 x 44 KB = 1408 KB) in flight. Meanwhile, modern disks and arrays can give better throughput when issued with more than this.
Ideas to get more data in-flight:
1. Multi-queue (patches proposed by Bob Liu). Rationale: more than one blkif ring per device.
2. Multi-page ring (patches proposed by Bob Liu). Rationale: a larger blkif ring.
3. Indirect descriptors (available since kernel 3.11). Rationale: the ability to address more data per ring slot.

39 Storage performance, Allow more data in-flight: Idea 1: Multi-queue measurements
[Chart: IOPS against block size (KB) for sequential reads, 8 threads, queue-depth 32, with and without multi-queue. Ubuntu VM using blkback on Dell R720 (2x Xeon E v2), Micron P320h SSD.]

40 Storage performance, Allow more data in-flight: Idea 1: Multi-queue measurements in context
[Chart: the same measurements shown alongside other configurations. IOPS against block size (KB) for sequential reads, 8 threads, queue-depth 32. Ubuntu VM using blkback on Dell R720 (2x Xeon E v2), Micron P320h SSD.]

41 Storage performance, Allow more data in-flight: Idea 1: Multi-queue
Adding multi-queue support hurts performance for small block sizes.
Explanation: the guest does no request merging, and we rely on merging to get good performance on modern disks for sequential I/O.
Conclusion: unless the sequential I/O performance obtained when requests are merged can be retained, XenServer will likely not adopt multi-queue.

42 Storage performance, Allow more data in-flight: Idea 2: Multi-page ring: good for random I/O
[Chart: IOPS against number of threads for random 4 KB reads at queue-depth 4, single-page versus multi-page ring. Ubuntu VM (16 vCPUs) using blkback on Dell R720 (2x Xeon E v2), Micron P320h SSD.]

43 Storage performance, Allow more data in-flight: Idea 2: Multi-page ring: poor for sequential I/O
[Chart: IOPS against block size (KB) for sequential reads, 8 threads, queue-depth 32, single-page versus multi-page ring. Ubuntu VM (4 vCPUs) using blkback on Dell R720 (2x Xeon E v2), Micron P320h SSD.]

44 Storage performance, Allow more data in-flight: Idea 2: Multi-page ring
Improves random I/O throughput by over 50% when the ring would otherwise be full, but reduces sequential I/O throughput for small block sizes and high queue depth.
Explanation: the guest kernel does not merge requests when there is a multi-page ring.
Conclusion: further work is needed to mitigate the effect on request merging. XenServer will likely retain the use of a single-page ring for now.

45 Storage performance, Allow more data in-flight: Idea 3: Indirect descriptors
Background: indirect descriptors have been available in blkfront/blkback since kernel 3.11. They allow up to 1 MB to be addressed per ring slot, meaning the total in-flight data can be 32 MB rather than roughly 1.4 MB. (The arithmetic is sketched below.)
But is this actually a good thing? Most modern disks respond better to smaller requests...
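
The capacity figures follow directly from the blkif ring geometry. The constants below are my reading of the blkif protocol (32 requests per single-page ring, 11 segments of 4 KB per ordinary request, up to 256 segments per indirect request) and should be treated as assumptions rather than values quoted from the talk.

```c
/* Back-of-the-envelope blkif in-flight capacity (illustrative constants). */
#include <stdio.h>

#define RING_SLOTS        32   /* requests per single-page blkif ring      */
#define SEG_SIZE          4096 /* one segment = one 4 KB page              */
#define SEGS_PER_REQ      11   /* ordinary request: 11 segments = 44 KB    */
#define SEGS_PER_INDIRECT 256  /* indirect request: up to 256 segs = 1 MB  */

int main(void)
{
	long direct   = (long)RING_SLOTS * SEGS_PER_REQ * SEG_SIZE;
	long indirect = (long)RING_SLOTS * SEGS_PER_INDIRECT * SEG_SIZE;

	printf("ordinary ring: %ld KB in flight (~%.1f MB)\n",
	       direct / 1024, direct / (1024.0 * 1024.0));
	printf("indirect ring: %ld KB in flight (%ld MB)\n",
	       indirect / 1024, indirect / (1024 * 1024));
	return 0;
}
```

Running this prints 1408 KB (about 1.4 MB) for an ordinary single-page ring and 32 MB with indirect descriptors, matching the figures on the slide.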

46 Storage performance, Allow more data in-flight: Idea 3: Indirect descriptors: is it worthwhile?
[Chart: throughput when reading directly from the physical disk, splitting requests into chunks issued in parallel, against chunk size (KB). Dell R720 (2x Xeon E v2), Micron P320h SSD.]

47 Storage performance, Allow more data in-flight: Idea 3: Indirect descriptors
Conclusion: on modern disks, throughput generally improves by splitting large requests into 44 KB chunks! Allowing bigger requests through can hurt performance. Ideally we need the Linux block layer to know the disk's optimal block size, and to split or merge requests accordingly. Then indirect descriptors would be an improvement, by allowing more data in flight. (A chunk-splitting sketch follows this slide.)
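
To make the "split large requests into chunks issued in parallel" experiment concrete, here is a user-space sketch using POSIX AIO to issue one large read as 44 KB chunks that are in flight simultaneously. It only illustrates the measurement idea; the chunk size, total request size, device path and the choice of POSIX AIO are assumptions, not details from the talk.

```c
/* Split one large read into 44 KB chunks issued in parallel (sketch).
 * Build: gcc -O2 split_read.c -o split_read -lrt
 */
#define _GNU_SOURCE
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK   (44 * 1024)        /* 44 KB, as in the slide            */
#define TOTAL   (1024 * 1024)      /* a 1 MB request split into chunks  */
#define NCHUNKS ((TOTAL + CHUNK - 1) / CHUNK)

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <block device>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	static struct aiocb cbs[NCHUNKS];
	static const struct aiocb *list[NCHUNKS];

	/* Issue every chunk of the large request at once. */
	for (int i = 0; i < NCHUNKS; i++) {
		memset(&cbs[i], 0, sizeof(cbs[i]));
		cbs[i].aio_fildes = fd;
		cbs[i].aio_buf    = malloc(CHUNK);
		cbs[i].aio_nbytes = CHUNK;
		cbs[i].aio_offset = (off_t)i * CHUNK;
		list[i] = &cbs[i];
		if (aio_read(&cbs[i])) {
			perror("aio_read");
			return 1;
		}
	}

	/* Wait for all chunks to complete. */
	for (int i = 0; i < NCHUNKS; i++) {
		while (aio_error(&cbs[i]) == EINPROGRESS)
			aio_suspend(list, NCHUNKS, NULL);
		if (aio_return(&cbs[i]) < 0) {
			perror("chunk read");
			return 1;
		}
	}

	printf("issued %d chunks of %d bytes in parallel\n",
	       (int)NCHUNKS, CHUNK);
	close(fd);
	return 0;
}
```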

48 Outline: Storage performance, Summary
1. The virtualisation performance challenge
2. Networking performance
3. Storage performance: Reduce latency; Allow more data in-flight; Summary

49 Storage performance: Summary
Reduce latency:
- Polling: promising results
- Grant-map: needs more work for a userspace backend
Allow more data in-flight:
- Multi-queue: prevents request merging
- Multi-page ring: prevents request merging
- Indirect descriptors: prevent use of the optimal block size
Future work:
- Improve the performance of the gntdev
- A better strategy for getting more data in-flight whilst ensuring that requests are of optimal size

50 Questions
Questions?

51 Extra slides: There's little benefit from batching nowadays
[Chart: Dell R220 (Xeon E v3).]
