Topic: A Deep Dive into Memory Access. Company: Intel Title: Software Engineer Name: Wang, Zhihong



1 Topic: A Deep Dive into Memory Access Company: Intel Title: Software Engineer Name: Wang, Zhihong

2 A Typical NFV Scenario: PVP. What's actually going on? [Diagram labels: Guest, Forwarding Engine, virtio (RX, TX), Shared Memory, vhost (TX, RX), Forwarding Engine, NIC (RX, TX); operations marked: ring ops, memcpy, DDIO]

3 Overview of Memory System. MESIF protocol. [Diagram labels: cores 0, 1, ..., N-1, N, N+1, ..., 2N-1; CPU 0, CPU 1; Memory 0, Memory 1]


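4 Overview of Memory System (cont'd). AVX load to maximize bandwidth. Haswell cache parameters (table columns: Cache level; Capacity (KB); Line size (Bytes); Fastest latency (Cycle); Peak bandwidth (Bytes/cycle)). Values that survive in this transcript: L1D, peak bandwidth 64 (Load) + 32 (Store); line size 64; one entry (level label truncated to "L") with latency ~34 and "Varies"; a row for "L2 and L1D in other cores". [Diagram labels: steps 0 to 6; cores 0, 1, ..., N-1, N, N+1, ..., 2N-1; CPU 0, CPU 1; Memory 0, Memory 1]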

5 Let's Do It! Target for our analysis: where the data flows. [Diagram labels: Guest, Forwarding Engine, virtio (RX, TX), Shared Memory, vhost (TX, RX), Forwarding Engine, NIC (RX, TX); operations marked: ring ops, memcpy, DDIO]

6 First Impression. FWD: RX from NIC, TX to vhost, RX from vhost, TX to NIC. VM FWD: RX from virtio, TX to virtio. [Diagram labels: Guest -> Guest -> Guest; cores N-1, N, N+1, 2N-1; CPU 0, CPU 1; Memory 0, Memory 1]

7 Unexpectedly. Cross-core copies?? First try. Notice: CPU cycle measurement disturbs overall performance. [Chart labels: Guest ->, ->, -> Guest; L1; Guest; core N-1; CPU 0; Memory 0]

8 Under The Hood. Guest updates ring only, doesn't touch the data. Data locality in cache: who operates the data. [Diagram labels: Guest ->, -> Guest; Guest; cores N-1, N, N+1, 2N-1; CPU 0, CPU 1; Memory 0, Memory 1]

9 Guest Read The Packet. FWD: RX from NIC, TX to vhost, RX from vhost, TX to NIC. VM FWD: RX from virtio, read packet, TX to virtio. [Diagram labels: Guest ->, -> Guest; Guest R; cores N-1, N, N+1, 2N-1; CPU 0, CPU 1; Memory 0, Memory 1]

10 Still Doesn't Feel Right. No change?? Scenario: Guest read packet. [Chart labels: Guest ->, -> Guest; Guest R; core N-1; CPU 0; Memory 0]

11 Under The Hood. A cache line can be shared between cores as long as it is not modified. FWD: RX from NIC, TX to vhost, RX from vhost, TX to NIC. VM FWD: RX from virtio, read packet, TX to virtio. [Diagram labels: Guest ->, -> Guest; Guest R; cores N-1, N, N+1, 2N-1; CPU 0, CPU 1; Memory 0, Memory 1]

12 Guest Edit The Packet. FWD: RX from NIC, TX to vhost, RX from vhost, TX to NIC. VM FWD: RX from virtio, edit packet, TX to virtio. [Diagram labels: Guest ->, -> Guest; Guest M; cores N-1, N, N+1, 2N-1; CPU 0, CPU 1; Memory 0, Memory 1]

13 Write-back. No change? Cross-core copies. Scenario: Guest edit packet. [Chart labels: Guest ->, -> Guest; Guest M; core N-1; CPU 0; Memory 0]
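The read-sharing and write-back behavior on slides 11 to 13 can be sketched with a toy two-core model. This is a simplified MESI-style state machine, not the full MESIF protocol: the Forward state and the shared L3 are left out, and the names `struct line`, `core_read`, and `core_write` are hypothetical, chosen only for this illustration.

```c
#include <stddef.h>

#define NCORES 2

/* Simplified per-core cache line states (the F state of MESIF is omitted). */
enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

/* One cache line as seen by each core's private cache. */
struct line {
    enum line_state st[NCORES];
    unsigned writebacks; /* times a MODIFIED copy had to be given up */
};

/* Core c reads the line. */
static void core_read(struct line *l, int c)
{
    int other_has = 0;

    if (l->st[c] != INVALID)
        return; /* hit: no state change */

    for (int o = 0; o < NCORES; o++) {
        if (o == c || l->st[o] == INVALID)
            continue;
        other_has = 1;
        if (l->st[o] == MODIFIED)
            l->writebacks++; /* dirty data must leave the owner first */
        l->st[o] = SHARED;   /* unmodified copies can coexist */
    }
    l->st[c] = other_has ? SHARED : EXCLUSIVE;
}

/* Core c writes the line: all other copies are invalidated. */
static void core_write(struct line *l, int c)
{
    for (int o = 0; o < NCORES; o++) {
        if (o == c)
            continue;
        if (l->st[o] == MODIFIED)
            l->writebacks++;
        l->st[o] = INVALID;
    }
    l->st[c] = MODIFIED;
}
```

In this model a packet that is only read ends up SHARED in both cores, while a packet edited by the guest core becomes MODIFIED there and has to be written back before the other core can read it again.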

14 Go See Some C Code. Oh I see: S/W prefetching to reduce latency.
    desc_addr = gpa_to_vva(dev, desc->addr);
    rte_prefetch0((void *)(uintptr_t)desc_addr);
    rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
               (void *)((uintptr_t)(desc_addr + desc_offset)),
               cpy_len);
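The prefetch-then-copy pattern on this slide can be sketched outside DPDK as follows. This is a minimal illustration under stated assumptions: `struct desc` and `copy_descs` are hypothetical names, and the GCC/Clang builtin `__builtin_prefetch` stands in for `rte_prefetch0`, with plain `memcpy` standing in for `rte_memcpy`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor: where a packet fragment lives and how long it is. */
struct desc {
    const void *addr;
    size_t len;
};

/*
 * Copy n fragments back to back into out (capacity out_cap bytes).
 * Before copying fragment i, ask the CPU to start loading fragment i+1,
 * so that its cache lines are likely to be present when the copy gets there.
 * Returns the number of bytes copied; stops early if out would overflow.
 */
static size_t copy_descs(uint8_t *out, size_t out_cap,
                         const struct desc *d, size_t n)
{
    size_t off = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(d[i + 1].addr, 0 /* read */, 3 /* keep in all levels */);
        if (d[i].len > out_cap - off)
            break;
        memcpy(out + off, d[i].addr, d[i].len);
        off += d[i].len;
    }
    return off;
}
```

A prefetch is only a hint. As the following slides suggest, it also pulls the line toward the prefetching core, which matters when another core has just modified it.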

15 Without S/W Prefetching. Now I understand: bring it back right away! Scenario: Guest edit packet; no prefetching. [Chart labels: Guest ->, -> Guest; Guest M; core N-1; CPU 0; Memory 0]

16 How About Guest In Another Node? FWD: RX from NIC, TX to vhost, RX from vhost, TX to NIC. VM FWD: RX from virtio, edit packet, TX to virtio. [Diagram labels: Guest ->, -> Guest; Guest M; cores 1, N-1, N+1, 2N-1; CPU 0, CPU 1; Memory 0, Memory 1]

17 Better NOT. Keep related processes on the same node. Scenario: Guest edit packet; no prefetching.
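One common way to keep cooperating threads close together on Linux is to pin each one to a chosen core. The sketch below assumes Linux with glibc; `pin_to_core` and `first_allowed_core` are hypothetical helper names. It does not map cores to NUMA nodes, which has to come from elsewhere (for example the system topology).

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single core. Returns 0 on success, -1 on error. */
static int pin_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Return the lowest-numbered core the calling thread may run on, or -1. */
static int first_allowed_core(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int c = 0; c < CPU_SETSIZE; c++)
        if (CPU_ISSET(c, &set))
            return c;
    return -1;
}
```

With both forwarding threads pinned to cores of the same CPU, the packet data they exchange does not have to cross the socket interconnect.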

18 rte_memcpy()? Why Even Bother? Warm copy: DPDK's scenario. AVX load/store. Alignment handling. Scenario: Guest edit packet; Guest on the same node.

19 AVX For Bandwidth. 2x peak bandwidth. Measured gains shown on the chart: +40%, +53%. AVX512 is coming. Scenario: Guest read packet; Guest on the same node.
    xmm0 = _mm_loadu_si128(src);
    _mm_storeu_si128(dst, xmm0);
    ymm0 = _mm256_loadu_si256(src);
    _mm256_storeu_si256(dst, ymm0);
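The two load/store pairs on this slide can be put side by side in a small copy routine. This is a sketch, not the rte_memcpy implementation: `copy32_sse`, `copy32_avx`, and `copy_wide` are hypothetical names, the code assumes GCC or Clang, and on non-x86 targets it falls back to plain `memcpy`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86 1
#else
#define HAVE_X86 0
#endif

#if HAVE_X86
/* 32 bytes as two 16-byte (xmm) unaligned load/store pairs. */
__attribute__((target("sse2")))
static void copy32_sse(uint8_t *dst, const uint8_t *src)
{
    __m128i lo = _mm_loadu_si128((const __m128i *)src);
    __m128i hi = _mm_loadu_si128((const __m128i *)(src + 16));

    _mm_storeu_si128((__m128i *)dst, lo);
    _mm_storeu_si128((__m128i *)(dst + 16), hi);
}

/* The same 32 bytes as one 32-byte (ymm) unaligned load/store pair. */
__attribute__((target("avx")))
static void copy32_avx(uint8_t *dst, const uint8_t *src)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);

    _mm256_storeu_si256((__m256i *)dst, v);
}
#endif

/* Copy len bytes in 32-byte steps, using AVX when the CPU reports it. */
static void copy_wide(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

#if HAVE_X86
    int use_avx = __builtin_cpu_supports("avx");

    while (len >= 32) {
        if (use_avx)
            copy32_avx(d, s);
        else
            copy32_sse(d, s);
        d += 32;
        s += 32;
        len -= 32;
    }
#endif
    memcpy(d, s, len); /* tail, or the whole copy on non-x86 targets */
}
```

The AVX version issues half as many load and store instructions for the same number of bytes, which is where the "2x peak bandwidth" on the slide comes from.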

20 Alignment Matters. Just like coupons: FREE gifts if you use them. Measured gains shown on the chart: +17%, +15%. Scenario: Guest read packet; Guest on the same node.
    rte_memcpy((void *)((uint8_t *)dst + 1), src, len - 1);
    rte_memcpy(dst, src, len);
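Whether a buffer falls into the aligned or the `dst + 1` case above can be checked directly. The following is a minimal sketch assuming a C11 library that provides `aligned_alloc`; `misalignment` and `alloc_line_aligned` are hypothetical helper names, and the 64-byte line size is taken from the Haswell slide.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64 /* line size in bytes */

/* Bytes by which p lies past the previous align boundary (align: power of two). */
static size_t misalignment(const void *p, size_t align)
{
    return (size_t)((uintptr_t)p & (align - 1));
}

/* Allocate size bytes starting on a cache-line boundary; release with free(). */
static void *alloc_line_aligned(size_t size)
{
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = (size + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;

    return aligned_alloc(CACHE_LINE, rounded);
}
```

A buffer from `alloc_line_aligned` starts on a line boundary, while the same pointer plus one is misaligned for both 32-byte vector accesses and 64-byte lines, so some accesses straddle two lines.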

21 Takeaways. See actual memory behaviors under the hood: Intel 64 and IA-32 Architectures Optimization Reference Manual. Benefit from new IA technologies: AVX, DDIO.

23 Cache Allocation Technology. CMT + CAT. Noisy neighbor: one core is requesting a huge amount of data. What if another HIGH priority core is very latency sensitive? [Diagram labels: cores 0, 1, ..., N-1, N, N+1, ..., 2N-1; CPU 0, CPU 1; Memory 0, Memory 1]
