Impact of Cache Coherence Protocols on the Processing of Network Traffic

1 Impact of Cache Coherence Protocols on the Processing of Network Traffic
Amit Kumar and Ram Huggahalli
Communication Technology Lab, Corporate Technology Group, Intel Corporation
12/3/2007

2 Outline
- Background
- Network performance improvement with new microarchitecture
- Need to revisit platform changes for CPU onloading
- Overview of existing and prefetch-hint coherence protocols
- Direct Cache Access (DCA) performance overview
- Prototype results
- Future research

3 Background & Motivation
- Adoption of 10 Gbps has been limited to a few applications. A primary reason has been the processing capability of general-purpose platforms.
- Recent micro-architectural changes in Intel Core(TM) processors have shown 66% higher network processing capability than the previous-generation Intel Pentium 4 architecture.
- A coherence protocol that places incoming data into the CPU cache further improves processing capability.
- Our prototype implementation of Direct Cache Access (DCA) shows speed-ups of 15.6% and higher.

4 Background & Motivation (contd.)
Solutions to reduce TCP/IP processing overhead fall into three categories:
- Platform improvements for CPU onloading. Copy-specific solutions have included user-level TCP/IP stacks, page flipping, etc.
- TCP Offload Engines (TOEs). These use hardware assists to offload the main CPU and are limited to a small spectrum of networking applications.
- Interconnects or protocols such as InfiniBand, Myrinet or RDMA. These require new hardware-software interfaces, which in turn require application support. In some cases they require expensive NIC solutions as well.
New micro-architectural efficiencies provide a greater impetus for CPU onloading and diminish the need for specialized solutions.

5 Opportunity for DCA in Realistic Workloads
Source: Ram Huggahalli, Ravi Iyer and Scott Tetrick, "Direct Cache Access for High Bandwidth Network I/O", 32nd Annual International Symposium on Computer Architecture (ISCA'05).
[Figure: % of inbound I/O data read by CPU vs. distance. X-axis: System Bus Clocks (x1000); y-axis: % Occurrence. Workloads: TPC-W, SPECWeb99, TPC-C; Rx I/O sizes 512B to 4KB. Callout: network I/O data is read within a short time.]

6 Today's Coherence Protocol
1. A packet arrives on the NIC from the network.
2. The NIC sends the packet as I/O bus transactions to the chipset.
3. The chipset ensures coherency of the data by snooping processor caches before writing to memory.
4. The processor eventually reads the packet for TCP/IP processing and moves the data to the application buffer.
The demand read from the CPU is a compulsory cache miss.
[Figure: Coherence protocol for inbound I/O. Participants: Network, NIC, Chipset, Memory, Processor (L1/L2 line in M state). Before the CPU is interrupted: Packet, Coherent Memory Write, Snoop, Writeback, Memory Write. After the CPU is interrupted: Read, Read Data.]
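The four steps above can be sketched as a toy model. This is only an illustration under stated assumptions, not the platform's actual logic: the class name `BaselinePlatform`, the 64-byte line size, and the use of a single cache with two-letter MESI-style states are all hypothetical simplifications.

```python
from dataclasses import dataclass, field

LINE = 64  # assumed cache line size in bytes (not stated on the slide)


@dataclass
class BaselinePlatform:
    """Toy model: one CPU cache, one memory, a chipset that snoops on I/O writes."""
    memory: dict = field(default_factory=dict)  # line address -> data
    cache: dict = field(default_factory=dict)   # line address -> (state, data)
    misses: int = 0
    writebacks: int = 0

    def nic_write(self, addr, data):
        """Steps 2-3: coherent memory write issued by the NIC through the chipset."""
        if addr in self.cache:
            state, old = self.cache.pop(addr)   # the snoop invalidates the cached line
            if state == "M":                    # a modified line is written back first
                self.memory[addr] = old
                self.writebacks += 1
        self.memory[addr] = data                # the packet data lands in memory

    def cpu_read(self, addr):
        """Step 4: demand read by the TCP/IP stack after the interrupt."""
        if addr not in self.cache:
            self.misses += 1                    # compulsory miss: data is only in memory
            self.cache[addr] = ("E", self.memory[addr])
        return self.cache[addr][1]


platform = BaselinePlatform()
buf = [i * LINE for i in range(4)]              # a 256-byte receive buffer
for a in buf:
    platform.cache[a] = ("M", b"old")           # lines left modified by earlier use
for a in buf:
    platform.nic_write(a, b"pkt")
payload = [platform.cpu_read(a) for a in buf]
print(platform.misses, platform.writebacks)
```

In this model every line of the buffer misses on the first CPU read, because the snoop removed it from the cache before the data reached memory.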

7 Prefetch Hint Protocol
1. A packet arrives on the NIC from the network.
2. The NIC sends the packet as I/O bus transactions (with a target cache tag) to the chipset.
3. The chipset sends snoops to the processor with hints to prefetch the data.
4. The processor prefetches the packet soon after the hint is received.
The packet is present in the cache when TCP/IP processing begins.
[Figure: Coherence protocol for DCA prototype. Participants: Network, NIC, Chipset, Memory, Processor (L1/L2 line in M state). Before the CPU is interrupted: Packet, Coherent Memory Write, Snoop-Hint, Writeback, Memory Write, Prefetch Read, Read Data.]
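A minimal sketch of the difference the hint makes, again as a toy model rather than the prototype's implementation. The class `HintPlatform`, the helper `receive`, and the 64-byte line size are assumptions introduced for illustration; the hinted prefetch is modeled as happening instantly, whereas the slide only says it happens "soon after" the hint.

```python
class HintPlatform:
    """Toy model of an inbound I/O write with an optional prefetch hint."""

    def __init__(self):
        self.memory = {}         # line address -> data
        self.cache = {}          # line address -> data
        self.demand_misses = 0
        self.prefetches = 0

    def nic_write(self, addr, data, hint=True):
        self.cache.pop(addr, None)  # the snoop invalidates any stale copy
        self.memory[addr] = data    # the data is still written to memory first
        if hint:
            # Snoop-hint: the processor prefetches the line back from memory
            # before the interrupt is delivered.
            self.cache[addr] = self.memory[addr]
            self.prefetches += 1

    def cpu_read(self, addr):
        if addr not in self.cache:
            self.demand_misses += 1
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]


def receive(hint):
    """Receive one 4 KB I/O (64 lines of 64 bytes) and count demand misses."""
    p = HintPlatform()
    lines = [i * 64 for i in range(64)]
    for a in lines:
        p.nic_write(a, b"pkt", hint=hint)
    for a in lines:
        p.cpu_read(a)
    return p.demand_misses


print(receive(hint=False), receive(hint=True))
```

Without the hint every line is a demand miss; with it, the misses are replaced by prefetches that overlap with the time before the CPU is interrupted. Note that memory is still written in both cases, which is what the Write Update idea in the future-work slide aims to avoid.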

8 Impact of Prefetch Hint/DCA protocol
Copy with DCA is 5x faster and TCP/IP processing is 1.5x faster.
[Figure: ns per packet for 4KB I/O, Base vs. DCA, broken down into copy, tcp, and other (driver, os, app interface). Data labels appearing in the chart: 256 ns, 2481 ns, 1002 ns, 148 ns, 179 ns. Source: Intel.]
[Figure: System configuration. Four cores sharing two L2 caches; FSB 1333 MHz, 10.4 GB/s (peak); Memory Controller Hub with 4 ch FBD-667 MHz, 20.8 GB/s peak read bandwidth; 2x1GbE NIC on PCIe; 1GbE links to a system similar to the SUT.]
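To see how component speed-ups such as "copy 5x, TCP/IP 1.5x" translate into an overall per-packet gain, the arithmetic can be sketched as below. The function names are hypothetical, and the baseline times are round placeholder values chosen for illustration; they are NOT the measurements from the chart.

```python
def per_packet_speedup(base_ns, component_speedup):
    """Overall speed-up when each component's time is divided by its own factor.

    Components missing from component_speedup are assumed unchanged.
    """
    base_total = sum(base_ns.values())
    new_total = sum(t / component_speedup.get(name, 1.0)
                    for name, t in base_ns.items())
    return base_total / new_total


def gbps_per_core(io_bytes, ns_per_packet):
    """Throughput of one core that spends ns_per_packet on each io_bytes I/O."""
    return io_bytes * 8 / ns_per_packet  # bits per nanosecond equals Gb/s


# Placeholder per-packet times in ns (illustrative only, not slide data).
base = {"copy": 1000.0, "tcp": 600.0, "other": 2400.0}
factors = {"copy": 5.0, "tcp": 1.5}  # the component speed-ups quoted on the slide

overall = per_packet_speedup(base, factors)
print(f"overall speed-up: {overall:.2f}x")
print(f"base throughput per core: {gbps_per_core(4096, sum(base.values())):.2f} Gb/s")
```

The point of the sketch is that the overall gain is bounded by the components DCA does not touch: a large "other" share keeps the total speed-up well below the 5x seen for the copy alone.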

9 DCA Performance & Sensitivities
[Figure (a): Normalized throughput per core (Gb/s) and speed-up vs. I/O size (bytes), Base vs. DCA. DCA speed-up labels: 15.6%, 16.4%, 33.1%, 40.2%, 42.4%, 42.2%.]
[Figure (b): DCA speed-up with and without memory loading vs. I/O size (bytes). Series: DCA Speed-up (No Load), DCA Speed-up (Mem Load).]
[Figure (c): Normalized throughput per core (Gb/s) and speed-up vs. total TCP connections (across 2 ports, log scale), Base vs. DCA.]
[Figure (d): SPECintRate with network traffic, performance gain or loss per benchmark: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, GEOMEAN.]
[Figure (e): SPECfpRate with network traffic, performance gain or loss per benchmark: wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi, GEOMEAN.]

10 Future Research
DCA next steps:
- Protocol optimization: bypass memory and write incoming data directly into the LLC (Write Update protocol).
- Performance improvement with DCA at 10 Gbps and real application benefit.
Related future work:
- Read Current: a network transmit optimization in which the cached buffer used to transmit data remains in the same state in the cache.
- Cache QoS: network processing cycles kernel buffers through the CPU cache, evicting other useful data. Cache QoS policies will limit such pollution by restricting network data to a few ways in the cache.
- CPU-NIC integration: integrating the NIC on the CPU can unveil many opportunities that traditional SW and HW don't enjoy. A bigger ecosystem uplift is required to make effective use of NIC integration.
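The Cache QoS idea of confining network data to a few ways can be illustrated with a single cache set under LRU replacement. This is a hedged sketch only: `CacheSet`, `app_lines_surviving`, the 8-way associativity, and the 2-way network quota are all assumptions made for the example, not details from the slides.

```python
from collections import OrderedDict


class CacheSet:
    """One set of a set-associative cache with LRU replacement.

    If net_ways is given, lines tagged "net" may occupy at most that many ways.
    """

    def __init__(self, ways, net_ways=None):
        self.ways = ways
        self.net_ways = net_ways
        self.lines = OrderedDict()  # tag -> kind, least recently used first

    def access(self, tag, kind):
        if tag in self.lines:
            self.lines.move_to_end(tag)       # hit: mark as most recently used
            return True
        if kind == "net" and self.net_ways is not None:
            net_tags = [t for t, k in self.lines.items() if k == "net"]
            if len(net_tags) >= self.net_ways:
                del self.lines[net_tags[0]]   # evict the LRU network line only
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)    # otherwise evict the overall LRU line
        self.lines[tag] = kind
        return False


def app_lines_surviving(net_ways):
    """Load 6 application lines, stream 32 network lines, count survivors."""
    s = CacheSet(ways=8, net_ways=net_ways)
    for i in range(6):
        s.access(("app", i), "app")
    for i in range(32):
        s.access(("net", i), "net")
    return sum(1 for kind in s.lines.values() if kind == "app")


print(app_lines_surviving(None), app_lines_surviving(2))
```

With no restriction the streaming network buffers push out all six application lines; with network data limited to two ways, the application lines stay resident.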

11 Q&A
Disclaimer: Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit
