MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces
1 MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces
Hye-Churn Jang and Hyun-Wook (Jin) Jin
Department of Computer Science and Engineering, Konkuk University, Seoul, Korea
{comfact,
2 CONTENTS
- Introduction
- Background & Motivation
- MiAMI
- Performance Measurement
- Conclusions & Future Work
3 INTRODUCTION
- Multi-Core Processors
  - Leveraging throughput and scalability
  - Boosting concurrent processes: parallel programs, network server programs
- [Chart: Top 500 Processor Generation, June 2009]
4 INTRODUCTION
- High-Speed Interconnects: Gigabit Ethernet, 10 Gigabit Ethernet, InfiniBand, Myrinet
- [Chart: Top 500 Interconnect Family, June 2009]
- Multiple Network Interfaces
  - Cost-effective high network bandwidth
  - High availability (e.g., Tsukuba Univ. MegaProto/E)
5 MOTIVATION
- Are multi-core processors being utilized efficiently for multiple network interfaces?
  - Awareness of multi-core processors
  - Support for single to multiple network interfaces
  - Transparency for existing middleware and applications
6 ITEM #1: Multi-Core Processors
- Cache Layout
  - [Diagram: 2-Way Quad-Core SMP (Intel Clovertown) — two packages, eight cores, private L1 caches, one L2 shared by each core pair]
- Processor Affinity
  - Process affinity: mapping between processor and process
  - Interrupt affinity: mapping between processor and device (interrupt)
7 ITEM #1: Multi-Core Processors
- [Diagram: tasks scheduled over a 2-Way Quad-Core SMP (Intel Clovertown), with a Gigabit Ethernet NIC raising its IRQ on one core]
- Process affinity relative to the interrupt's core: Good (core affinity), Not bad (cache affinity), Bad (other affinity)
8 ITEM #2: Multiple Network Interfaces
- Multi-port network interfaces: Silicom, Chelsio, Intel, etc.
- Receive-Side Scaling (RSS): enables packet receive-processing to scale with the number of available processors (Intel, Chelsio, etc.)
- Self-virtualized I/O devices: virtual interfaces
9 ITEM #2: Multiple Network Interfaces
- Tricky Issues
  - Relative optimal process affinity: the optimal core depends on which interface a process uses
  - Processes using multiple interfaces?
  - Generalization for N network interfaces (N ≥ 1)
- [Diagram: processes P0..Pk mapped over cores, caches, and NIC0/NIC1]
10 ITEM #3: Transparency
- autopin and VTune: static tuning
- sched_setaffinity() system call: application-level modifications
- SyMMer: middleware-level modifications (e.g., MPI)
- Kernel-level process scheduling
  - Considerations for multi-core processors
  - Considerations for N network interfaces
11 MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces
- Basic Ideas
  - Multi-core awareness: cache layout and interrupt affinity aware process scheduling
  - N network interfaces: arbitration between processes
  - Transparency: adaptive kernel-level scheduling
12 OVERALL DESIGN
- [Diagram: communication system calls trigger the Communication Load Monitor; the Scheduling Agent, driven by a timer, reads communication intensiveness, cache layout, interrupt affinity, and processor load; the Configuration Tool sets/gets parameters through the /proc file system]
13 SYSTEM INFORMATION
- Interrupt affinity: to know the optimal core
- Cache layout: to decide sub-optimal cores
- Processor loads: to deal efficiently with the tradeoff between optimal process affinity and processor overloading
14 COMMUNICATION LOAD MONITOR
- Traces the communication intensiveness from recent communication history
- Intensiveness = 10^6 / Average
  - Average: the average value of recent communication intervals; range from 1 to 10^6
  - Intensiveness: a higher value means higher communication intensiveness; range from 1 to 10^6
15 COMMUNICATION INTENSIVENESS
- [Plots: bandwidth (Mbps) and communication intensiveness vs. interval between send system calls (us)]
16 COMMUNICATION INTENSIVENESS
- Communication intensiveness tables
  - Three classes: high, middle, low
  - Roughly sorted lists: can insert into a list in O(1)
  - Can still choose a (sub)optimal victim to migrate onto an idle/optimal core
  - Dynamic class boundaries, adjusted by the scheduling agent
17 SCHEDULING AGENT
- Input parameters: communication intensiveness, cache layout, interrupt affinity, processor load
- Three levels of affinity: core-level, cache-level, other-level
- Basic policy: move communication-intensive processes into the core-level affinity state as much as possible, if the processor load allows
18 SCHEDULING POLICY
- Step 1: Release idle networking processes
  - Scheduled by the legacy Linux process scheduler
- Step 2: Move communication-intensive processes
  - Other-level or cache-level -> core-level affinity
  - Other-level -> cache-level affinity
- Step 3: Mitigate the processor's load
  - Adjusts the class boundaries
  - Moves less communication-intensive processes onto idle cores
19 PROCESS AFFINITY STATES
- [State diagram: four states — core affinity, cache affinity, other affinity, no affinity — with transition conditions:]
  - Communication intensive & optimal core is idle
  - Communication intensive & sub-optimal core is idle
  - Processor overload & no cache-level cores
  - Processor underload & cache-level cores
20 DATA STRUCTURES
- [Diagram: per-core miami_cpu structures hold system information (cache layer, processor load) and process-affinity lists — core-level, cache-level, and other-level, each split into high/middle/low intensiveness classes; each MiAMI list is made of miami_node entries (PCB pointer, timestamp, history, intensiveness) linking the TCP/IP control block to the process control block and the process run queue]
21 EXPERIMENTAL SYSTEM
- [Diagram: 2-way quad-core processors (Intel Xeon Clovertown and AMD Opteron Barcelona) — two packages, eight cores, private L1 caches, shared L2 caches]
- Two 4-port 1GigE NICs (Silicom PXG4)
- [Table: interrupt-affinity cases 2S, 2N, 4S, and 4N, each mapping 2 or 4 NIC ports]
22 MICROBENCHMARK
- Large volume data transmission (bandwidth test)
- 8 to 32 TCP connections
23 MICROBENCHMARK: SMP
- [Charts: CPU utilization (%) and aggregate bandwidth (Mbps) vs. # of connections / interrupt affinity (2S, 2N, 4S, 4N, IrqB), comparing CPU, MiAMI-CPU, Net, and MiAMI-Net]
- MiAMI can save more processor resources (up to 65%) compared with the others
24 SCENARIO-BASED BENCHMARK 1
- Large volume data transmission: 8 TCP connections
- Ping-pong communication: 8 to 24 TCP connections
25 SCENARIO-BASED BENCHMARK
- [Charts: improvement (%) in bandwidth, latency, and processor utilization for (8,8) and (8,24) processes under the 2S, 2N, 4S, 4N, and IrqB interrupt affinities, on the SMP and NUMA systems]
- MiAMI can prevent frequent migration and deliver better performance
26 SCENARIO-BASED BENCHMARK 2
- Process type A: communication interval 50us, start point 0s, 2 processes
- Process type B: communication interval 20us, start point 10s, 2 processes
- Process type C: communication interval 0us, start point 20s, 1 process
27 SCENARIO-BASED BENCHMARK
- [Charts: improvement (%) in aggregate bandwidth, bandwidth of process C, and processor utilization on the SMP and NUMA systems]
- MiAMI can figure out which connection is the most communication intensive and decide the optimal distribution of networking processes over multiple cores
28 CONCLUSIONS
- MiAMI
  - Multi-core aware networking process scheduling
  - Generalization for multiple network interfaces
  - Transparent scheduling
- Experimental Results
  - Processor utilization savings: up to 65% (SMP), up to 63% (NUMA)
  - Network performance: more than 30% improvement with less processor resources
29 FUTURE WORK
- Performance Measurement
  - Real applications: web servers, parallel file systems
  - Various scenarios for multiple interfaces: RSS, self-virtualized network devices
- Extension of MiAMI
  - Better scheduling policies
  - Other I/O peripheral devices
30 Thanks! Hyun-Wook (JIN) Jin System Software Laboratory Konkuk University
31 NETWORK DATA PATH
- [Diagram: send path — user application -> socket send system call -> network protocol -> kernel buffer -> device driver -> device; receive path — device -> hard interrupt -> device driver (interrupt handler) -> soft interrupt -> bottom half (softirq) and network protocol -> kernel buffer -> socket receive system call -> user application]
32 ITEM #1: Multi-Core Processors
- [Diagram: 2-Way Quad-Core NUMA (AMD Barcelona) — two packages, eight cores, private L1 and L2 caches, one L3 shared per package]
33 CONFIGURATION TOOL
- The user-level MiAMI interfaces
  - To set up the policies
  - Behavior parameters (e.g., the interval of the scheduling agent)
34 COMMUNICATION LOAD MONITOR
- [Diagram: communication system calls feed per-connection monitoring through the TCP control blocks (TCB A..Z) and PCBs; the scheduling agent sees each connection classified from low to high intensiveness]
35 EXPERIMENTAL SYSTEM
- [Diagram: 2-way AMD Opteron 2356 (Barcelona quad-core) NUMA system — private L1/L2 caches per core, one L3 shared per package, local RAM per package]
- 4GB RAM
- Dual onboard NVIDIA MCP55 NICs
- One HotLava 6CGNIC-X 6-port 1GigE NIC
- [Table: interrupt-affinity cases 2S, 2N, 4S, and 4N, each mapping 2 or 4 NIC ports]
36 MICROBENCHMARK: SMP
- [Charts: CPU utilization (%) and aggregate bandwidth (Mbps) vs. # of connections / interrupt affinity (2S, 2N, 4S, 4N, IrqB), for sending and receiving on the SMP system]
37 MICROBENCHMARK: NUMA
- [Charts: CPU utilization (%) and aggregate bandwidth (Mbps) vs. # of connections / interrupt affinity (2S, 2N, 4S, 4N, IrqB), for sending and receiving on the NUMA system]
- MiAMI can save more processor resources compared with the others
38 What we are missing
- Optimal priority: core-level, cache-level, and other-level affinity, each with high, middle, and low classes
- Optimal equation for evaluating the communication intensiveness
  - Data volume
  - Tradeoff between migration overhead and optimal affinity
- Optimal timer resolution for the scheduling agent
- Optimal interrupt affinity
  - Static is not sufficient; IrqB is not sufficient
- Optimal thresholds
  - Processor overloading
  - Boundaries for communication intensiveness
  - Number of migrations
More informationRedbooks Paper. Network Performance Evaluation on an IBM e325 Opteron Cluster. Introduction. Ruzhu Chen Lerone Latouche
Redbooks Paper Ruzhu Chen Lerone Latouche Network Performance Evaluation on an IBM e325 Opteron Cluster Network performance evaluation is an important measurement that is used to adjust and tune up cluster
More informationMemcached Design on High Performance RDMA Capable Interconnects
Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan
More informationChapter 13 TRANSPORT. Mobile Computing Winter 2005 / Overview. TCP Overview. TCP slow-start. Motivation Simple analysis Various TCP mechanisms
Overview Chapter 13 TRANSPORT Motivation Simple analysis Various TCP mechanisms Distributed Computing Group Mobile Computing Winter 2005 / 2006 Distributed Computing Group MOBILE COMPUTING R. Wattenhofer
More informationDesigning Power-Aware Collective Communication Algorithms for InfiniBand Clusters
Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,
More informationOptimizing TCP Receive Performance
Optimizing TCP Receive Performance Aravind Menon and Willy Zwaenepoel School of Computer and Communication Sciences EPFL Abstract The performance of receive side TCP processing has traditionally been dominated
More informationPerformance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology
Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology Shinobu Miwa Sho Aita Hiroshi Nakamura The University of Tokyo {miwa, aita, nakamura}@hal.ipc.i.u-tokyo.ac.jp
More informationIntroduction Electrical Considerations Data Transfer Synchronization Bus Arbitration VME Bus Local Buses PCI Bus PCI Bus Variants Serial Buses
Introduction Electrical Considerations Data Transfer Synchronization Bus Arbitration VME Bus Local Buses PCI Bus PCI Bus Variants Serial Buses 1 Most of the integrated I/O subsystems are connected to the
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationDesigning Next Generation Data-Centers with Advanced Communication Protocols and Systems Services
Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory
More informationOceanStor 9000 InfiniBand Technical White Paper. Issue V1.01 Date HUAWEI TECHNOLOGIES CO., LTD.
OceanStor 9000 Issue V1.01 Date 2014-03-29 HUAWEI TECHNOLOGIES CO., LTD. Copyright Huawei Technologies Co., Ltd. 2014. All rights reserved. No part of this document may be reproduced or transmitted in
More informationIO-Lite: A Unified I/O Buffering and Caching System
IO-Lite: A Unified I/O Buffering and Caching System Vivek S. Pai, Peter Druschel and Willy Zwaenepoel Rice University (Presented by Chuanpeng Li) 2005-4-25 CS458 Presentation 1 IO-Lite Motivation Network
More informationPerformance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware
Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware 2010 VMware Inc. All rights reserved About the Speaker Hemant Gaidhani Senior Technical
More informationNUMA replicated pagecache for Linux
NUMA replicated pagecache for Linux Nick Piggin SuSE Labs January 27, 2008 0-0 Talk outline I will cover the following areas: Give some NUMA background information Introduce some of Linux s NUMA optimisations
More informationSMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems
Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]
More informationSingle-Points of Performance
Single-Points of Performance Mellanox Technologies Inc. 29 Stender Way, Santa Clara, CA 9554 Tel: 48-97-34 Fax: 48-97-343 http://www.mellanox.com High-performance computations are rapidly becoming a critical
More informationChapter 2 Computer-System Structure
Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual
More informationHIGH PERFORMANCE AND SCALABLE MPI INTRA-NODE COMMUNICATION MIDDLEWARE FOR MULTI-CORE CLUSTERS
HIGH PERFORMANCE AND SCALABLE MPI INTRA-NODE COMMUNICATION MIDDLEWARE FOR MULTI-CORE CLUSTERS DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy
More informationSEDA: An Architecture for Well-Conditioned, Scalable Internet Services
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of California, Berkeley Operating Systems Principles
More informationXCo: Explicit Coordination for Preventing Congestion in Data Center Ethernet
XCo: Explicit Coordination for Preventing Congestion in Data Center Ethernet Vijay Shankar Rajanna, Smit Shah, Anand Jahagirdar and Kartik Gopalan Computer Science, State University of New York at Binghamton
More informationPCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate
NIC-PCIE-1SFP+-PLU PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate Flexibility and Scalability in Virtual
More informationPAC094 Performance Tips for New Features in Workstation 5. Anne Holler Irfan Ahmad Aravind Pavuluri
PAC094 Performance Tips for New Features in Workstation 5 Anne Holler Irfan Ahmad Aravind Pavuluri Overview of Talk Virtual machine teams 64-bit guests SMP guests e1000 NIC support Fast snapshots Virtual
More information