Improving Driver Performance A Worked Example

Similar documents
Intel Unite. Intel Unite Firewall Help Guide

DPDK-in-a-Box. David Hunt - Intel DPDK US Summit - San Jose

Clear CMOS after Hardware Configuration Changes

Intel Unite Solution. Linux* Release Notes Software version 3.2

Intel Unite Plugin Guide for VDO360 Clearwater

Intel Software Guard Extensions Platform Software for Windows* OS Release Notes

DPDK Performance Report Release Test Date: Nov 16 th 2016

Stanislav Bratanov; Roman Belenov; Ludmila Pakhomova 4/27/2015

DPDK Intel NIC Performance Report Release 18.02

Intel & Lustre: LUG Micah Bhakti

Intel Xeon W-3175X Processor Thermal Design Power (TDP) and Power Rail DC Specifications

Intel Unite Solution Version 4.0

DPDK Intel NIC Performance Report Release 18.05

DPDK Intel NIC Performance Report Release 17.08

Bitonic Sorting Intel OpenCL SDK Sample Documentation

Intel Unite Solution Version 4.0

Intel Unite Solution Version 4.0

Intel Unite. Enterprise Test Environment Setup Guide

12th ANNUAL WORKSHOP 2016 NVME OVER FABRICS. Presented by Phil Cayton Intel Corporation. April 6th, 2016

Intel Xeon Phi Coprocessor. Technical Resources. Intel Xeon Phi Coprocessor Workshop Pawsey Centre & CSIRO, Aug Intel Xeon Phi Coprocessor

Intel Compute Card Slot Design Overview

Analyze and Optimize Windows* Game Applications Using Intel INDE Graphics Performance Analyzers (GPA)

Intel Software Development Products Licensing & Programs Channel EMEA

Evolving Small Cells. Udayan Mukherjee Senior Principal Engineer and Director (Wireless Infrastructure)

Jomar Silva Technical Evangelist

Intel Desktop Board DZ68DB

Installation Guide and Release Notes

Sample for OpenCL* and DirectX* Video Acceleration Surface Sharing

Bitonic Sorting. Intel SDK for OpenCL* Applications Sample Documentation. Copyright Intel Corporation. All Rights Reserved

Tutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE

Intel Xeon Phi Coprocessor Performance Analysis

Intel Unite Solution. Plugin Guide for Protected Guest Access

Bridging the gap between hardware functionality in DPDK applications and vendor neutrality in the open source community

LED Manager for Intel NUC

Intel Cluster Ready Allowed Hardware Variances

Intel Unite Solution Intel Unite Plugin for WebEx*

Tutorial: Finding Hotspots with Intel VTune Amplifier - Linux* Intel VTune Amplifier Legal Information

DPDK Vhost/Virtio Performance Report Release 18.11

Modernizing Meetings: Delivering Intel Unite App Authentication with RFID

Bei Wang, Dmitry Prohorov and Carlos Rosales

Installation Guide and Release Notes

DPDK Summit China 2017

Optimization of Lustre* performance using a mix of fabric cards

Using Intel VTune Amplifier XE for High Performance Computing

Data Plane Development Kit

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Intel Parallel Amplifier Sample Code Guide

DPDK Vhost/Virtio Performance Report Release 18.05

OpenCL* and Microsoft DirectX* Video Acceleration Surface Sharing

Intel Open Network Platform Release 2.0 Hardware and Software Specifications Application Note. SDN/NFV Solutions with Intel Open Network Platform

Localized Adaptive Contrast Enhancement (LACE)

Munara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.

Intel Cache Acceleration Software for Windows* Workstation

Zhang, Hongchao

Intel System Debugger 2018 for System Trace Linux* host

Guy Blank Intel Corporation, Israel March 27-28, 2017 European LLVM Developers Meeting Saarland Informatics Campus, Saarbrücken, Germany

Intel Parallel Studio XE 2011 for Windows* Installation Guide and Release Notes

Fan Yong; Zhang Jinghai. High Performance Data Division

Intel Unite Solution Version 4.0

Re-Architecting Cloud Storage with Intel 3D XPoint Technology and Intel 3D NAND SSDs

Intel Atom Processor Based Platform Technologies. Intelligent Systems Group Intel Corporation

Intel QuickAssist for Windows*

Intel Server Board S2600CW2S

Using the Intel VTune Amplifier 2013 on Embedded Platforms

Intel Media Server Studio 2018 R1 Essentials Edition for Linux* Release Notes

Intel Software Guard Extensions (SGX) SW Development Guidance for Potential Bounds Check Bypass (CVE ) Side Channel Exploits.

Virtuozzo Hyperconverged Platform Uses Intel Optane SSDs to Accelerate Performance for Containers and VMs

Vectorization Advisor: getting started

Intel Desktop Board D946GZAB

Intel G31/P31 Express Chipset

Using Intel VTune Amplifier XE and Inspector XE in.net environment

IXPUG 16. Dmitry Durnov, Intel MPI team

Intel Desktop Board D975XBX2

Memory & Thread Debugger

Intel Server Board S2400SC

Intel Unite Solution. Plugin Guide for Protected Guest Access

Intel Analysis of Speculative Execution Side Channels

INTEL PERCEPTUAL COMPUTING SDK. How To Use the Privacy Notification Tool

Product Change Notification

Product Change Notification

Building an Android* command-line application using the NDK build tools

Intel Server Board S5520HC

Collecting OpenCL*-related Metrics with Intel Graphics Performance Analyzers

Olaf Weber Senior Software Engineer SGI Storage Software. Amir Shehata Lustre Network Engineer Intel High Performance Data Division

DPDK s Best Kept Secret: Micro-benchmarks. M Jay DPDK Summit - San Jose 2017

Intel Desktop Board D945GCLF2

Intel Cluster Checker 3.0 webinar

Jim Cownie, Johnny Peyton with help from Nitya Hariharan and Doug Jacobsen

A Simple Path to Parallelism with Intel Cilk Plus

Debugging and Analyzing Programs using the Intercept Layer for OpenCL Applications

Installation Guide and Release Notes

Intel Parallel Studio XE 2015 Composer Edition for Linux* Installation Guide and Release Notes

Intel Cluster Toolkit Compiler Edition 3.2 for Linux* or Windows HPC Server 2008*

Real World Development examples of systems / iot

ENVISION TECHNOLOGY CONFERENCE. Functional intel (ia) BLA PARTHAS, INTEL PLATFORM ARCHITECT

Product Change Notification

Välkommen. Intel Anders Huge

Achieving 2.5X 1 Higher Performance for the Taboola TensorFlow* Serving Application through Targeted Software Optimization

Data life cycle monitoring using RoBinHood at scale. Gabriele Paciucci Solution Architect Bruno Faccini Senior Support Engineer September LAD

DIY Security Camera using. Intel Movidius Neural Compute Stick

Transcription:

USERSPACE, October 2016 Improving Driver Performance A Worked Example Bruce Richardson

Legal Disclaimers No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. Intel technologies features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. 2016 Intel Corporation. Intel, the Intel logo, Intel. Experience What s Inside, and the Intel. Experience What s Inside logo are trademarks of Intel. Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Introduction This is a real-world use-case of performance improvement using performance analysis tools in this case Intel VTune Amplifier The issues discovered here were largely discovered by accident when investigating other DPDK behaviour Hope the walk-through of the issue debug may be of use to others when dealing with performance issues 3

Intel Vtune Amplifier Tool for performance analysis and debugging Uses hardware event counters to report on issues affecting program execution, e.g. CPU stalls due to memory access 4

Test Setup 2 x Intel Ethernet Converged Network Adapter XL710-QDA1 Traffic Generator DUT (Intel Xeon E5-2699 v3 @ 2.30GHz) 5

6

What does this mean? We are doing a write to a memory location We are subsequently doing a read from that memory location The read is getting blocked by the write because the read is for more data than just the write if the write is for the same amount of data, then we can do store forwarding to return the data write to the read without hitting cache/mem if the read can be delayed till later, the write can complete and the read can come from cache This should not generally show up as a problem in good code 7

/* D.1 pkt 3,4 convert format from desc to pktmbuf * pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk); pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk); vmovdqax -0x18(%rsp), %xmm11 vmovdqax -0x28(%rsp), %xmm1 vmovdqax -0x48(%rsp), %xmm3 8

Reading this case In this case, both the instruction highlighted and the previous one are 128-bit loads. Therefore either one is a potential candidate for the source of the delay If we assume that this is the read, then we need to find the offending write: occurs previous to these [nice and easy to find in C, as there is a compiler barrier] is of a size smaller than 128-bits 9 9

Other Weirdness Why does the code: pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk); cause a read from memory at all? Shouldn t descs[2] have already been loaded to xmm register previously at the line? descs[2] = _mm_loadu_si128(( m128i *)(rxdp + 2)); Let s look at that previous load lines in vtune 10

descs[3] = _mm_loadu_si128(( m128i *)(rxdp + 3)) vmovdqux 0x60(%r9), %xmm3 vmovapsx %xmm3, -0x18(%rsp) 11

What is happening? A load instrinsic is resulting in an xmm load followed by a store? We have a second mystery. Let s trace back through what happens to the descriptors through the code between the two points 12

desc_pktlen_align Only work done between desc[2] load and the offending line is function desc_pktlen_align Again look at assembler listing 13

Vector Code Scalar Code [Including writes] 14

The Source of the problem When assigning the lengths, we use 16-bit writes: have to go to memory (can t assign to an xmm register) causes the xmm load to immediately store to stack causes the shuffle op to trigger a second load That second load (128b) blocks on 16b write pktlen0 = _mm_srli_epi32(pktlen0, PKTLEN_SHIFT); pktlen0 = _mm_and_si128(pktlen0, pktlen_msk); pktlen0 = _mm_packs_epi32(pktlen0, zero); vol.dword = _mm_cvtsi128_si64(pktlen0); /* let the descriptor byte 15-14 store the pkt len */ *((uint16_t *)&descs[0]+7) = vol.e[0]; *((uint16_t *)&descs[1]+7) = vol.e[1]; *((uint16_t *)&descs[2]+7) = vol.e[2]; *((uint16_t *)&descs[3]+7) = vol.e[3]; 15

The Fix Rewrite to use entirely vector operations: keeps things in xmm registers saves unnecessary loads and stores prevents store forward errors improves performance. makes people happy* pktlen0 = _mm_srli_epi32(pktlen0, PKTLEN_SHIFT); pktlen0 = _mm_and_si128(pktlen0, pktlen_msk); pktlen0 = _mm_packs_epi32(pktlen0, pktlen0); descs[3] = _mm_blend_epi16(descs[3], pktlen0, 0x80); pktlen0 = _mm_slli_epi64(pktlen0, 16); descs[2] = _mm_blend_epi16(descs[2], pktlen0, 0x80); pktlen0 = _mm_slli_epi64(pktlen0, 16); descs[1] = _mm_blend_epi16(descs[1], pktlen0, 0x80); pktlen0 = _mm_slli_epi64(pktlen0, 16); descs[0] = _mm_blend_epi16(descs[0], pktlen0, 0x80); *NOTE: happiness not guaranteed 16

Result Stalls due to Loads Blocked by Store Forwarding dropped about 19x: Before: 0.189 After: 0.010 Overall PMD performance measured by testpmd increased by over 5%. 17