GASPP: A GPU-Accelerated Stateful Packet Processing Framework

1 GASPP: A GPU-Accelerated Stateful Packet Processing Framework. Giorgos Vasiliadis, FORTH-ICS, Greece; Lazaros Koromilas, FORTH-ICS, Greece; Michalis Polychronakis, Columbia University, USA; Sotiris Ioannidis, FORTH-ICS, Greece

2 Network Packet Processing: Computationally and memory-intensive. High levels of data parallelism: each packet can be processed in parallel. Poor temporal locality for data: typically, each packet is processed only once.

3 GPU = Graphics Processing Unit: Highly parallel manycore devices. Hundreds of cores. High memory bandwidth. Up to 6GB of memory.

5 GPUs for Network Packet Processing: Gnort [RAID '08], PacketShader [SIGCOMM '10], SSLShader [NSDI '11], MIDeA [CCS '11], Kargus [CCS '12]. Independent/monolithic designs: 1. A lot of CPU-side code even for simple apps. 2. Explicit batching. 3. Explicit data copies and PCIe transfers.

6 Need a framework for developing GPU-accelerated packet processing applications

7 GASPP Framework [Diagram: modules AES, Regex Match, String Match, Firewall; framework layers TCP Flow State Management, Packet Reordering, Packet Scheduling, Packet Decoding]

8 GASPP Framework: Fast user-space packet capturing. Modular and flexible. Efficient packet scheduling mechanisms. TCP processing and flow management support.

10 Fast user-space packet capturing [Diagram: NIC, NIC DMA buffer, CPU, main memory, GPU DMA buffer, GPU] Use a single user-space buffer between the NIC and the GPU. Stage packets back-to-back to a separate buffer.

11 Fast user-space packet capturing [Tables: throughput (Gbit/s) by packet size (#bytes) for each approach; comparison marked ">"]

12 Fast user-space packet capturing [Tables: throughput (Gbit/s) by packet size (#bytes) for each approach; comparison marked "<"]

13 Why staging is better than zero-copy (for small packets) [Diagram: NIC's packet buffer vs. staging buffer] Better space utilization => no redundant transfers.

14 Selective scheme [Diagram: NIC, NIC DMA buffer, CPU, main memory, GPU DMA buffer, GPU] Packets are copied back-to-back to a separate buffer if the buffer occupancy is sparse. Otherwise, they are transferred directly to the GPU.
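
The selective scheme can be illustrated with a minimal Python sketch. The fixed slot size, the 50% occupancy threshold, and the function names are assumptions made for illustration; the slides do not state the values GASPP actually uses.

```python
def occupancy(packet_lengths, slot_size):
    """Fraction of the NIC buffer that holds real packet bytes,
    assuming each packet occupies one fixed-size slot (an assumption)."""
    if not packet_lengths:
        return 0.0
    return sum(packet_lengths) / (len(packet_lengths) * slot_size)


def plan_transfer(packet_lengths, slot_size=1536, threshold=0.5):
    """Return (strategy, bytes_moved_to_gpu) for one batch.

    Sparse buffer -> 'staging': copy packets back-to-back first, so only
                     the packet bytes cross the PCIe bus.
    Dense buffer  -> 'direct': transfer the NIC buffer as it is.
    """
    if occupancy(packet_lengths, slot_size) < threshold:
        return "staging", sum(packet_lengths)
    return "direct", len(packet_lengths) * slot_size


small = [64] * 1000      # minimum-size packets: the buffer is mostly padding
large = [1500] * 1000    # nearly full slots
print(plan_transfer(small))  # ('staging', 64000)
print(plan_transfer(large))  # ('direct', 1536000)
```

With 64-byte packets in 1536-byte slots, a direct transfer would move 24 times more bytes than the packets contain, which is the redundant transfer that staging avoids.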

16 Modular and Flexible: Basic abstraction of processing: "modules", each defining processpacket(packet){... }. Modules are executed sequentially or in parallel. [Diagram: RX -> processpacket( ) -> processpacket( ) -> processpacket( ) -> TX; Module 1: IP-learn, Module 2: Content Inspection, Module 3: Encryption]
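
The module abstraction can be sketched in Python as a chain of per-packet functions run one after another over a batch. The `Packet` fields, the module bodies, and `run_chain` are hypothetical stand-ins; in GASPP the modules run on the GPU, and their real interfaces are not shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    src_ip: str
    payload: bytes
    trace: list = field(default_factory=list)  # modules that touched the packet


def ip_learn(packet):
    # Module 1: note that the source address was observed.
    packet.trace.append("ip-learn")
    return packet


def content_inspection(packet):
    # Module 2: flag payloads containing a made-up signature.
    hit = b"attack" in packet.payload
    packet.trace.append("content-inspection:match" if hit else "content-inspection:clean")
    return packet


def encryption(packet):
    # Module 3: stand-in transform (XOR with a constant), not real encryption.
    packet.payload = bytes(b ^ 0x5A for b in packet.payload)
    packet.trace.append("encryption")
    return packet


def run_chain(batch, modules):
    """Apply each module's per-packet step to every packet, in module order."""
    for module in modules:
        batch = [module(p) for p in batch]
    return batch


batch = [Packet("10.0.0.1", b"hello"), Packet("10.0.0.2", b"attack!")]
out = run_chain(batch, [ip_learn, content_inspection, encryption])
print(out[1].trace)  # ['ip-learn', 'content-inspection:match', 'encryption']
```

The sketch covers only the sequential case; running modules in parallel would mean giving different packets of the same batch to different modules.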

21 Batch Processing Pipeline [Timeline: RX, Module1, Module2, TX over time] Stages: RX batch, copy to GPU, batch processing, copy to CPU, TX batch.
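
One way to read the timeline is as a software pipeline in which consecutive batches occupy different stages at the same time. The Python sketch below makes that assumption explicit: it treats every stage as taking one time step, which is a simplification, and `pipeline_schedule` is a hypothetical helper rather than part of GASPP.

```python
STAGES = ["RX batch", "copy to GPU", "batch processing", "copy to CPU", "TX batch"]


def pipeline_schedule(num_batches, stages=STAGES):
    """Return, for each time step, a dict {stage name: batch index}."""
    steps = []
    for t in range(num_batches + len(stages) - 1):
        busy = {}
        for s, name in enumerate(stages):
            b = t - s  # batch that reaches stage s at time step t
            if 0 <= b < num_batches:
                busy[name] = b
        steps.append(busy)
    return steps


for t, busy in enumerate(pipeline_schedule(3)):
    print(t, busy)
```

At time step 2, for example, batch 2 is being received while batch 1 is copied to the GPU and batch 0 is processed.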

23 Single Instruction, Multiple Threads [Diagram: SIMT group (warp)] Threads within the same warp have to execute the same instructions. Great for regular computations!
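
The cost of irregular work under SIMT can be shown with a simplified model in Python: threads of a warp that follow the same code path run together, while distinct paths are executed one after another. The per-path costs and the `warp_cost` helper are invented for illustration.

```python
WARP_SIZE = 32


def warp_cost(paths, path_cost):
    """Cost of one warp whose threads follow the given code paths.

    Distinct paths are serialized, so the warp pays once for every
    distinct path it contains, regardless of how many threads take it.
    """
    return sum(path_cost[p] for p in set(paths))


path_cost = {"firewall": 10, "regex": 100}  # made-up per-path costs

uniform = ["firewall"] * WARP_SIZE
mixed = ["firewall"] * 31 + ["regex"]

print(warp_cost(uniform, path_cost))  # 10
print(warp_cost(mixed, path_cost))    # 110
```

A single thread on a different path is enough to make the whole warp pay for both paths.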

24 Parallelism in packet processing [Figure: network traffic divided into batches; axis: batch size (#packets)] Network packets are processed in batches. More packets => more parallelism.

27 Dynamic Irregularities [Figure: batch of packets assigned to module 1, module 2, module 3; axis: batch size (#packets)] The mix of received network packets is very dynamic: different packet lengths; divergent parallel module processing.

32 Dynamic Irregularities [Timeline: warps 1-7 executing module 1, module 2, module 3 over time] Low warp occupancy.

33 Packet grouping [Figure: two batches; axis: batch size]

34 Packet grouping [Timeline: warps 1-7 over time] Harmonized execution. Symmetric processing.
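
The effect of packet grouping can be sketched in Python under two assumptions that the slides do not confirm: that packets are grouped by target module and then by length, and that every thread in a warp waits for the longest packet in that warp. The warp size of 4 is chosen only to keep the example small, and all helper names are hypothetical.

```python
WARP_SIZE = 4  # kept small for readability; not a real hardware value


def warps(batch, warp_size=WARP_SIZE):
    """Split a batch into consecutive warp-sized chunks."""
    return [batch[i:i + warp_size] for i in range(0, len(batch), warp_size)]


def divergent_warps(batch):
    """Count warps whose packets need more than one module."""
    return sum(1 for w in warps(batch) if len({module for module, _ in w}) > 1)


def wasted_bytes(batch):
    """Idle work when each thread in a warp waits for the longest packet."""
    return sum(max(n for _, n in w) * len(w) - sum(n for _, n in w)
               for w in warps(batch))


def group_packets(batch):
    """Group by module, then by length, so neighbouring packets look alike."""
    return sorted(batch, key=lambda p: (p[0], p[1]))


# (module, packet length) pairs in arrival order
batch = [("aes", 1500), ("fw", 64), ("aes", 64), ("fw", 1500)] * 4
grouped = group_packets(batch)

print(divergent_warps(batch), divergent_warps(grouped))  # 4 0
print(wasted_bytes(batch), wasted_bytes(grouped))        # 11488 0
```

After grouping, each warp holds packets of one module and one length, which corresponds to the harmonized and symmetric execution named on the slide.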

36 TCP Flow State Management: Maintain the state of TCP connections. [Diagram: connection table pointing to connection records] Connection record fields: Hash key: 4 bytes; State: 1 byte; Seq CLIENT: 4 bytes; Seq SERVER: 4 bytes; Next: 4 bytes.
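
The record layout adds up to 17 bytes when the fields are packed without padding. The Python sketch below encodes it with the standard `struct` module; the little-endian byte order, the packed layout, and the reading of "Next" as the index of another record are assumptions, since the slide gives only field names and sizes.

```python
import struct

# Packed layout: hash key (4), state (1), client seq (4), server seq (4), next (4).
# "<" selects little-endian with no alignment padding (both are assumptions).
RECORD = struct.Struct("<IBIII")


def pack_record(hash_key, state, seq_client, seq_server, next_index):
    """Serialize one connection record into 17 bytes."""
    return RECORD.pack(hash_key, state, seq_client, seq_server, next_index)


def unpack_record(raw):
    """Return (hash_key, state, seq_client, seq_server, next_index)."""
    return RECORD.unpack(raw)


raw = pack_record(0xDEADBEEF, 1, 1000, 5000, 0xFFFFFFFF)
print(RECORD.size)         # 17
print(unpack_record(raw))  # (3735928559, 1, 1000, 5000, 4294967295)
```

A compiler would typically pad such a record to 20 bytes unless told to pack it, so the in-memory size on a real GPU may differ from the sum of the field sizes.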

40 TCP Stream Reassembly [Figure: batch containing packets of connection1, connection2, connection3; axis: batch size] Sequential processing. Packet-level parallel processing.

41 TCP Stream Reassembly [Figure: batch containing packets A, B, C] Key insight: packets <A, B> are consecutive if Seq_B = (Seq_A + len_A).

44 TCP Stream Reassembly [Figure: packets A, B, C of a batch hashed by H(seq) and by H(seq+len) to pair consecutive packets]

45 TCP Stream Reassembly [Figure: packets A, B, C of a batch and their hash entries] Parallel processing.

46 TCP Stream Reassembly [Figure: packets A, B, C of a batch and their hash entries] next_packet: A B C; index: A B - C
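
The key insight can be turned into a minimal Python sketch: every packet is indexed by its sequence number, and every packet then looks up `seq + len` to find its successor. A Python dict keyed by (connection, sequence number) stands in for the hash table H, the loops run sequentially rather than one thread per packet, and `link_packets` is a hypothetical name.

```python
def link_packets(batch):
    """batch: list of (conn_id, seq, length) tuples in arrival order.

    Returns next_packet, where next_packet[i] is the index of the packet
    that directly follows packet i in its stream, or None if the
    successor is not in this batch.
    """
    # Phase 1: index every packet by H(seq).
    by_seq = {(conn, seq): i for i, (conn, seq, _) in enumerate(batch)}

    # Phase 2: each packet probes H(seq + len) for its successor.
    # The lookups are independent of each other, so they could run in parallel.
    next_packet = []
    for conn, seq, length in batch:
        succ = (conn, (seq + length) % 2**32)  # TCP sequence numbers wrap at 2^32
        next_packet.append(by_seq.get(succ))
    return next_packet


# Packets A, B, C of one connection, received in the order C, A, B
batch = [(1, 300, 50), (1, 100, 100), (1, 200, 100)]
print(link_packets(batch))  # [None, 2, 0]
```

Following the links from A (index 1) gives B (index 2) and then C (index 0), which restores the stream order without sorting. The sketch ignores retransmissions and zero-length segments.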

47 Other TCP corner cases: TCP sequence holes. Out-of-order packets.

49 Evaluation: Forwarding. Latency. Individual applications. Consolidated applications.

50 Evaluation Setup [Diagram: packet generator (4x 10GbE ports) and GASPP machine (4x 10GbE ports) exchanging generated and forwarded traffic at 40 Gbit/s] GASPP machine has: 2x NUMA nodes (Intel Xeon E GHz quad-core CPUs); 2x banks of 6GB of DDR3 1066MHz RAM; 2x Intel 82599EB network adapters (with dual 10GbE ports); 2x NVIDIA GTX480 graphics cards.

51 Basic Forwarding [Figure: throughput (Gbit/s) vs. packet size (bytes) for CPU (8x cores) and GASPP; series: Rx+Tx, Rx+GPU+Tx, Effective]

52 Latency. 8192 batch: CPU: 0.48 us; GASPP: 3.87 ms. 1024 batch: 0.49 ms; same performance for basic forwarding, but 2x-4x throughput slowdown for heavyweight processing applications.

53 Individual Applications: Each application is written as a GPU kernel. No CPU-side development. Speedup over a single CPU core:

Application    GASPP (8192 batch)    GASPP (1024 batch)
Firewall       3.6x                  3.6x
StringMatch    28.4x                 9.3x
RegExMatch     173.1x                36.9x
AES            14.6x                 6.5x

54 Consolidating Applications. GASPP: Firewall: -; Firewall + StringMatch: 1.19x; Firewall + StringMatch + RegExMatch: 2.12x; Firewall + StringMatch + RegExMatch + AES: 1.93x. GASPP reduces irregular execution by X.

55 Conclusions. What we offer: fast inter-device data transfers; GPU-based flow state management and stream reconstruction; efficient packet scheduling mechanisms. Limitations: high packet processing latency.

GASPP: A GPU-Accelerated Stateful Packet Processing Framework. Giorgos Vasiliadis and Lazaros Koromilas, FORTH-ICS; Michalis Polychronakis, Columbia University; Sotiris Ioannidis, FORTH-ICS. https://www.usenix.org/conference/atc14/technical-sessions/presentation/vasiliadis

More information

PacketShader: A GPU-Accelerated Software Router

PacketShader: A GPU-Accelerated Software Router PacketShader: A GPU-Accelerated Software Router Sangjin Han In collaboration with: Keon Jang, KyoungSoo Park, Sue Moon Advanced Networking Lab, CS, KAIST Networked and Distributed Computing Systems Lab,

More information

소프트웨어기반고성능침입탐지시스템설계및구현

소프트웨어기반고성능침입탐지시스템설계및구현 소프트웨어기반고성능침입탐지시스템설계및구현 KyoungSoo Park Department of Electrical Engineering, KAIST M. Asim Jamshed *, Jihyung Lee*, Sangwoo Moon*, Insu Yun *, Deokjin Kim, Sungryoul Lee, Yung Yi* Department of Electrical

More information

G-NET: Effective GPU Sharing In NFV Systems

G-NET: Effective GPU Sharing In NFV Systems G-NET: Effective Sharing In NFV Systems Kai Zhang*, Bingsheng He^, Jiayu Hu #, Zeke Wang^, Bei Hua #, Jiayi Meng #, Lishan Yang # *Fudan University ^National University of Singapore #University of Science

More information

GPGPU introduction and network applications. PacketShaders, SSLShader

GPGPU introduction and network applications. PacketShaders, SSLShader GPGPU introduction and network applications PacketShaders, SSLShader Agenda GPGPU Introduction Computer graphics background GPGPUs past, present and future PacketShader A GPU-Accelerated Software Router

More information

The Power of Batching in the Click Modular Router

The Power of Batching in the Click Modular Router The Power of Batching in the Click Modular Router Joongi Kim, Seonggu Huh, Keon Jang, * KyoungSoo Park, Sue Moon Computer Science Dept., KAIST Microsoft Research Cambridge, UK * Electrical Engineering

More information

NetSlices: Scalable Mul/- Core Packet Processing in User- Space

NetSlices: Scalable Mul/- Core Packet Processing in User- Space NetSlices: Scalable Mul/- Core Packet Processing in - Space Tudor Marian, Ki Suh Lee, Hakim Weatherspoon Cornell University Presented by Ki Suh Lee Packet Processors Essen/al for evolving networks Sophis/cated

More information

PacketShader as a Future Internet Platform

PacketShader as a Future Internet Platform PacketShader as a Future Internet Platform AsiaFI Summer School 2011.8.11. Sue Moon in collaboration with: Joongi Kim, Seonggu Huh, Sangjin Han, Keon Jang, KyoungSoo Park Advanced Networking Lab, CS, KAIST

More information

Security Applica.ons of GPUs. Giorgos Vasiliadis Founda.on for Research and Technology Hellas (FORTH)

Security Applica.ons of GPUs. Giorgos Vasiliadis Founda.on for Research and Technology Hellas (FORTH) Security Applica.ons of GPUs Founda.on for Research and Technology Hellas (FORTH) Outline Background and mo.va.on GPU- based Malware Signature- based Detec.on Network intrusion detec.on/preven.on Virus

More information

Recent Advances in Software Router Technologies

Recent Advances in Software Router Technologies Recent Advances in Software Router Technologies KRNET 2013 2013.6.24-25 COEX Sue Moon In collaboration with: Sangjin Han 1, Seungyeop Han 2, Seonggu Huh 3, Keon Jang 4, Joongi Kim, KyoungSoo Park 5 Advanced

More information

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration

More information

Network Coding: Theory and Applica7ons

Network Coding: Theory and Applica7ons Network Coding: Theory and Applica7ons PhD Course Part IV Tuesday 9.15-12.15 18.6.213 Muriel Médard (MIT), Frank H. P. Fitzek (AAU), Daniel E. Lucani (AAU), Morten V. Pedersen (AAU) Plan Hello World! Intra

More information

Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migra<on

Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migra<on Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migra

More information

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth

More information

Improving Packet Processing Performance of a Memory- Bounded Application

Improving Packet Processing Performance of a Memory- Bounded Application Improving Packet Processing Performance of a Memory- Bounded Application Jörn Schumacher CERN / University of Paderborn, Germany jorn.schumacher@cern.ch On behalf of the ATLAS FELIX Developer Team LHCb

More information

Profiling & Tuning Applica1ons. CUDA Course July István Reguly

Profiling & Tuning Applica1ons. CUDA Course July István Reguly Profiling & Tuning Applica1ons CUDA Course July 21-25 István Reguly Introduc1on Why is my applica1on running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA,

More information

LegUp: Accelerating Memcached on Cloud FPGAs

LegUp: Accelerating Memcached on Cloud FPGAs 0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are

More information

Parallelizing IPsec: switching SMP to On is not even half the way

Parallelizing IPsec: switching SMP to On is not even half the way Parallelizing IPsec: switching SMP to On is not even half the way Steffen Klassert secunet Security Networks AG Dresden June 11 2010 Table of contents Some basics about IPsec About the IPsec performance

More information

Rhythm: Harnessing Data Parallel Hardware for Server Workloads

Rhythm: Harnessing Data Parallel Hardware for Server Workloads Rhythm: Harnessing Data Parallel Hardware for Server Workloads Sandeep R. Agrawal $ Valentin Pistol $ Jun Pang $ John Tran # David Tarjan # Alvin R. Lebeck $ $ Duke CS # NVIDIA Explosive Internet Growth

More information

Gregex: GPU based High Speed Regular Expression Matching Engine

Gregex: GPU based High Speed Regular Expression Matching Engine 11 Fifth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing Gregex: GPU based High Speed Regular Expression Matching Engine Lei Wang 1, Shuhui Chen 2, Yong Tang

More information

Implemen'ng IPv6 Segment Rou'ng in the Linux Kernel

Implemen'ng IPv6 Segment Rou'ng in the Linux Kernel Implemen'ng IPv6 Segment Rou'ng in the Linux Kernel David Lebrun, Olivier Bonaventure ICTEAM, UCLouvain Work supported by ARC grant 12/18-054 (ARC-SDN) and a Cisco grant Agenda IPv6 Segment Rou'ng Implementa'on

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

ntop Users Group Meeting

ntop Users Group Meeting ntop Users Group Meeting PF_RING Tutorial Alfredo Cardigliano Overview Introduction Installation Configuration Tuning Use cases PF_RING Open source packet processing framework for

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

GPU Cluster Computing. Advanced Computing Center for Research and Education

GPU Cluster Computing. Advanced Computing Center for Research and Education GPU Cluster Computing Advanced Computing Center for Research and Education 1 What is GPU Computing? Gaming industry and high- defini3on graphics drove the development of fast graphics processing Use of

More information

An Intelligent NIC Design Xin Song

An Intelligent NIC Design Xin Song 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) An Intelligent NIC Design Xin Song School of Electronic and Information Engineering Tianjin Vocational

More information

Analysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth

Analysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth Analysis Report v3 Duration 932.612 µs Grid Size [ 1024,1,1 ] Block Size [ 1024,1,1 ] Registers/Thread 32 Shared Memory/Block 28 KiB Shared Memory Requested 64 KiB Shared Memory Executed 64 KiB Shared

More information

SWM: Simplified Wu-Manber for GPU-based Deep Packet Inspection

SWM: Simplified Wu-Manber for GPU-based Deep Packet Inspection SWM: Simplified Wu-Manber for GPU-based Deep Packet Inspection Lucas Vespa Department of Computer Science University of Illinois at Springfield lvesp@uis.edu Ning Weng Department of Electrical and Computer

More information

High-Performance Packet Classification on GPU

High-Performance Packet Classification on GPU High-Performance Packet Classification on GPU Shijie Zhou, Shreyas G. Singapura, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California 1 Outline Introduction

More information

Avid Configuration Guidelines Lenovo P720 workstation Dual 8 to 28 Core CPU System

Avid Configuration Guidelines Lenovo P720 workstation Dual 8 to 28 Core CPU System Avid Configuration Guidelines Lenovo P720 workstation Dual 8 to 28 Core CPU System Page 1 of 14 Dave Pimm Avid Technology April 25, 2018 1.) Lenovo P720 AVID Qualified System Specification: P720 Hardware

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

A Next Generation Home Access Point and Router

A Next Generation Home Access Point and Router A Next Generation Home Access Point and Router Product Marketing Manager Network Communication Technology and Application of the New Generation Points of Discussion Why Do We Need a Next Gen Home Router?

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

PASTE: A Network Programming Interface for Non-Volatile Main Memory

PASTE: A Network Programming Interface for Non-Volatile Main Memory PASTE: A Network Programming Interface for Non-Volatile Main Memory Michio Honda (NEC Laboratories Europe) Giuseppe Lettieri (Università di Pisa) Lars Eggert and Douglas Santry (NetApp) USENIX NSDI 2018

More information

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme NET1343BU NSX Performance Samuel Kommu #VMworld #NET1343BU Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no

More information

Supra-linear Packet Processing Performance with Intel Multi-core Processors

Supra-linear Packet Processing Performance with Intel Multi-core Processors White Paper Dual-Core Intel Xeon Processor LV 2.0 GHz Communications and Networking Applications Supra-linear Packet Processing Performance with Intel Multi-core Processors 1 Executive Summary Advances

More information

Networking at the Speed of Light

Networking at the Speed of Light Networking at the Speed of Light Dror Goldenberg VP Software Architecture MaRS Workshop April 2017 Cloud The Software Defined Data Center Resource virtualization Efficient services VM, Containers uservices

More information

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Pilar González-Férez and Angelos Bilas 31 th International Conference on Massive Storage Systems

More information

Learning with Purpose

Learning with Purpose Network Measurement for 100Gbps Links Using Multicore Processors Xiaoban Wu, Dr. Peilong Li, Dr. Yongyi Ran, Prof. Yan Luo Department of Electrical and Computer Engineering University of Massachusetts

More information

27 March 2018 Mikael Arguedas and Morgan Quigley

27 March 2018 Mikael Arguedas and Morgan Quigley 27 March 2018 Mikael Arguedas and Morgan Quigley Separate devices: (prototypes 0-3) Unified camera: (prototypes 4-5) Unified system: (prototypes 6+) USB3 USB Host USB3 USB2 USB3 USB Host PCIe root

More information

Multi-Layer Packet Classification with Graphics Processing Units

Multi-Layer Packet Classification with Graphics Processing Units Multi-Layer Packet Classification with Graphics Processing Units Matteo Varvello, Rafael Laufer, Feixiong Zhang, T.V. Lakshman Telefonica Research, matteo.varvello@telefonica.com Bell Labs, {firstname.lastname}@alcatel-lucent.com

More information

Re-architecting Virtualization in Heterogeneous Multicore Systems

Re-architecting Virtualization in Heterogeneous Multicore Systems Re-architecting Virtualization in Heterogeneous Multicore Systems Himanshu Raj, Sanjay Kumar, Vishakha Gupta, Gregory Diamos, Nawaf Alamoosa, Ada Gavrilovska, Karsten Schwan, Sudhakar Yalamanchili College

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

High bandwidth, Long distance. Where is my throughput? Robin Tasker CCLRC, Daresbury Laboratory, UK

High bandwidth, Long distance. Where is my throughput? Robin Tasker CCLRC, Daresbury Laboratory, UK High bandwidth, Long distance. Where is my throughput? Robin Tasker CCLRC, Daresbury Laboratory, UK [r.tasker@dl.ac.uk] DataTAG is a project sponsored by the European Commission - EU Grant IST-2001-32459

More information

Control Center 15 Performance Reference Guide

Control Center 15 Performance Reference Guide Control Center 15 Performance Reference Guide Control Center front-end application This guide provides information about Control Center 15 components that may be useful when planning a system. System specifications

More information

High Performance Packet Processing with FlexNIC

High Performance Packet Processing with FlexNIC High Performance Packet Processing with FlexNIC Antoine Kaufmann, Naveen Kr. Sharma Thomas Anderson, Arvind Krishnamurthy University of Washington Simon Peter The University of Texas at Austin Ethernet

More information

ASN Configuration Best Practices

ASN Configuration Best Practices ASN Configuration Best Practices Managed machine Generally used CPUs and RAM amounts are enough for the managed machine: CPU still allows us to read and write data faster than real IO subsystem allows.

More information

Lecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)

Lecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material

More information

Slide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, DRAM Bandwidth

Slide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, DRAM Bandwidth Slide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2016 DRAM Bandwidth MEMORY ACCESS PERFORMANCE Objective To learn that memory bandwidth is a first-order performance factor in

More information

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem

More information

An FPGA-Based Optical IOH Architecture for Embedded System

An FPGA-Based Optical IOH Architecture for Embedded System An FPGA-Based Optical IOH Architecture for Embedded System Saravana.S Assistant Professor, Bharath University, Chennai 600073, India Abstract Data traffic has tremendously increased and is still increasing

More information

Advanced Computer Networks. End Host Optimization

Advanced Computer Networks. End Host Optimization Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct

More information

Benchmark Tests of Asterisk as a B2BUA

Benchmark Tests of Asterisk as a B2BUA Benchmark Tests of Asterisk as a B2BUA Astricon 28 Jim.Dalton@TransNexus.com Why Test Methodology Results Agenda V1.4, 32 bit Fedora, Dual Xeon-Dual Core V1.4, 64 bit Redhat, Xeon Quad Core V1.6, 64 bit

More information

Speeding up Linux TCP/IP with a Fast Packet I/O Framework

Speeding up Linux TCP/IP with a Fast Packet I/O Framework Speeding up Linux TCP/IP with a Fast Packet I/O Framework Michio Honda Advanced Technology Group, NetApp michio@netapp.com With acknowledge to Kenichi Yasukata, Douglas Santry and Lars Eggert 1 Motivation

More information

DRAM Bank Organization

DRAM Bank Organization DRM andwidth DRM ank Organization Row ddr Row Decoder Memory Cell Core rray DRM Memory Cell Sense mps Column Latches Column ddr Mux Mux Off-chip Data DRM Core rrays are Slow DRM Core rrays are Slow DDR:

More information

Introduc)on to Xeon Phi

Introduc)on to Xeon Phi Introduc)on to Xeon Phi ACES Aus)n, TX Dec. 04 2013 Kent Milfeld, Luke Wilson, John McCalpin, Lars Koesterke TACC What is it? Co- processor PCI Express card Stripped down Linux opera)ng system Dense, simplified

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

ASPERA HIGH-SPEED TRANSFER. Moving the world s data at maximum speed

ASPERA HIGH-SPEED TRANSFER. Moving the world s data at maximum speed ASPERA HIGH-SPEED TRANSFER Moving the world s data at maximum speed ASPERA HIGH-SPEED FILE TRANSFER 80 GBIT/S OVER IP USING DPDK Performance, Code, and Architecture Charles Shiflett Developer of next-generation

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

Demystifying Network Cards

Demystifying Network Cards Demystifying Network Cards Paul Emmerich December 27, 2017 Chair of Network Architectures and Services About me PhD student at Researching performance of software packet processing systems Mostly working

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

High Performance Computing on GPUs using NVIDIA CUDA
