GASPP: A GPU-Accelerated Stateful Packet Processing Framework
1 GASPP: A GPU-Accelerated Stateful Packet Processing Framework Giorgos Vasiliadis, FORTH-ICS, Greece Lazaros Koromilas, FORTH-ICS, Greece Michalis Polychronakis, Columbia University, USA Sotiris Ioannidis, FORTH-ICS, Greece
2 Network Packet Processing Computationally and memory-intensive High levels of data parallelism Each packet can be processed in parallel Poor temporal locality for data Typically, each packet is processed only once
3 GPU = Graphics Processing Units Highly parallel manycore devices Hundreds of cores High memory bandwidth Up to 6GB of memory
4 GPUs for Network Packet Processing Gnort [RAID 08] PacketShader [SIGCOMM 10] SSLShader [NSDI 11] MIDeA [CCS 11], Kargus [CCS 12]
5 GPUs for Network Packet Processing Gnort [RAID 08] PacketShader [SIGCOMM 10] SSLShader [NSDI 11] MIDeA [CCS 11], Kargus [CCS 12] Independent/Monolithic Designs 1. A lot of CPU-side code even for simple apps 2. Explicit batching 3. Explicit data copies and PCIe transfers
6 Need a framework for developing GPU-accelerated packet processing applications
7 GASPP Framework AES Regex Match String Match Firewall TCP Flow State Management Packet Reordering Packet Scheduling Packet Decoding
8 GASPP Framework Fast user-space packet capturing Modular and flexible Efficient packet scheduling mechanisms TCP processing and flow management support
10 Fast user-space packet capturing [Figure: two data-path diagrams; labels: Main Memory, NIC DMA Buffer, GPU DMA Buffer, CPU, GPU] Use a single user-space buffer between the NIC and the GPU Stage packets back-to-back to a separate buffer
11 Fast user-space packet capturing [Figure: the same two diagrams, each with a plot of throughput (Gbit/s) against packet size (#bytes); a ">" marker between the two plots]
12 Fast user-space packet capturing [Figure: the same two diagrams, each with a plot of throughput (Gbit/s) against packet size (#bytes); a "<" marker between the two plots]
13 Why staging is better than zero-copy (for small packets) NIC's packet buffer: Staging buffer: Better space utilization => No redundant transfers
14 Selective scheme [Figure: data-path diagrams; labels: Main Memory, GPU DMA Buffer, NIC DMA Buffer, CPU, GPU] Packets are copied back-to-back to a separate buffer if the buffer occupancy is sparse; otherwise, they are transferred directly to the GPU
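The selective scheme on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1536-byte slot size, the 0.5 occupancy threshold, and all function names are hypothetical, since the slides do not give GASPP's actual values or interfaces.

```python
SLOT_SIZE = 1536            # assumed fixed slot size of the NIC DMA buffer, in bytes
OCCUPANCY_THRESHOLD = 0.5   # assumed cut-off below which the buffer counts as sparse


def occupancy(packet_lengths, slot_size=SLOT_SIZE):
    """Fraction of the occupied NIC buffer slots that holds actual packet bytes."""
    if not packet_lengths:
        return 0.0
    return sum(packet_lengths) / (len(packet_lengths) * slot_size)


def choose_transfer(packet_lengths, threshold=OCCUPANCY_THRESHOLD):
    """Return 'stage' to copy packets back-to-back first, else 'direct'."""
    return "stage" if occupancy(packet_lengths) < threshold else "direct"


def bytes_transferred(packet_lengths, mode, slot_size=SLOT_SIZE):
    """Bytes sent to the GPU: packed payload when staging, whole slots otherwise."""
    if mode == "stage":
        return sum(packet_lengths)
    return len(packet_lengths) * slot_size
```

Under these assumptions, 100 packets of 64 bytes fill only about 4% of their slots, so staging moves 6,400 bytes where a direct transfer of the slots would move 153,600. For 1500-byte packets the slots are nearly full and the extra copy is not worth it.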
15 GASPP Framework Fast user-space packet capturing Modular and flexible Efficient packet scheduling mechanisms TCP processing and flow management support
16 Modular and Flexible Basic abstraction of processing: "modules" processpacket(packet){... } Modules are executed sequentially or in parallel [Figure: RX, then three processpacket( ) stages, then TX] Module 1: IP-learn Module 2: Content Inspection Module 3: Encryption
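A rough Python sketch of the module abstraction, assuming each module exposes a single processpacket(packet) entry point as the slide suggests. The Packet class, the three module classes, and run_sequential are hypothetical names for illustration, and the XOR "encryption" is only a stand-in, not real cryptography. In GASPP the modules run as GPU code; this host-side sketch only shows how modules chain.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    src_ip: str
    payload: bytes
    tags: list = field(default_factory=list)


class IPLearn:
    """Module 1 (hypothetical): remember every source address seen."""
    def __init__(self):
        self.seen = set()

    def processpacket(self, packet):
        self.seen.add(packet.src_ip)
        return packet


class ContentInspection:
    """Module 2 (hypothetical): tag packets whose payload contains a pattern."""
    def __init__(self, pattern):
        self.pattern = pattern

    def processpacket(self, packet):
        if self.pattern in packet.payload:
            packet.tags.append("match")
        return packet


class Encryption:
    """Module 3 (stand-in): XOR the payload with a one-byte key."""
    def __init__(self, key):
        self.key = key

    def processpacket(self, packet):
        packet.payload = bytes(b ^ self.key for b in packet.payload)
        return packet


def run_sequential(modules, batch):
    """Pass the whole batch through each module in turn."""
    for module in modules:
        batch = [module.processpacket(p) for p in batch]
    return batch
```

Because every module has the same entry point, adding a fourth module to the chain only means appending it to the list passed to run_sequential.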
17 Batch Processing Pipeline Time RX Module1 Module2 TX
18 Batch Processing Pipeline Time RX Module1 Module2 TX RX batch
19 Batch Processing Pipeline Time RX Module1 Module2 TX RX batch copy to GPU Batch processing
21 Batch Processing Pipeline Time RX Module1 Module2 TX RX batch copy to GPU Batch processing copy to CPU TX batch
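The five stages named on this slide can be laid out on a timeline with a short Python sketch. It assumes, for illustration only, that every stage takes one time step and that consecutive batches overlap as the word "pipeline" suggests; pipeline_schedule is a hypothetical helper, not part of GASPP.

```python
STAGES = ["RX batch", "copy to GPU", "batch processing", "copy to CPU", "TX batch"]


def pipeline_schedule(num_batches, stages=STAGES):
    """List, for each time step, the (batch, stage) pairs active in it.

    Assumes stage s of batch b runs in step b + s, so different batches
    occupy different stages at the same time.
    """
    steps = {}
    for b in range(num_batches):
        for s, stage in enumerate(stages):
            steps.setdefault(b + s, []).append((b, stage))
    return [steps[t] for t in sorted(steps)]
```

With three batches the schedule has seven steps. In the third step, batch 0 is being processed on the GPU while batch 1 is copied to the GPU and batch 2 is still being received.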
22 GASPP Framework Fast user-space packet capturing Modular and flexible Efficient packet scheduling mechanisms TCP processing and flow management support
23 Single Instruction, Multiple Threads SIMT group (warp) Threads within the same warp have to execute the same instructions Great for regular computations!
24 Parallelism in packet processing Network traffic Batch Size (#packets) Network packets are processed in batches More packets => more parallelism
25 Dynamic Irregularities Batch Size (#packets) The mix of received network packets is very dynamic
26 Dynamic Irregularities Batch Size (#packets) The mix of received network packets is very dynamic Different packet lengths
27 Dynamic Irregularities module 1 module 2 module 3 Batch Size (#packets) The mix of received network packets is very dynamic Different packet lengths Divergent parallel module processing
28 Dynamic Irregularities module 1 module 2 module 3 Time warp 1 warp 2 warp 3 warp 4 warp 5 warp 6 warp 7
32 Dynamic Irregularities module 1 module 2 module 3 Time warp 1 warp 2 warp 3 warp 4 warp 5 warp 6 warp 7 Low warp occupancy
33 Packet grouping Batch Size Batch Size
34 Packet grouping Time warp 1 warp 2 warp 3 warp 4 warp 5 warp 6 warp 7 Harmonized execution Symmetric processing
35 GASPP Framework Fast user-space packet capturing Modular and flexible Efficient packet scheduling mechanisms TCP processing and flow management support
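The effect of packet grouping on warps can be illustrated with a small Python sketch. It assumes that grouping means ordering the batch by target module and then by packet length before threads are assigned; the slides do not spell out the mechanism, so this ordering and all function names are hypothetical. Each packet is modelled as a (module_id, length) tuple.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def split_into_warps(packets, warp_size=WARP_SIZE):
    """Assign consecutive packets to the threads of consecutive warps."""
    return [packets[i:i + warp_size] for i in range(0, len(packets), warp_size)]


def divergent_warps(packets, warp_size=WARP_SIZE):
    """Count warps whose packets are bound for more than one module."""
    return sum(
        1
        for warp in split_into_warps(packets, warp_size)
        if len({module for module, _ in warp}) > 1
    )


def group_packets(packets):
    """Order packets by module, then by length, so neighbours behave alike."""
    return sorted(packets, key=lambda p: (p[0], p[1]))
```

For a batch of eight packets interleaved across three modules and a warp size of 4, both warps diverge before grouping; after grouping, only the warp that straddles the boundary between module 2 and module 3 does.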
36 TCP Flow State Management Maintain the state of TCP connections Connection Table Connection Records Connection Record: Hash key: 4 bytes State: 1 byte Seq CLIENT: 4 bytes Seq SERVER: 4 bytes Next: 4 bytes
37 TCP Stream Reassembly Batch Size connection1 connection2 connection3
38 TCP Stream Reassembly Batch Size connection1 connection2 connection3 Sequential processing
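The connection record layout on this slide adds up to 17 bytes, which can be checked with Python's struct module. The field order follows the slide; the little-endian, unpadded encoding and the helper names are assumptions for illustration, and the Next field is presumably a link to another record, which the slide does not confirm.

```python
import struct

# Field sizes from the slide: hash key (4), state (1), client sequence (4),
# server sequence (4), next (4). "<" means little-endian without padding;
# the byte order is an assumption.
RECORD = struct.Struct("<IBIII")


def pack_record(hash_key, state, seq_client, seq_server, next_index):
    """Serialize one connection record into 17 bytes."""
    return RECORD.pack(hash_key, state, seq_client, seq_server, next_index)


def unpack_record(raw):
    """Return (hash_key, state, seq_client, seq_server, next_index)."""
    return RECORD.unpack(raw)
```

A compact fixed-size record like this keeps the connection table dense, so many records fit in GPU memory and each one can be located by index.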
40 TCP Stream Reassembly Batch Size connection1 connection2 connection3 Sequential processing Packet-level parallel processing
41 TCP Stream Reassembly Batch Size A B C Key insight: packets <A, B> are consecutive if Seq_B = Seq_A + len_A
42 TCP Stream Reassembly Batch Size A B C H(Seq) H(seq+len) A A
43 TCP Stream Reassembly Batch Size A B C H(Seq) H(seq+len) A B A B
44 TCP Stream Reassembly Batch Size A B C H(Seq) H(seq+len) A B A C B C
45 TCP Stream Reassembly Batch Size A B C A B A C B C Parallel Processing
46 TCP Stream Reassembly Batch Size A B C A B A C B C next_packet: A B C index: A B - C
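The H(Seq) and H(seq+len) steps on these slides can be sketched in Python with a dictionary standing in for the hash table. The function name, the (connection_id, seq, length) tuple layout, and the use of None for "no successor in this batch" are assumptions for illustration. Packets are assumed to carry payload (length > 0), and the loops here run one after another, whereas on the GPU each packet would be handled by its own thread.

```python
def link_packets(batch):
    """Return next_packet, where next_packet[i] is the index of the packet
    that directly follows packet i in its TCP stream, or None.

    Each packet is a (connection_id, seq, length) tuple.
    """
    # Phase 1: every packet publishes its own index under H(seq).
    by_seq = {(conn, seq): i for i, (conn, seq, _) in enumerate(batch)}
    # Phase 2: every packet looks up H(seq + len) to find its successor.
    # Sequence numbers are 32-bit, so the sum wraps around.
    return [
        by_seq.get((conn, (seq + length) % 2**32))
        for conn, seq, length in batch
    ]
```

Neither phase depends on the order of packets in the batch, so an out-of-order pair is linked just as well as an in-order one, and packets of different connections never link to each other because the connection is part of the key.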
47 Other TCP corner cases TCP sequence holes Out-of-order packets
49 Evaluation Forwarding Latency Individual Applications Consolidated applications
50 Evaluation Setup [Figure: packet generator (4x 10GbE ports) and GASPP machine (4x 10GbE ports), 40 Gbit/s; arrows labelled "generated traffic" and "forwarded traffic"] GASPP machine has: 2x NUMA nodes (Intel Xeon E GHz quad-core CPUs) 2x banks of 6GB of DDR3 1066MHz RAM 2x Intel 82599EB network adapters (with dual 10GbE ports) 2x NVIDIA GTX480 graphics cards
51 Basic Forwarding [Plot: throughput (Gbit/s) against packet size (bytes); legend entries: Rx+Tx, Rx+GPU+Tx, Effective, CPU (8x cores), GASPP]
52 Latency 8192 batch: CPU: 0.48 us GASPP: 3.87 ms 1024 batch: 0.49 ms Same performance for basic forwarding, but 2x-4x throughput slowdown for heavyweight processing applications
53 Individual Applications Each application is written as a GPU kernel No CPU-side development Speedup over a single CPU core:
Application    GASPP (8192 batch)    GASPP (1024 batch)
Firewall       3.6x                  3.6x
StringMatch    28.4x                 9.3x
RegExMatch     173.1x                36.9x
AES            14.6x                 6.5x
54 Consolidating Applications GASPP: Firewall - Firewall StringMatch Firewall StringMatch RegExMatch Firewall StringMatch RegExMatch AES 1.19x 2.12x 1.93x GASPP reduces irregular execution by X
55 Conclusions What we offer: Fast inter-device data transfers GPU-based flow state management and stream reconstruction Efficient packet scheduling mechanisms Limitations: High packet processing latency
56 GASPP: A GPU-Accelerated Stateful Packet Processing Framework Giorgos Vasiliadis, FORTH-ICS, Greece Lazaros Koromilas, FORTH-ICS, Greece Michalis Polychronakis, Columbia University, USA Sotiris Ioannidis, FORTH-ICS, Greece
GASPP: A GPU-Accelerated Stateful Packet Processing Framework Giorgos Vasiliadis and Lazaros Koromilas, FORTH-ICS; Michalis Polychronakis, Columbia University; Sotiris Ioannidis, FORTH-ICS https://www.usenix.org/conference/atc14/technical-sessions/presentation/vasiliadis
More informationPacketShader: A GPU-Accelerated Software Router
PacketShader: A GPU-Accelerated Software Router Sangjin Han In collaboration with: Keon Jang, KyoungSoo Park, Sue Moon Advanced Networking Lab, CS, KAIST Networked and Distributed Computing Systems Lab,
More information소프트웨어기반고성능침입탐지시스템설계및구현
소프트웨어기반고성능침입탐지시스템설계및구현 KyoungSoo Park Department of Electrical Engineering, KAIST M. Asim Jamshed *, Jihyung Lee*, Sangwoo Moon*, Insu Yun *, Deokjin Kim, Sungryoul Lee, Yung Yi* Department of Electrical
More informationG-NET: Effective GPU Sharing In NFV Systems
G-NET: Effective Sharing In NFV Systems Kai Zhang*, Bingsheng He^, Jiayu Hu #, Zeke Wang^, Bei Hua #, Jiayi Meng #, Lishan Yang # *Fudan University ^National University of Singapore #University of Science
More informationGPGPU introduction and network applications. PacketShaders, SSLShader
GPGPU introduction and network applications PacketShaders, SSLShader Agenda GPGPU Introduction Computer graphics background GPGPUs past, present and future PacketShader A GPU-Accelerated Software Router
More informationThe Power of Batching in the Click Modular Router
The Power of Batching in the Click Modular Router Joongi Kim, Seonggu Huh, Keon Jang, * KyoungSoo Park, Sue Moon Computer Science Dept., KAIST Microsoft Research Cambridge, UK * Electrical Engineering
More informationNetSlices: Scalable Mul/- Core Packet Processing in User- Space
NetSlices: Scalable Mul/- Core Packet Processing in - Space Tudor Marian, Ki Suh Lee, Hakim Weatherspoon Cornell University Presented by Ki Suh Lee Packet Processors Essen/al for evolving networks Sophis/cated
More informationPacketShader as a Future Internet Platform
PacketShader as a Future Internet Platform AsiaFI Summer School 2011.8.11. Sue Moon in collaboration with: Joongi Kim, Seonggu Huh, Sangjin Han, Keon Jang, KyoungSoo Park Advanced Networking Lab, CS, KAIST
More informationSecurity Applica.ons of GPUs. Giorgos Vasiliadis Founda.on for Research and Technology Hellas (FORTH)
Security Applica.ons of GPUs Founda.on for Research and Technology Hellas (FORTH) Outline Background and mo.va.on GPU- based Malware Signature- based Detec.on Network intrusion detec.on/preven.on Virus
More informationRecent Advances in Software Router Technologies
Recent Advances in Software Router Technologies KRNET 2013 2013.6.24-25 COEX Sue Moon In collaboration with: Sangjin Han 1, Seungyeop Han 2, Seonggu Huh 3, Keon Jang 4, Joongi Kim, KyoungSoo Park 5 Advanced
More informationFast packet processing in the cloud. Dániel Géhberger Ericsson Research
Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration
More informationNetwork Coding: Theory and Applica7ons
Network Coding: Theory and Applica7ons PhD Course Part IV Tuesday 9.15-12.15 18.6.213 Muriel Médard (MIT), Frank H. P. Fitzek (AAU), Daniel E. Lucani (AAU), Morten V. Pedersen (AAU) Plan Hello World! Intra
More informationEfficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migra<on
Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migra
More informationCSE 599 I Accelerated Computing - Programming GPUS. Memory performance
CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth
More informationImproving Packet Processing Performance of a Memory- Bounded Application
Improving Packet Processing Performance of a Memory- Bounded Application Jörn Schumacher CERN / University of Paderborn, Germany jorn.schumacher@cern.ch On behalf of the ATLAS FELIX Developer Team LHCb
More informationProfiling & Tuning Applica1ons. CUDA Course July István Reguly
Profiling & Tuning Applica1ons CUDA Course July 21-25 István Reguly Introduc1on Why is my applica1on running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA,
More informationLegUp: Accelerating Memcached on Cloud FPGAs
0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are
More informationParallelizing IPsec: switching SMP to On is not even half the way
Parallelizing IPsec: switching SMP to On is not even half the way Steffen Klassert secunet Security Networks AG Dresden June 11 2010 Table of contents Some basics about IPsec About the IPsec performance
More informationRhythm: Harnessing Data Parallel Hardware for Server Workloads
Rhythm: Harnessing Data Parallel Hardware for Server Workloads Sandeep R. Agrawal $ Valentin Pistol $ Jun Pang $ John Tran # David Tarjan # Alvin R. Lebeck $ $ Duke CS # NVIDIA Explosive Internet Growth
More informationGregex: GPU based High Speed Regular Expression Matching Engine
11 Fifth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing Gregex: GPU based High Speed Regular Expression Matching Engine Lei Wang 1, Shuhui Chen 2, Yong Tang
More informationImplemen'ng IPv6 Segment Rou'ng in the Linux Kernel
Implemen'ng IPv6 Segment Rou'ng in the Linux Kernel David Lebrun, Olivier Bonaventure ICTEAM, UCLouvain Work supported by ARC grant 12/18-054 (ARC-SDN) and a Cisco grant Agenda IPv6 Segment Rou'ng Implementa'on
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationntop Users Group Meeting
ntop Users Group Meeting PF_RING Tutorial Alfredo Cardigliano Overview Introduction Installation Configuration Tuning Use cases PF_RING Open source packet processing framework for
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationGPU Cluster Computing. Advanced Computing Center for Research and Education
GPU Cluster Computing Advanced Computing Center for Research and Education 1 What is GPU Computing? Gaming industry and high- defini3on graphics drove the development of fast graphics processing Use of
More informationAn Intelligent NIC Design Xin Song
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) An Intelligent NIC Design Xin Song School of Electronic and Information Engineering Tianjin Vocational
More informationAnalysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth
Analysis Report v3 Duration 932.612 µs Grid Size [ 1024,1,1 ] Block Size [ 1024,1,1 ] Registers/Thread 32 Shared Memory/Block 28 KiB Shared Memory Requested 64 KiB Shared Memory Executed 64 KiB Shared
More informationSWM: Simplified Wu-Manber for GPU-based Deep Packet Inspection
SWM: Simplified Wu-Manber for GPU-based Deep Packet Inspection Lucas Vespa Department of Computer Science University of Illinois at Springfield lvesp@uis.edu Ning Weng Department of Electrical and Computer
More informationHigh-Performance Packet Classification on GPU
High-Performance Packet Classification on GPU Shijie Zhou, Shreyas G. Singapura, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California 1 Outline Introduction
More informationAvid Configuration Guidelines Lenovo P720 workstation Dual 8 to 28 Core CPU System
Avid Configuration Guidelines Lenovo P720 workstation Dual 8 to 28 Core CPU System Page 1 of 14 Dave Pimm Avid Technology April 25, 2018 1.) Lenovo P720 AVID Qualified System Specification: P720 Hardware
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationA Next Generation Home Access Point and Router
A Next Generation Home Access Point and Router Product Marketing Manager Network Communication Technology and Application of the New Generation Points of Discussion Why Do We Need a Next Gen Home Router?
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationPASTE: A Network Programming Interface for Non-Volatile Main Memory
PASTE: A Network Programming Interface for Non-Volatile Main Memory Michio Honda (NEC Laboratories Europe) Giuseppe Lettieri (Università di Pisa) Lars Eggert and Douglas Santry (NetApp) USENIX NSDI 2018
More informationDisclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme
NET1343BU NSX Performance Samuel Kommu #VMworld #NET1343BU Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no
More informationSupra-linear Packet Processing Performance with Intel Multi-core Processors
White Paper Dual-Core Intel Xeon Processor LV 2.0 GHz Communications and Networking Applications Supra-linear Packet Processing Performance with Intel Multi-core Processors 1 Executive Summary Advances
More informationNetworking at the Speed of Light
Networking at the Speed of Light Dror Goldenberg VP Software Architecture MaRS Workshop April 2017 Cloud The Software Defined Data Center Resource virtualization Efficient services VM, Containers uservices
More informationReducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet
Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Pilar González-Férez and Angelos Bilas 31 th International Conference on Massive Storage Systems
More informationLearning with Purpose
Network Measurement for 100Gbps Links Using Multicore Processors Xiaoban Wu, Dr. Peilong Li, Dr. Yongyi Ran, Prof. Yan Luo Department of Electrical and Computer Engineering University of Massachusetts
More information27 March 2018 Mikael Arguedas and Morgan Quigley
27 March 2018 Mikael Arguedas and Morgan Quigley Separate devices: (prototypes 0-3) Unified camera: (prototypes 4-5) Unified system: (prototypes 6+) USB3 USB Host USB3 USB2 USB3 USB Host PCIe root
More informationMulti-Layer Packet Classification with Graphics Processing Units
Multi-Layer Packet Classification with Graphics Processing Units Matteo Varvello, Rafael Laufer, Feixiong Zhang, T.V. Lakshman Telefonica Research, matteo.varvello@telefonica.com Bell Labs, {firstname.lastname}@alcatel-lucent.com
More informationRe-architecting Virtualization in Heterogeneous Multicore Systems
Re-architecting Virtualization in Heterogeneous Multicore Systems Himanshu Raj, Sanjay Kumar, Vishakha Gupta, Gregory Diamos, Nawaf Alamoosa, Ada Gavrilovska, Karsten Schwan, Sudhakar Yalamanchili College
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationHigh bandwidth, Long distance. Where is my throughput? Robin Tasker CCLRC, Daresbury Laboratory, UK
High bandwidth, Long distance. Where is my throughput? Robin Tasker CCLRC, Daresbury Laboratory, UK [r.tasker@dl.ac.uk] DataTAG is a project sponsored by the European Commission - EU Grant IST-2001-32459
More informationControl Center 15 Performance Reference Guide
Control Center 15 Performance Reference Guide Control Center front-end application This guide provides information about Control Center 15 components that may be useful when planning a system. System specifications
More informationHigh Performance Packet Processing with FlexNIC
High Performance Packet Processing with FlexNIC Antoine Kaufmann, Naveen Kr. Sharma Thomas Anderson, Arvind Krishnamurthy University of Washington Simon Peter The University of Texas at Austin Ethernet
More informationASN Configuration Best Practices
ASN Configuration Best Practices Managed machine Generally used CPUs and RAM amounts are enough for the managed machine: CPU still allows us to read and write data faster than real IO subsystem allows.
More informationLecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)
Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material
More informationSlide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, DRAM Bandwidth
Slide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2016 DRAM Bandwidth MEMORY ACCESS PERFORMANCE Objective To learn that memory bandwidth is a first-order performance factor in
More informationGPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP
GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem
More informationAn FPGA-Based Optical IOH Architecture for Embedded System
An FPGA-Based Optical IOH Architecture for Embedded System Saravana.S Assistant Professor, Bharath University, Chennai 600073, India Abstract Data traffic has tremendously increased and is still increasing
More informationAdvanced Computer Networks. End Host Optimization
Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct
More informationBenchmark Tests of Asterisk as a B2BUA
Benchmark Tests of Asterisk as a B2BUA Astricon 28 Jim.Dalton@TransNexus.com Why Test Methodology Results Agenda V1.4, 32 bit Fedora, Dual Xeon-Dual Core V1.4, 64 bit Redhat, Xeon Quad Core V1.6, 64 bit
More informationSpeeding up Linux TCP/IP with a Fast Packet I/O Framework
Speeding up Linux TCP/IP with a Fast Packet I/O Framework Michio Honda Advanced Technology Group, NetApp michio@netapp.com With acknowledge to Kenichi Yasukata, Douglas Santry and Lars Eggert 1 Motivation
More informationDRAM Bank Organization
DRM andwidth DRM ank Organization Row ddr Row Decoder Memory Cell Core rray DRM Memory Cell Sense mps Column Latches Column ddr Mux Mux Off-chip Data DRM Core rrays are Slow DRM Core rrays are Slow DDR:
More informationIntroduc)on to Xeon Phi
Introduc)on to Xeon Phi ACES Aus)n, TX Dec. 04 2013 Kent Milfeld, Luke Wilson, John McCalpin, Lars Koesterke TACC What is it? Co- processor PCI Express card Stripped down Linux opera)ng system Dense, simplified
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationASPERA HIGH-SPEED TRANSFER. Moving the world s data at maximum speed
ASPERA HIGH-SPEED TRANSFER Moving the world s data at maximum speed ASPERA HIGH-SPEED FILE TRANSFER 80 GBIT/S OVER IP USING DPDK Performance, Code, and Architecture Charles Shiflett Developer of next-generation
More informationNetwork Design Considerations for Grid Computing
Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom
More informationDemystifying Network Cards
Demystifying Network Cards Paul Emmerich December 27, 2017 Chair of Network Architectures and Services About me PhD student at Researching performance of software packet processing systems Mostly working
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationAvid Configuration Guidelines Lenovo P520/P520C workstation Single 6 to 18 Core CPU System P520 P520C
Avid Configuration Guidelines Lenovo P520/P520C workstation Single 6 to 18 Core CPU System P520 P520C Page 1 of 14 Dave Pimm Avid Technology April 25, 2018 1.) Lenovo P520 & P520C AVID Qualified System
More informationAvid Configuration Guidelines DELL 3430 workstation Tower 4 or 6 Core CPU System
Avid Configuration Guidelines DELL 3430 workstation Tower 4 or 6 Core CPU System Ports & Slots 1. Power Button 2. Optical Drive (optional) 3. Internal SD Card reader (Optional) 4. Universal Audio Jack
More informationScaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX
Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Inventing Internet TV Available in more than 190 countries 104+ million subscribers Lots of Streaming == Lots of Traffic
More informationAvid Configuration Guidelines DELL 3930 workstation 1U Rack 4 or 6 Core CPU System
Avid Configuration Guidelines DELL 3930 workstation 1U Rack 4 or 6 Core CPU System Ports & Slots 1. HDD activity light 2. Hard drive (2x3.5") (or 4x2.5" not shown) 3. Audio jack 4. USB 3.1 (Type C) 5.
More informationMRPB: Memory Request Priori1za1on for Massively Parallel Processors
MRPB: Memory Request Priori1za1on for Massively Parallel Processors Wenhao Jia, Princeton University Kelly A. Shaw, University of Richmond Margaret Martonosi, Princeton University Benefits of GPU Caches
More informationAvid Configuration Guidelines HP Z8 G4 workstation Dual 8 to 28 Core CPU System
Avid Configuration Guidelines HP Z8 G4 workstation Dual 8 to 28 Core CPU System Page 1 of 13 Dave Pimm Avid Technology April 23, 2018 1.) HP Z8 G4 AVID Qualified System Specification: Z8 G4 Hardware Configuration
More informationAgenda. General Organiza/on and architecture Structural/func/onal view of a computer Evolu/on/brief history of computer.
UNIT I: OVERVIEW Agenda General Organiza/on and architecture Structural/func/onal view of a computer Evolu/on/brief history of computer. Architecture & Organiza/on Computer Architecture is those abributes
More informationDCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture
DCS-ctrl: A Fast and Flexible ice-control Mechanism for ice-centric Server Architecture Dongup Kwon 1, Jaehyung Ahn 2, Dongju Chae 2, Mohammadamin Ajdari 2, Jaewon Lee 1, Suheon Bae 1, Youngsok Kim 1,
More informationMuch Faster Networking
Much Faster Networking David Riddoch driddoch@solarflare.com Copyright 2016 Solarflare Communications, Inc. All rights reserved. What is kernel bypass? The standard receive path The standard receive path
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationSoNIC: Precise Real1me So3ware Access and Control of Wired Networks. Ki Suh Lee, Han Wang, Hakim Weatherspoon Cornell University
SoNIC: Precise Real1me So3ware Access and Control of Wired s Ki Suh Lee, Han Wang, Hakim Weatherspoon Cornell University 4/11/13 SoNIC NSDI 2013 1 Interpacket Delay and Research Link Interpacket gap, spacing,
More informationFAQ. Release rc2
FAQ Release 19.02.0-rc2 January 15, 2019 CONTENTS 1 What does EAL: map_all_hugepages(): open failed: Permission denied Cannot init memory mean? 2 2 If I want to change the number of hugepages allocated,
More informationThe NE010 iwarp Adapter
The NE010 iwarp Adapter Gary Montry Senior Scientist +1-512-493-3241 GMontry@NetEffect.com Today s Data Center Users Applications networking adapter LAN Ethernet NAS block storage clustering adapter adapter
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationCUDA Performance Optimization. Patrick Legresley
CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations