A Real Time Controller for E-ELT

Size: px
Start display at page:

Download "A Real Time Controller for E-ELT"

Transcription

1 A Real Time Controller for E-ELT Addressing the jitter/latency constraints Maxime Lainé, Denis Perret LESIA / Observatoire de Paris Project # funded by European Commission under program H2020-EU coordinated in H2020-FETHPC-2014

2 Green Flash RTC prototypes or E-ELT AO system European project: Horizon 2020 Research and innovation program 3 years project (a year has passed already) Low cost and low energy consumption RTC Different kind of accelerators characterization GPU FPGA XeonPhi Public/private sector partnership

3 E-ELT Numbers & Methods Case Method Dimension Encoding Frequency Size Throughput MCAO SCAO Shack- Hartmann Shack- Hartmann 1.6k x 1.6k 16 bits 500Hz 800 x bits 1000Hz SCAO Pyramid 240 x bits 1000Hz 40.96Mb (5,12Mo) 10.24Mb (1,28Mo) ~1Mb (~125Ko) Gb/s Gb/s 1 Gb/s Sequential Slopes/MVM processing Segmented Slopes/MVM processing

4 E-ELT Numbers & Methods Case Method Dimension Encoding Frequency Size Throughput MCAO SCAO Shack- Hartmann Shack- Hartmann 1.6k x 1.6k 16 bits 500Hz 800 x bits 1000Hz SCAO Pyramid 240 x bits 1000Hz 40.96Mb (5,12Mo) 10.24Mb (1,28Mo) ~1Mb (~125Ko) Gb/s Gb/s 1 Gb/s MCAO Case: 40,96 Mb with 40Gb/s network => 1ms latency

5 Latency and jitter constraints Latency typical optic fiber: 4,9µs/Km Closest city 130km (Antofagasta) implies 0.630ms latency (+transfer time) Jitter Under 10% of overall latency in order to be manageable Too much jitter Too much frame skips Correction stability hard to reach MCAO 2ms/iter, SCAO 1ms/iter The nearer the better The lesser the better

6 Usual Telescope RTC SPARTA Design for VLT scales RTC BOX: x86 CPU FPGA DSP Co-processing Cluster: X86 CPU At ELT scales: needs for a super-calculator in the observatory

7 E-ELT GreenFlash RTC Prototype Sensor(s) Low latency Low jitter DM 40 GbE network High bandwidth Low latency Real Time Controller 40 GbE network High bandwidth Supervisor Telemetry High framerate High bandwidth Low latency Fast storage high throughput High bandwidth High throughput

8 Legacy GPU programming main { setup(); while(run){ recv( ); cudamemcpy(, HostToDevice); computing_kernel<<<>>>( ); cudamemcpy(, DeviceToHost); send( ); } } PCIe GPU GPU RAM 10GbE NIC CPU CPU RAM

9 Legacy GPU Programming cudamemcopy() overhead times (5.12Mo in, 64Ko out) Kernel launches overhead times Both cases : jitter of 20 to 30 µsec

10 Legacy GPU Programming cudamemcopy() overhead times (5.12Mo in, 64Ko out) Kernel launches overhead times Both cases : jitter of 20 to 30 µsec (40 µsec sometimes)

11 Legacy GPU programming MCAO : image 40.96Mb, commands 512Kb Over 40GbE network transfer takes almost 1ms Same for cudamemcpy() operations camera integration transfer i i+1 i+2 i-1 i i+1 transfer handling data / exec handling sequential processing segmented processing Leaves not enough to no time for computations Dec 21st 2016

12 GPUDirect + Custom FPGA NIC Allows third party PCI-e device p2p access Goals : Negates latency overhead and reduce jitter induced by cudamemcpy cudamemcopy() overhead times (5,12Mo in, 64Ko out) Linux Kernel module: expose CUDA buffers phy@

13 Persistent CUDA kernel Exports computation loop on GPU Goals : Reduce kernel launch jitter & start computations as soon as data arrive Kernel launches overhead times Uses memory polling for data arrival detection

14 Persistent kernel + GPUDirect main { setup(); persistent_kernel<<<>>>( ); while(run){ waitgpu( ); startdmatransfer( ); } } persistent_kernel( ){ while(run){ pollmemory( ); notifycpu( ); } } PCIe GPU notify GPU RAM start CPU Problem: GPU can t command FPGA Mapped CPU memory in CUDA for GPU/CPU notification 10GbE FPGA NIC CPU RAM

15 Persistent kernel + GPUDirect FPGA writes/reads directly to/from GPU memory Using only writes would be better though DMA Host ram CPU app Camera control UDP Offload Engine Latency measurement Camera protocol handler DMC protocol handler FPGA NIC measures answers DMA DMA DMA DMA PCI-e start GPU ram Pixels ring buffer DM com buffer FPGA control Sync thread GPU polling kernel compute kernels notify

16 GPUDirect & FPGA NIC - RTT FPGA PLDA XPressG5 GPU Tesla C2070 OS Debian wheezy Camera EVT HS-2000M 10GbE network µsec SCAO Pyramid case: 240 x 240 pixels, encoded on 16b No GPUDirect GPUDirect + persistent kernel Results are coherent with expectations iterations

17 Persistent kernel + GPUDirect If it scales to MCAO case camera integration i i+1 i+2 transfer transfer handling data / exec handling sequential processing segmented processing i-1 i i+1

18 IOMemory mapping (CUDA 7.5) cumemhostregister( ) using CU_MEMHOSTREGISTER_IOMEMORY flag main { setup(); persistent_kernel <<<>>>( ); } persistent_kernel( ){ while(run){ pollmemory( ); actual_computation; startdmatransfer( ); } } PCIe GPU GPU RAM Mapping FPGA addresses into CUDA memory space allows DMA control fromgpu start CPU 10GbE FPGA NIC CPU RAM

19 IOMemory mapping UDP Offload Engine Latency measurement Camera protocol handler DMC protocol handler FPGA NIC measures answers DMA DMA DMA DMA DMA PCI-e start Host ram GPU ram Pixels ring buffer DM com buffer CPU app Camera control FPGA control Meas. Comp. GPU polling kernel compute kernels Little to no improvements, but CPU free for other kind of computations

20 Conclusion / Perspectives Using GPUDirect and a persistent kernel allow efficient data delivery to the Real Time Controller GPUDirect approach can be applied to Supervisor using interruptions from FPGA for CPU execution control (kernel launch) Simulation setup to benchmark ELT SCAO/MCAO scales thoroughly Test with new hardware: PLDA ExpressKUS FPGA and Nvidia K40/80 & P100 GPUs. (and Arria10 FPGAs in a near future) Develop computation modules on FPGA to further reduce GPU load e.g.: Slopes computation for segmented processing

21 Thank you for your attention (Questions?) Project # funded by European Commission under program H2020-EU coordinated in H2020-FETHPC-2014

A Real Time Controller for E-ELT

A Real Time Controller for E-ELT A Real Time Controller for E-ELT Addressing the jitter/latency constraints Maxime Lainé, Denis Perret LESIA / Observatoire de Paris Project #671662 funded by European Commission under program H2020-EU.1.2.2

More information

GTC 2017 Green Flash Persistent Kernel : Real-Time, Low-Latency and HighPerformance Computation on Pascal. Julien BERNARD

GTC 2017 Green Flash Persistent Kernel : Real-Time, Low-Latency and HighPerformance Computation on Pascal. Julien BERNARD Green Flash Persistent Kernel : Real-Time, Low-Latency and HighPerformance Computation on Pascal Julien BERNARD Project #671662 funded by European Commission under program H2020-EU.1.2.2 coordinated in

More information

Energy efficient real-time computing for extremely large telescopes with GPU

Energy efficient real-time computing for extremely large telescopes with GPU Energy efficient real-time computing for extremely large telescopes with GPU Florian Ferreira & Damien Gratadour Observatoire de Paris & Université Paris Diderot 1 Project #671662 funded by European Commission

More information

Driving the next generation of Extremely Large Telescopes using Adaptive Optics with GPUs

Driving the next generation of Extremely Large Telescopes using Adaptive Optics with GPUs Driving the next generation of Extremely Large Telescopes using Adaptive Optics with GPUs Damien Gratadour LESIA, Observatoire de Paris Université Paris Diderot LESIA, Observatoire de Paris ANR grant ANR-12-MONU-0022

More information

Down selecting suitable manycore technologies for the ELT AO RTC. David Barr, Alastair Basden, Nigel Dipper and Noah Schwartz

Down selecting suitable manycore technologies for the ELT AO RTC. David Barr, Alastair Basden, Nigel Dipper and Noah Schwartz Down selecting suitable manycore technologies for the ELT AO RTC David Barr, Alastair Basden, Nigel Dipper and Noah Schwartz GFLOPS RTC for AO workshop 27/01/2016 AO RTC Complexity 1.E+05 1.E+04 E-ELT

More information

FPGA based microserver for high performance real-time computing in Adaptive Optics

FPGA based microserver for high performance real-time computing in Adaptive Optics FPGA based microserver for high performance real-time computing in Adaptive Optics C. Patauner a, R. Biasi a, M. Andrighettoni a, G. Angerer a, D. Pescoller a, F. Porta a, D. Gratadour b a Microgate Srl,

More information

Martin Dubois, ing. Contents

Martin Dubois, ing. Contents Martin Dubois, ing Contents Without OpenNet vs With OpenNet Technical information Possible applications Artificial Intelligence Deep Packet Inspection Image and Video processing Network equipment development

More information

dcuda: Distributed GPU Computing with Hardware Overlap

dcuda: Distributed GPU Computing with Hardware Overlap dcuda: Distributed GPU Computing with Hardware Overlap Tobias Gysi, Jeremia Bär, Lukas Kuster, and Torsten Hoefler GPU computing gained a lot of popularity in various application domains weather & climate

More information

The SPARTA Platform: Design, Status and. Adaptive Optics Systems (ESO)

The SPARTA Platform: Design, Status and. Adaptive Optics Systems (ESO) The SPARTA Platform: Design, Status and Perspectives Marcos acossuárez Valles aes Adaptive Optics Systems (ESO) msuarez@eso.orgorg SPARTA Platform Targets ESO Standard Platform for Adaptive Optics Real-Time

More information

6.9. Communicating to the Outside World: Cluster Networking

6.9. Communicating to the Outside World: Cluster Networking 6.9 Communicating to the Outside World: Cluster Networking This online section describes the networking hardware and software used to connect the nodes of cluster together. As there are whole books and

More information

Studying GPU based RTC for TMT NFIRAOS

Studying GPU based RTC for TMT NFIRAOS Studying GPU based RTC for TMT NFIRAOS Lianqi Wang Thirty Meter Telescope Project RTC Workshop Dec 04, 2012 1 Outline Tomography with iterative algorithms on GPUs Matri vector multiply approach Assembling

More information

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Maximizing heterogeneous system performance with ARM interconnect and CCIX Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Systems and software group, ARM Teratec June 2017 Intelligent flexible cloud to enable

More information

27 March 2018 Mikael Arguedas and Morgan Quigley

27 March 2018 Mikael Arguedas and Morgan Quigley 27 March 2018 Mikael Arguedas and Morgan Quigley Separate devices: (prototypes 0-3) Unified camera: (prototypes 4-5) Unified system: (prototypes 6+) USB3 USB Host USB3 USB2 USB3 USB Host PCIe root

More information

ELT-scale real-time control on Intel Xeon Phi and many core CPUs

ELT-scale real-time control on Intel Xeon Phi and many core CPUs ELT-scale real-time control on Intel Xeon Phi and many core CPUs David R. Jenkins, Alastair G. Basden, and Richard M. Myers CfAI, Department of Physics, Durham University, DH1 3LE, UK ABSTRACT The next

More information

Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory

Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Institute of Computational Science Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Juraj Kardoš (University of Lugano) July 9, 2014 Juraj Kardoš Efficient GPU data transfers July 9, 2014

More information

Energy-Efficient Data Transfers in Radio Astronomy with Software UDP RDMA Third Workshop on Innovating the Network for Data-Intensive Science, INDIS16

Energy-Efficient Data Transfers in Radio Astronomy with Software UDP RDMA Third Workshop on Innovating the Network for Data-Intensive Science, INDIS16 Energy-Efficient Data Transfers in Radio Astronomy with Software UDP RDMA Third Workshop on Innovating the Network for Data-Intensive Science, INDIS16 Przemek Lenkiewicz, Researcher@IBM Netherlands Bernard

More information

Realtime Signal Processing on Embedded GPUs

Realtime Signal Processing on Embedded GPUs Realtime Signal Processing on Embedded s Dr. Matthias Rosenthal Armin Weiss Dr. Amin Mazloumian Institute of Embedded Systems Realtime Platforms Research Group Zurich University of Applied Sciences Motivation

More information

MPI + X programming. UTK resources: Rho Cluster with GPGPU George Bosilca CS462

MPI + X programming. UTK resources: Rho Cluster with GPGPU   George Bosilca CS462 MPI + X programming UTK resources: Rho Cluster with GPGPU https://newton.utk.edu/doc/documentation/systems/rhocluster George Bosilca CS462 MPI Each programming paradigm only covers a particular spectrum

More information

The NE010 iwarp Adapter

The NE010 iwarp Adapter The NE010 iwarp Adapter Gary Montry Senior Scientist +1-512-493-3241 GMontry@NetEffect.com Today s Data Center Users Applications networking adapter LAN Ethernet NAS block storage clustering adapter adapter

More information

DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture

DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture DCS-ctrl: A Fast and Flexible ice-control Mechanism for ice-centric Server Architecture Dongup Kwon 1, Jaehyung Ahn 2, Dongju Chae 2, Mohammadamin Ajdari 2, Jaewon Lee 1, Suheon Bae 1, Youngsok Kim 1,

More information

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010 Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

Virtualization, Xen and Denali

Virtualization, Xen and Denali Virtualization, Xen and Denali Susmit Shannigrahi November 9, 2011 Susmit Shannigrahi () Virtualization, Xen and Denali November 9, 2011 1 / 70 Introduction Virtualization is the technology to allow two

More information

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration

More information

Re-architecting Virtualization in Heterogeneous Multicore Systems

Re-architecting Virtualization in Heterogeneous Multicore Systems Re-architecting Virtualization in Heterogeneous Multicore Systems Himanshu Raj, Sanjay Kumar, Vishakha Gupta, Gregory Diamos, Nawaf Alamoosa, Ada Gavrilovska, Karsten Schwan, Sudhakar Yalamanchili College

More information

Building Real-Time Professional Visualization Solutions on GPUs. Kristof Denolf Samuel Maroy Ronny Dewaele

Building Real-Time Professional Visualization Solutions on GPUs. Kristof Denolf Samuel Maroy Ronny Dewaele Building Real-Time Professional Visualization Solutions on GPUs Kristof Denolf Samuel Maroy Ronny Dewaele Page 2 Outline Barco s professional visualization solutions The need for performance portability

More information

Matrox Imaging White Paper

Matrox Imaging White Paper Reliable high bandwidth video capture with Matrox Radient Abstract The constant drive for greater analysis resolution and higher system throughput results in the design of vision systems with multiple

More information

End-to-End Adaptive Packet Aggregation for High-Throughput I/O Bus Network Using Ethernet

End-to-End Adaptive Packet Aggregation for High-Throughput I/O Bus Network Using Ethernet Hot Interconnects 2014 End-to-End Adaptive Packet Aggregation for High-Throughput I/O Bus Network Using Ethernet Green Platform Research Laboratories, NEC, Japan J. Suzuki, Y. Hayashi, M. Kan, S. Miyakawa,

More information

GPU-centric communication for improved efficiency

GPU-centric communication for improved efficiency GPU-centric communication for improved efficiency Benjamin Klenk *, Lena Oden, Holger Fröning * * Heidelberg University, Germany Fraunhofer Institute for Industrial Mathematics, Germany GPCDP Workshop

More information

I/O Systems. Jo, Heeseung

I/O Systems. Jo, Heeseung I/O Systems Jo, Heeseung Today's Topics Device characteristics Block device vs. Character device Direct I/O vs. Memory-mapped I/O Polling vs. Interrupts Programmed I/O vs. DMA Blocking vs. Non-blocking

More information

NVIDIA GPUDirect Technology. NVIDIA Corporation 2011

NVIDIA GPUDirect Technology. NVIDIA Corporation 2011 NVIDIA GPUDirect Technology NVIDIA GPUDirect : Eliminating CPU Overhead Accelerated Communication with Network and Storage Devices Peer-to-Peer Communication Between GPUs Direct access to CUDA memory for

More information

LegUp: Accelerating Memcached on Cloud FPGAs

LegUp: Accelerating Memcached on Cloud FPGAs 0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are

More information

NVM PCIe Networked Flash Storage

NVM PCIe Networked Flash Storage NVM PCIe Networked Flash Storage Peter Onufryk Microsemi Corporation Santa Clara, CA 1 PCI Express (PCIe) Mid-range/High-end Specification defined by PCI-SIG Software compatible with PCI and PCI-X Reliable,

More information

Accelerating Data Centers Using NVMe and CUDA

Accelerating Data Centers Using NVMe and CUDA Accelerating Data Centers Using NVMe and CUDA Stephen Bates, PhD Technical Director, CSTO, PMC-Sierra Santa Clara, CA 1 Project Donard @ PMC-Sierra Donard is a PMC CTO project that leverages NVM Express

More information

Introduction I/O 1. I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec

Introduction I/O 1. I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec Introduction I/O 1 I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections I/O Device Summary I/O 2 I/O System

More information

PCIe 10G SFP+ Network Card

PCIe 10G SFP+ Network Card PCIe 10G SFP+ Network Card User Manual Ver. 1.00 All brand names and trademarks are properties of their respective owners. Contents: Chapter 1: Introduction... 3 1.1 Product Introduction... 3 1.2 Features...

More information

ASTRI/CTA data analysis on parallel and low-power platforms

ASTRI/CTA data analysis on parallel and low-power platforms ICT Workshop INAF, Cefalù 2015 Universidade de São Paulo Instituto de Astronomia, Geofisica e Ciencias Atmosferica ASTRI/CTA data analysis on parallel and low-power platforms Alberto Madonna, Michele Mastropietro

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU

More information

Interconnection Network for Tightly Coupled Accelerators Architecture

Interconnection Network for Tightly Coupled Accelerators Architecture Interconnection Network for Tightly Coupled Accelerators Architecture Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato Center for Computational Sciences University of Tsukuba, Japan 1 What

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Storage and Other I/O Topics I/O Performance Measures Types and Characteristics of I/O Devices Buses Interfacing I/O Devices

More information

Storage. Hwansoo Han

Storage. Hwansoo Han Storage Hwansoo Han I/O Devices I/O devices can be characterized by Behavior: input, out, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections 2 I/O System Characteristics

More information

An Intelligent NIC Design Xin Song

An Intelligent NIC Design Xin Song 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) An Intelligent NIC Design Xin Song School of Electronic and Information Engineering Tianjin Vocational

More information

Mark Falco Oracle Coherence Development

Mark Falco Oracle Coherence Development Achieving the performance benefits of Infiniband in Java Mark Falco Oracle Coherence Development 1 Copyright 2011, Oracle and/or its affiliates. All rights reserved. Insert Information Protection Policy

More information

Advanced Computer Networks. End Host Optimization

Advanced Computer Networks. End Host Optimization Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

DESIGN AND TESTING OF GPU BASED RTC FOR TMT NFIRAOS

DESIGN AND TESTING OF GPU BASED RTC FOR TMT NFIRAOS Florence, Italy. Adaptive May 2013 Optics for Extremely Large Telescopes III ISBN: 978-88-908876-0-4 DOI: 10.12839/AO4ELT3.13172 DESIGN AND TESTING OF GPU BASED RTC FOR TMT NFIRAOS Lianqi Wang 1,a, 1 Thirty

More information

Reducing adaptive optics latency using Xeon Phi many-core processors

Reducing adaptive optics latency using Xeon Phi many-core processors doi:10.1093/mnras/stv1813 Reducing adaptive optics latency using Xeon Phi many-core processors David Barr, 1,2 Alastair Basden, 3 Nigel Dipper 3 andnoahschwartz 1 1 UK Astronomy Technology Centre, Royal

More information

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 Leading Supplier of End-to-End Interconnect Solutions Analyze Enabling the Use of Data Store ICs Comprehensive End-to-End InfiniBand and Ethernet Portfolio

More information

Asynchronous Peer-to-Peer Device Communication

Asynchronous Peer-to-Peer Device Communication 13th ANNUAL WORKSHOP 2017 Asynchronous Peer-to-Peer Device Communication Feras Daoud, Leon Romanovsky [ 28 March, 2017 ] Agenda Peer-to-Peer communication PeerDirect technology PeerDirect and PeerDirect

More information

TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS. Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016

TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS. Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016 TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016 MULTI GPU PROGRAMMING Node 0 Node 1 Node N-1 MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM

More information

Video capture using GigE Vision with MIL. What is GigE Vision

Video capture using GigE Vision with MIL. What is GigE Vision What is GigE Vision GigE Vision is fundamentally a standard for transmitting video from a camera (see Figure 1) or similar device over Ethernet and is primarily intended for industrial imaging applications.

More information

Implementing Ultra Low Latency Data Center Services with Programmable Logic

Implementing Ultra Low Latency Data Center Services with Programmable Logic Implementing Ultra Low Latency Data Center Services with Programmable Logic John W. Lockwood, CEO: Algo-Logic Systems, Inc. http://algo-logic.com Solutions@Algo-Logic.com (408) 707-3740 2255-D Martin Ave.,

More information

Parallel waveform extraction algorithms for the Cherenkov Telescope Array Real-Time Analysis

Parallel waveform extraction algorithms for the Cherenkov Telescope Array Real-Time Analysis Parallel waveform extraction algorithms for the Cherenkov Telescope Array Real-Time Analysis, a, Andrea Bulgarelli a, Adriano De Rosa a, Alessio Aboudan a, Valentina Fioretti a, Giovanni De Cesare a, Ramin

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University I/O System Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Introduction (1) I/O devices can be characterized by Behavior: input, output, storage

More information

EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT. Konstantinos Alexopoulos ECE NTUA CSLab

EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT. Konstantinos Alexopoulos ECE NTUA CSLab EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT Konstantinos Alexopoulos ECE NTUA CSLab MOTIVATION HPC, Multi-node & Heterogeneous Systems Communication with low latency

More information

European Organization for Astronomical Research in the Southern

European Organization for Astronomical Research in the Southern State-of-the-art detector controller for ESO instruments Leander H. Mehrgan, Domingo Alvarez, Dietrich Baade, Claudio Cumani, Siegfried Eschbaumer, Gert Finger, Christoph Geimer, Derek Ives, Manfred Meyer,

More information

Exploring System Coherency and Maximizing Performance of Mobile Memory Systems

Exploring System Coherency and Maximizing Performance of Mobile Memory Systems Exploring System Coherency and Maximizing Performance of Mobile Memory Systems Shanghai: William Orme, Strategic Marketing Manager of SSG Beijing & Shenzhen: Mayank Sharma, Product Manager of SSG ARM Tech

More information

Chapter 6. Storage and Other I/O Topics

Chapter 6. Storage and Other I/O Topics Chapter 6 Storage and Other I/O Topics Introduction I/O devices can be characterized by Behaviour: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections

More information

Next Generation Enterprise Solutions from ARM

Next Generation Enterprise Solutions from ARM Next Generation Enterprise Solutions from ARM Ian Forsyth Director Product Marketing Enterprise and Infrastructure Applications Processor Product Line Ian.forsyth@arm.com 1 Enterprise Trends IT is the

More information

7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT

7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT 7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT Draft Printed for SECO Murex S.A.S 2012 all rights reserved Murex Analytics Only global vendor of trading, risk management and processing systems focusing also

More information

Tightly Coupled Accelerators Architecture

Tightly Coupled Accelerators Architecture Tightly Coupled Accelerators Architecture Yuetsu Kodama Division of High Performance Computing Systems Center for Computational Sciences University of Tsukuba, Japan 1 What is Tightly Coupled Accelerators

More information

Fast End-to-End Multi-Conjugate AO Simulations Using Graphical Processing Units and the MAOS Simulation Code

Fast End-to-End Multi-Conjugate AO Simulations Using Graphical Processing Units and the MAOS Simulation Code Fast End-to-End Multi-Conjugate AO Simulations Using Graphical Processing Units and the MAOS Simulation Code Lianqi Wang 1a and Brent Ellerbroek 1 TMT Observatory Corportaion, 1111 South Arroyo Pkwy Suite

More information

Initial Evaluation of a User-Level Device Driver Framework

Initial Evaluation of a User-Level Device Driver Framework Initial Evaluation of a User-Level Device Driver Framework Stefan Götz Karlsruhe University Germany sgoetz@ira.uka.de Kevin Elphinstone National ICT Australia University of New South Wales kevine@cse.unsw.edu.au

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Building systems with GPUs is hard. Why? 2 Goal of

More information

The Missing Piece of Virtualization. I/O Virtualization on 10 Gb Ethernet For Virtualized Data Centers

The Missing Piece of Virtualization. I/O Virtualization on 10 Gb Ethernet For Virtualized Data Centers The Missing Piece of Virtualization I/O Virtualization on 10 Gb Ethernet For Virtualized Data Centers Agenda 10 GbE Adapters Built for Virtualization I/O Throughput: Virtual & Non-Virtual Servers Case

More information

High-Speed NAND Flash

High-Speed NAND Flash High-Speed NAND Flash Design Considerations to Maximize Performance Presented by: Robert Pierce Sr. Director, NAND Flash Denali Software, Inc. History of NAND Bandwidth Trend MB/s 20 60 80 100 200 The

More information

I/O 1. Devices and I/O. key concepts device registers, device drivers, program-controlled I/O, DMA, polling, disk drives, disk head scheduling

I/O 1. Devices and I/O. key concepts device registers, device drivers, program-controlled I/O, DMA, polling, disk drives, disk head scheduling I/O 1 Devices and I/O key concepts device registers, device drivers, program-controlled I/O, DMA, polling, disk drives, disk head scheduling reading Three Easy Pieces: Chapters 36-37 I/O 2 Sys/161 Device

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs ASPLOS 2013 GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications

More information

Chapter 2 Computer-System Structure

Chapter 2 Computer-System Structure Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual

More information

Messaging Overview. Introduction. Gen-Z Messaging

Messaging Overview. Introduction. Gen-Z Messaging Page 1 of 6 Messaging Overview Introduction Gen-Z is a new data access technology that not only enhances memory and data storage solutions, but also provides a framework for both optimized and traditional

More information

OpenOnload. Dave Parry VP of Engineering Steve Pope CTO Dave Riddoch Chief Software Architect

OpenOnload. Dave Parry VP of Engineering Steve Pope CTO Dave Riddoch Chief Software Architect OpenOnload Dave Parry VP of Engineering Steve Pope CTO Dave Riddoch Chief Software Architect Copyright 2012 Solarflare Communications, Inc. All Rights Reserved. OpenOnload Acceleration Software Accelerated

More information

IsoStack Highly Efficient Network Processing on Dedicated Cores

IsoStack Highly Efficient Network Processing on Dedicated Cores IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single

More information

Realtime Signal Processing on Nvidia TX2 using CUDA

Realtime Signal Processing on Nvidia TX2 using CUDA Realtime Signal Processing on Nvidia TX2 using CUDA Armin Weiss Dr. Amin Mazloumian Dr. Matthias Rosenthal Institute of Embedded Systems High Performance Multimedia Research Group Zurich University of

More information

2017 Storage Developer Conference. Mellanox Technologies. All Rights Reserved.

2017 Storage Developer Conference. Mellanox Technologies. All Rights Reserved. Ethernet Storage Fabrics Using RDMA with Fast NVMe-oF Storage to Reduce Latency and Improve Efficiency Kevin Deierling & Idan Burstein Mellanox Technologies 1 Storage Media Technology Storage Media Access

More information

RDMA and Hardware Support

RDMA and Hardware Support RDMA and Hardware Support SIGCOMM Topic Preview 2018 Yibo Zhu Microsoft Research 1 The (Traditional) Journey of Data How app developers see the network Under the hood This architecture had been working

More information

Tracing embedded heterogeneous systems

Tracing embedded heterogeneous systems Tracing embedded heterogeneous systems P R O G R E S S R E P O R T M E E T I N G, D E C E M B E R 2015 T H O M A S B E R T A U L D D I R E C T E D B Y M I C H E L D A G E N A I S December 10th 2015 TRACING

More information

An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin

An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin 1 Overview Acceleration for Storage NVMe for Acceleration How are we using (abusing ;-)) NVMe to support

More information

GPU acceleration on IB clusters. Sadaf Alam Jeffrey Poznanovic Kristopher Howard Hussein Nasser El-Harake

GPU acceleration on IB clusters. Sadaf Alam Jeffrey Poznanovic Kristopher Howard Hussein Nasser El-Harake GPU acceleration on IB clusters Sadaf Alam Jeffrey Poznanovic Kristopher Howard Hussein Nasser El-Harake HPC Advisory Council European Workshop 2011 Why it matters? (Single node GPU acceleration) Control

More information

Programmable Logic Design Grzegorz Budzyń Lecture. 15: Advanced hardware in FPGA structures

Programmable Logic Design Grzegorz Budzyń Lecture. 15: Advanced hardware in FPGA structures Programmable Logic Design Grzegorz Budzyń Lecture 15: Advanced hardware in FPGA structures Plan Introduction PowerPC block RocketIO Introduction Introduction The larger the logical chip, the more additional

More information

How Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC

How Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC How Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC Three Consortia Formed in Oct 2016 Gen-Z Open CAPI CCIX complex to rack scale memory fabric Cache coherent accelerator

More information

HA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication between GPUs

HA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication between GPUs HA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication between GPUs Yuetsu Kodama Division of High Performance Computing Systems Center for Computational Sciences University of Tsukuba,

More information

Real-Time Support for GPU. GPU Management Heechul Yun

Real-Time Support for GPU. GPU Management Heechul Yun Real-Time Support for GPU GPU Management Heechul Yun 1 This Week Topic: Real-Time Support for General Purpose Graphic Processing Unit (GPGPU) Today Background Challenges Real-Time GPU Management Frameworks

More information

Update on PRad GEMs, Readout Electronics & DAQ

Update on PRad GEMs, Readout Electronics & DAQ Update on PRad GEMs, Readout Electronics & DAQ Kondo Gnanvo University of Virginia, Charlottesville, VA Outline PRad GEMs update Upgrade of SRS electronics Integration into JLab DAQ system Cosmic tests

More information

Efficient Memory Mapped File I/O for In-Memory File Systems. Jungsik Choi, Jiwon Kim, Hwansoo Han

Efficient Memory Mapped File I/O for In-Memory File Systems. Jungsik Choi, Jiwon Kim, Hwansoo Han Efficient Memory Mapped File I/O for In-Memory File Systems Jungsik Choi, Jiwon Kim, Hwansoo Han Operations Per Second Storage Latency Close to DRAM SATA/SAS Flash SSD (~00μs) PCIe Flash SSD (~60 μs) D-XPoint

More information

NTRDMA v0.1. An Open Source Driver for PCIe NTB and DMA. Allen Hubbe at Linux Piter 2015 NTRDMA. Messaging App. IB Verbs. dmaengine.h ntb.

NTRDMA v0.1. An Open Source Driver for PCIe NTB and DMA. Allen Hubbe at Linux Piter 2015 NTRDMA. Messaging App. IB Verbs. dmaengine.h ntb. Messaging App IB Verbs NTRDMA dmaengine.h ntb.h DMA DMA DMA NTRDMA v0.1 An Open Source Driver for PCIe and DMA Allen Hubbe at Linux Piter 2015 1 INTRODUCTION Allen Hubbe Senior Software Engineer EMC Corporation

More information

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary

More information

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017 ECE 550D Fundamentals of Computer Systems and Engineering Fall 2017 Input/Output (IO) Prof. John Board Duke University Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke) IO:

More information

Application Acceleration Beyond Flash Storage

Application Acceleration Beyond Flash Storage Application Acceleration Beyond Flash Storage Session 303C Mellanox Technologies Flash Memory Summit July 2014 Accelerating Applications, Step-by-Step First Steps Make compute fast Moore s Law Make storage

More information

Gemini Observatory. Multi-Conjugate Adaptive Optics Control System. The Gemini MCAO System (ICALEPCS, Geneva, October 2005) 1

Gemini Observatory. Multi-Conjugate Adaptive Optics Control System. The Gemini MCAO System (ICALEPCS, Geneva, October 2005) 1 Gemini Observatory Multi-Conjugate Adaptive Optics Control System The Gemini MCAO System (ICALEPCS, Geneva, October 2005) 1 The Gemini MCAO System Andy Foster Observatory Sciences Ltd William James House,

More information

Research Faculty Summit Systems Fueling future disruptions

Research Faculty Summit Systems Fueling future disruptions Research Faculty Summit 2018 Systems Fueling future disruptions Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia System Challenge in Deep Learning

More information

PC-based data acquisition I

PC-based data acquisition I FYS3240 PC-based instrumentation and microcontrollers PC-based data acquisition I Spring 2016 Lecture #8 Bekkeng, 20.01.2016 General-purpose computer With a Personal Computer (PC) we mean a general-purpose

More information

The Dell Precision T3620 tower as a Smart Client leveraging GPU hardware acceleration

The Dell Precision T3620 tower as a Smart Client leveraging GPU hardware acceleration The Dell Precision T3620 tower as a Smart Client leveraging GPU hardware acceleration Dell IP Video Platform Design and Calibration Lab June 2018 H17415 Reference Architecture Dell EMC Solutions Copyright

More information

10GE network tests with UDP. Janusz Szuba European XFEL

10GE network tests with UDP. Janusz Szuba European XFEL 10GE network tests with UDP Janusz Szuba European XFEL Outline 2 Overview of initial DAQ architecture Slice test hardware specification Initial networking test results DAQ software UDP tests Summary 10GE

More information

LECTURE ON PASCAL GPU ARCHITECTURE. Jiri Kraus, November 14 th 2016

LECTURE ON PASCAL GPU ARCHITECTURE. Jiri Kraus, November 14 th 2016 LECTURE ON PASCAL GPU ARCHITECTURE Jiri Kraus, November 14 th 2016 ACCELERATED COMPUTING CPU Optimized for Serial Tasks GPU Accelerator Optimized for Parallel Tasks 2 ACCELERATED COMPUTING CPU Optimized

More information

GPUs as better MPI Citizens

GPUs as better MPI Citizens s as better MPI Citizens Author: Dale Southard, NVIDIA Date: 4/6/2011 www.openfabrics.org 1 Technology Conference 2011 October 11-14 San Jose, CA The one event you can t afford to miss Learn about leading-edge

More information

OpenMPDK and unvme User Space Device Driver for Server and Data Center

OpenMPDK and unvme User Space Device Driver for Server and Data Center OpenMPDK and unvme User Space Device Driver for Server and Data Center Open source for maximally utilizing Samsung s state-of-art Storage Solution in shorter development time White Paper 2 Target Audience

More information

QuickSpecs. HP Z 10GbE Dual Port Module. Models

QuickSpecs. HP Z 10GbE Dual Port Module. Models Overview Models Part Number: 1Ql49AA Introduction The is a 10GBASE-T adapter utilizing the Intel X722 MAC and X557-AT2 PHY pairing to deliver full line-rate performance, utilizing CAT 6A UTP cabling (or

More information

NLVMUG 16 maart Display protocols in Horizon

NLVMUG 16 maart Display protocols in Horizon NLVMUG 16 maart 2017 Display protocols in Horizon NLVMUG 16 maart 2017 Display protocols in Horizon Topics Introduction Display protocols - Basics PCoIP vs Blast Extreme Optimizing Monitoring Future Recap

More information

WearDrive: Fast and Energy Efficient Storage for Wearables

WearDrive: Fast and Energy Efficient Storage for Wearables WearDrive: Fast and Energy Efficient Storage for Wearables Reza Shisheie Cleveland State University CIS 601 Wearable Computing: A New Era 2 Wearable Computing: A New Era Notifications Fitness/Healthcare

More information

PCI Express High Speed Fabric. A complete solution for high speed system connectivity

PCI Express High Speed Fabric. A complete solution for high speed system connectivity PCI Express High Speed Fabric A complete solution for high speed system connectivity PCI Express Fabric Introduction Maximizing application performance is a combination of processing, communication, and

More information