Microsecond Latency, Real-Time, Multi-Input/Output Control using GPU Processing

1 Microsecond Latency, Real-Time, Multi-Input/Output Control using GPU Processing Nikolaus Rath March 20th, 2013 N. Rath (Columbia University) µs Latency Control using GPU Processing March 20th, / 23

2 Outline
1. Motivation
2. GPU Control System Architecture
3. Performance

4 Fusion Keeps the Sun Burning
- Nuclear fusion is the process that keeps the sun burning.
- Very hot hydrogen atoms (the "plasma") collide to form helium, releasing lots of energy.
- It would be great to replicate this on Earth: plenty of fuel is available, and there is no risk of nuclear meltdown.
- Challenges: heat things to millions of degrees (not so hard), and keep them confined (very hard).
[Figure: fusion reaction 2H + 3H -> 4He + n, with the product energies given in MeV]

5 At Millions of Degrees, Small Plasmas Evaporate Away

6 Magnetic Fields Constrain Plasma Movement to One Dimension

7 Closed Magnetic Fields Can Confine Plasmas

8 Tokamaks Confine Plasmas Using Magnetic Fields
[Figure: tokamak schematic. Orange, magenta, green: magnetic field generating coils; violet: plasma; blue: a single magnetic field line (example)]
- 1 meter radius, 1 million °C, Ampere current

9 Self-Generated Fields Cause Instabilities
- Electric currents (which generate magnetic fields) flow not just in the coils, but also in the plasma itself.
- The plasma thus modifies the fields that confine it, sometimes in a self-amplifying way: an instability.
- Typical shape: rotating, helical deformation. Timescale: 50 microseconds.

10 Only High-Speed Feedback Control Can Preserve Confinement
- Sensors detect deformations due to plasma currents.
- Control coils dynamically push back: feedback control.

11 Outline
1. Motivation
2. GPU Control System Architecture
3. Performance

12 Real-Time Performance is Determined by Latency and Sampling Period
[Figure: sample packets (S) travel from the digitizer through the GPU processing pipelines to the analog output; latency and sampling period are marked]
- Latency is the response time of the feedback system.
- Sampling period determines smoothness.
- Algorithmic complexity limits latency, not sampling period.
- Both latency and sampling period need to be on the order of a few microseconds.
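Because latency and sampling period are decoupled, several sample packets can be in flight at the same time. The following is a minimal sketch in C of that relationship, assuming each packet occupies one pipeline for the full latency. The helper name `pipelines_needed` and the nanosecond units are illustrative assumptions, not part of the talk.

```c
#include <stdint.h>

/* Number of pipelines that must run concurrently so that a new sample
 * packet can be accepted every period_ns while each packet needs
 * latency_ns to pass through the system: ceil(latency / period).
 * Hypothetical helper for illustration only. */
static uint32_t pipelines_needed(uint32_t latency_ns, uint32_t period_ns)
{
    if (period_ns == 0)
        return 0; /* no meaningful answer without a sampling period */
    return (latency_ns + period_ns - 1) / period_ns;
}
```

With a 10 µs latency and a 4 µs sampling period, for example, `pipelines_needed(10000, 4000)` evaluates to 3.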

13 Control Algorithm is Implemented in One Kernel
[Figure: two CPU/GPU timelines. Traditional: the CPU reads data, sends it to GPU memory, starts GPU kernel A, waits for it, reads the results from GPU memory, processes them and writes output data, then sends new data and repeats with GPU kernel B; the GPU only computes results a and b. Single kernel: the CPU sends parameters to GPU memory, starts the GPU kernel and waits for it; the GPU itself reads input data, processes it, computes the results and writes output data.]

14 Redundant PCIe Transfers Have to be Avoided to Reduce Latency
[Figure: Traditional: data bounces through host RAM]
- The PCIe bus has multi-GB/s throughput.
- Transfer setup takes several µs.
- Okay if data chunks are big and transfer and processing take long.
- Bad if the setup latency is longer than the transfer time.
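The trade-off above can be made concrete with a rough cost model: one transfer costs a fixed setup time plus payload size divided by bus throughput. The sketch below is in C; the function name and all numbers used with it (3 µs setup, 4 GB/s) are assumptions chosen only to match the "several µs" and "multi GB/s" orders of magnitude on the slide.

```c
/* Estimated duration of one DMA transfer in microseconds:
 * fixed setup cost plus payload size divided by bus throughput.
 * Hypothetical model for illustration; real buses add further overheads. */
static double transfer_time_us(double setup_us, double bytes,
                               double gbytes_per_s)
{
    /* 1 GB/s corresponds to 1000 bytes per microsecond. */
    return setup_us + bytes / (gbytes_per_s * 1000.0);
}
```

Under these assumed numbers, one sample of 96 channels at 16 bit (192 bytes) takes about 3.05 µs, almost all of it setup, whereas a 4 MB chunk takes about 1003 µs, where the setup cost is negligible. Every extra hop through host RAM therefore hurts small, frequent transfers the most.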

15 Redundant PCIe Transfers Have to be Avoided to Reduce Latency
[Figure: New: peer-to-peer transfers eliminate the need for a bounce buffer]
- Good performance even for small amounts of data.
- Can be implemented in software (kernel).
- The required peer-to-peer capable root complex is present in most mid- to high-end mainboards.

16 Peer-to-Peer PCIe Transfers are Set Up by Sharing BARs
[Figure: GPU (with GPU memory), A/D module and D/A module, each with its own BARs; the two modules each have a DMA controller. The A/D module writes to, and the D/A module reads from, a GPU BAR. BARs are initialized from BIOS by the CPU.]
- PCIe devices communicate via BARs in the PCI address space.
- The GPU can map part of its memory into a BAR.
- AD/DA modules can transfer to/from arbitrary PCI addresses.
- The CPU establishes communication by telling the AD/DA modules about the GPU BAR.
- This required some trickery in the past, but is officially supported as of CUDA 5.

17 Example: Userspace
    /* Allocate buffer with extra space for 64 KiB alignment */
    CUdeviceptr dev_addr;
    cuMemAlloc(&dev_addr, size + 0xFFFF);
    /* Prepare mapping */
    CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
    cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS, dev_addr);
    /* Align to 64 KiB */
    dev_addr += 0xFFFF;
    dev_addr &= ~(CUdeviceptr)0xFFFF;
    /* Call custom kernel module to get bus address; fd refers to the open device file */
    struct rdma_info s;
    s.dev_addr = dev_addr;
    s.p2ptoken = tokens.p2pToken;
    s.vaspacetoken = tokens.vaSpaceToken;
    s.size = size;
    ioctl(fd, RDMA_TRANSLATE_TOKEN, &s);
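The over-allocate-then-round-up step in the userspace example can be checked in isolation. This is a small sketch in plain C (no CUDA required) of the same arithmetic; `align_up_64k` and `GPU_PAGE_SIZE` are names introduced here for illustration.

```c
#include <stdint.h>

#define GPU_PAGE_SIZE 0x10000ULL /* 64 KiB */

/* Round addr up to the next multiple of 64 KiB. Adding 0xFFFF and then
 * clearing the low 16 bits is the same arithmetic as in the slide.
 * Hypothetical helper name. */
static uint64_t align_up_64k(uint64_t addr)
{
    return (addr + (GPU_PAGE_SIZE - 1)) & ~(GPU_PAGE_SIZE - 1);
}
```

The aligned address is never more than 0xFFFF bytes past the original one, which is why allocating `size + 0xFFFF` bytes always leaves `size` usable bytes after alignment.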

18 Example: Kernelspace
    long rtm_t_dma_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
    {
        nvidia_p2p_page_table_t *page_table;
        // ...
        switch (cmd) {
        case RDMA_TRANSLATE_TOKEN: {
            /* Fetch the tokens and device address passed in from userspace */
            COPY_FROM_USER(&rdma_info, varg, sizeof(struct rdma_info));
            /* Pin the GPU memory and obtain its page table */
            nvidia_p2p_get_pages(rdma_info.p2ptoken, rdma_info.vaspacetoken,
                                 rdma_info.dev_addr, rdma_info.size,
                                 &page_table, rdma_free_callback, tdev);
            /* Bus address of the first page goes back to userspace */
            rdma_info.bus_addr = page_table->pages[0]->physical_address;
            COPY_TO_USER(varg, &rdma_info, sizeof(struct rdma_info));
            return 0;
        }
        // Other ioctls
        }
    }

19 Userspace Continued
    /* Call custom kernel module to get bus address; fd refers to the open device file */
    rtm_t_rdma_info s;
    s.dev_addr = dev_addr;
    ioctl(fd, RTM_T_TRANSLATE_TOKEN, &s);
    /* Retrieve bus address */
    uint64_t bus_addr;
    bus_addr = s.bus_addr;
    /* Send bus address to digitizer */
    init_rtm_t(bus_addr, other, stuff, here);
    // Start GPU kernel
    // Kernel polls input buffer
    // Wait for kernel to complete
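The "kernel polls input buffer" step can be illustrated with a host-side sketch in plain C. It assumes the writer increments a sequence counter with each new packet; the slides do not describe the actual packet layout of the digitizer, so `struct sample_packet`, its `seq` field and `poll_packet` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical layout of one sample packet in the shared buffer.
 * The real digitizer format is not given in the slides. */
struct sample_packet {
    volatile uint32_t seq;    /* assumed: incremented for every new packet */
    volatile int16_t  ch[96]; /* one 16-bit value per input channel */
};

/* Check once whether a new packet has arrived since the last call.
 * Returns 1 and updates *last_seq if so, 0 otherwise. A polling loop
 * would call this repeatedly until it returns 1. */
static int poll_packet(const struct sample_packet *buf, uint32_t *last_seq)
{
    uint32_t seq = buf->seq; /* volatile read: re-fetched on every call */
    if (seq == *last_seq)
        return 0;
    *last_seq = seq;
    return 1;
}
```

A real implementation would additionally have to ensure that the channel data is visible before the counter changes; that ordering depends on the hardware and is not modeled here.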

20 Outline
1. Motivation
2. GPU Control System Architecture
3. Performance

21 The HBT-EP Plasma Control System was Built with Commodity Hardware
Hardware:
- Workstation PC
- NVIDIA GeForce GTX 580
- D-TACQ ACQ196 A-D converter (96 channels, 16 bit)
- 2 D-TACQ AO32CPCI D-A converters (2 x 32 channels, 16 bit)
- Standard Linux host system (no real-time kernel required!)

22 P2P Transfers Reduce Latency by 50%
[Figure: latency [µs] versus sampling period [µs], for GPU RAM and host RAM]
- Optimal latency when using host memory: 16 µs
- Optimal latency when using GPU memory: 10 µs
- A 50% difference does not mean having to wait twice as long; it is the difference between things blowing up or not.

23 GPU Beats CPU in Computational and Real-Time Performance even in the Microsecond Regime
- Performance tested with repeated matrix application.
- GPU beats CPU down to 5 µs.
- Missed samples counted in 1000 runs.
- Missed samples with GPU: none; with CPU: up to 2.5%.
[Figure: axes Sampling Period [µs], Count, Matrix Size, Missed Samples [%]; series GPU and CPU]
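The benchmark workload, applying a matrix to the input vector once per sample, can be sketched as a plain matrix-vector product. The version below is a straightforward CPU reference in C, not the code used for the measurements; the row-major layout and the name `mat_vec` are assumptions.

```c
#include <stddef.h>

/* y = M * x for a row-major matrix M with `rows` rows and `cols` columns.
 * In a control loop, x would hold the sensor inputs and y the outputs.
 * Reference sketch only; not the benchmark code from the talk. */
static void mat_vec(const float *m, const float *x, float *y,
                    size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c)
            acc += m[r * cols + c] * x[c];
        y[r] = acc;
    }
}
```

Each output row is independent of the others, which is what makes this operation map naturally onto many GPU threads.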

24 Summary
1. The advantages of GPUs are not restricted to large problems requiring long calculations.
2. Even when processing kB-sized batches under microsecond latency constraints, GPUs can be faster than CPUs, while at the same time offering better real-time performance.
3. In these regimes, data transfer overhead becomes the dominating factor, and using peer-to-peer transfers improves performance by more than 50%.
4. A GPU-based real-time control system has been developed at Columbia University and tested for the control of magnetically confined plasmas. Contact us for details.

25 Outline
4. Backup Slides

26 Latency and Sampling Period are Measured Experimentally by Copying Square Waves
[Figure: voltage versus time [µs] for one shot, showing control input, control output and sample clock, with markers A and B]
- Control algorithm set up to copy input to output 1:1.
- Blue trace is the input square wave.
- Green trace is the output square wave.
- Output lags behind input by the control system latency.
- Red trace is the sampling interval (sampling on the downward edge).

27 Plasma Physics Results: Dominant Mode Amplitude Reduced by up to 60%
[Figure: amplitude versus frequency [kHz], without feedback (No FB) and with feedback gains g=144 and g=577]

29 Feedback Control Uses Measurements to Determine Control Signals
[Figure: feedback loop. Input -> Controller -> control signal / control output -> Actuators -> physical interaction -> System -> output; System -> physical interaction -> Sensors -> measurements / control input -> Controller]
- Goal: keep the system in a specific state.
- If the system is perfectly known, the required control signals can be calculated in advance (open-loop control).
- In practice, measurements are needed to determine the effects and update the signals: feedback control.
- A control system acquires measurements, performs computations, and generates control output to manipulate the system state.
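The acquire, compute, output cycle can be illustrated with a toy loop in C. It assumes a one-dimensional system whose deviation grows by a fixed factor per step and a purely proportional controller; both are simplifications chosen for illustration and are not the plasma model or control law used on HBT-EP.

```c
/* Proportional feedback: the control signal opposes the measured
 * deviation from the target state. */
static double control_signal(double gain, double target, double measurement)
{
    return gain * (target - measurement);
}

/* Toy plant (assumption): without control the state grows by `growth`
 * per step, mimicking a self-amplifying instability. */
static double plant_step(double state, double growth, double control)
{
    return growth * state + control;
}

/* Run `steps` iterations of measure -> compute -> actuate with target 0
 * and return the final state. */
static double run_loop(double state, double growth, double gain, int steps)
{
    for (int i = 0; i < steps; ++i)
        state = plant_step(state, growth, control_signal(gain, 0.0, state));
    return state;
}
```

With a growth factor of 1.1 and no feedback (`gain = 0`), an initial deviation of 1 more than doubles within 10 steps; with `gain = 0.5` the effective factor drops to 0.6 per step and the deviation decays.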

30 Data Passthrough Establishes 8 µs Lower Latency Limit
[Figure: latency [µs] versus sampling period [µs], for GPU RAM and host RAM]
- Control system uses the same buffer to write input and read output.
- No GPU processing, so no difference between host and GPU memory.
- Jump: 4 µs required for A-D conversion and data push.
- Offset: 4 µs required for data pull and D-A conversion.

Abstract. * Supported by U.S. D.O.E. Grant DE-FG02-96ER M.W. Bongard, APS-DPP, Denver, CO, October 2005

Abstract. * Supported by U.S. D.O.E. Grant DE-FG02-96ER M.W. Bongard, APS-DPP, Denver, CO, October 2005 Abstract The Phase II PEGASUS ST experiment includes fully programmable power supplies for all magnet coils. These will be integrated with a digital feedback plasma control system (PCS), based on the PCS

More information

A new architecture for real-time control in RFX-mod G. Manduchi, A. Barbalace Big Physics Symposium 1/16

A new architecture for real-time control in RFX-mod G. Manduchi, A. Barbalace Big Physics Symposium 1/16 A new architecture for real-time control in RFX-mod G. Manduchi, A. Barbalace 2011 Big Physics Symposium 1/16 Current RFX control system MHD mode control Plasma position control Toroidal field control

More information

Spring 2017 :: CSE 506. Device Programming. Nima Honarmand

Spring 2017 :: CSE 506. Device Programming. Nima Honarmand Device Programming Nima Honarmand read/write interrupt read/write Spring 2017 :: CSE 506 Device Interface (Logical View) Device Interface Components: Device registers Device Memory DMA buffers Interrupt

More information

Input / Output. Kevin Webb Swarthmore College April 12, 2018

Input / Output. Kevin Webb Swarthmore College April 12, 2018 Input / Output Kevin Webb Swarthmore College April 12, 2018 xkcd #927 Fortunately, the charging one has been solved now that we've all standardized on mini-usb. Or is it micro-usb? Today s Goals Characterize

More information

Storage. Hwansoo Han

Storage. Hwansoo Han Storage Hwansoo Han I/O Devices I/O devices can be characterized by Behavior: input, out, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections 2 I/O System Characteristics

More information

Asynchronous Peer-to-Peer Device Communication

Asynchronous Peer-to-Peer Device Communication 13th ANNUAL WORKSHOP 2017 Asynchronous Peer-to-Peer Device Communication Feras Daoud, Leon Romanovsky [ 28 March, 2017 ] Agenda Peer-to-Peer communication PeerDirect technology PeerDirect and PeerDirect

More information

I/O Management Intro. Chapter 5

I/O Management Intro. Chapter 5 I/O Management Intro Chapter 5 1 Learning Outcomes A high-level understanding of the properties of a variety of I/O devices. An understanding of methods of interacting with I/O devices. 2 I/O Devices There

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU

More information

I/O Systems (3): Clocks and Timers. CSE 2431: Introduction to Operating Systems

I/O Systems (3): Clocks and Timers. CSE 2431: Introduction to Operating Systems I/O Systems (3): Clocks and Timers CSE 2431: Introduction to Operating Systems 1 Outline Clock Hardware Clock Software Soft Timers 2 Two Types of Clocks Simple clock: tied to the 110- or 220-volt power

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Module 11: I/O Systems

Module 11: I/O Systems Module 11: I/O Systems Reading: Chapter 13 Objectives Explore the structure of the operating system s I/O subsystem. Discuss the principles of I/O hardware and its complexity. Provide details on the performance

More information

Operating Systems. File Systems. Thomas Ropars.

Operating Systems. File Systems. Thomas Ropars. 1 Operating Systems File Systems Thomas Ropars thomas.ropars@univ-grenoble-alpes.fr 2017 2 References The content of these lectures is inspired by: The lecture notes of Prof. David Mazières. Operating

More information

EN1640: Design of Computing Systems Topic 07: I/O

EN1640: Design of Computing Systems Topic 07: I/O EN1640: Design of Computing Systems Topic 07: I/O Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring 2017 [ material

More information

Accelerating Storage with NVM Express SSDs and P2PDMA Stephen Bates, PhD Chief Technology Officer

Accelerating Storage with NVM Express SSDs and P2PDMA Stephen Bates, PhD Chief Technology Officer Accelerating Storage with NVM Express SSDs and P2PDMA Stephen Bates, PhD Chief Technology Officer 2018 Storage Developer Conference. Eidetic Communications Inc. All Rights Reserved. 1 Outline Motivation

More information

I/O Devices. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau)

I/O Devices. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau) I/O Devices Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau) Hardware Support for I/O CPU RAM Network Card Graphics Card Memory Bus General I/O Bus (e.g., PCI) Canonical Device OS reads/writes

More information

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017 ECE 550D Fundamentals of Computer Systems and Engineering Fall 2017 Input/Output (IO) Prof. John Board Duke University Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke) IO:

More information

Predictive Runtime Code Scheduling for Heterogeneous Architectures

Predictive Runtime Code Scheduling for Heterogeneous Architectures Predictive Runtime Code Scheduling for Heterogeneous Architectures Víctor Jiménez, Lluís Vilanova, Isaac Gelado Marisa Gil, Grigori Fursin, Nacho Navarro HiPEAC 2009 January, 26th, 2009 1 Outline Motivation

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory

Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Institute of Computational Science Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Juraj Kardoš (University of Lugano) July 9, 2014 Juraj Kardoš Efficient GPU data transfers July 9, 2014

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Storage and Other I/O Topics I/O Performance Measures Types and Characteristics of I/O Devices Buses Interfacing I/O Devices

More information

Alcator C-Mod Digital Plasma Control System Presented by: S. Wolfe, J. Stillerman, M. Ferrara, T. Fredian, I. Hutchinson

Alcator C-Mod Digital Plasma Control System Presented by: S. Wolfe, J. Stillerman, M. Ferrara, T. Fredian, I. Hutchinson C-Mod Digital Plasma Control System Presented by: S. Wolfe, J. Stillerman, M. Ferrara, T. Fredian, I. Hutchinson APS DPP05 Denver, CO Oct. 26, 2005 Abstract A new digital plasma control system (DPCS) has

More information

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed

More information

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte

More information

Extreme Storage Performance with exflash DIMM and AMPS

Extreme Storage Performance with exflash DIMM and AMPS Extreme Storage Performance with exflash DIMM and AMPS 214 by 6East Technologies, Inc. and Lenovo Corporation All trademarks or registered trademarks mentioned here are the property of their respective

More information

ECEN 449 Microprocessor System Design. Hardware-Software Communication. Texas A&M University

ECEN 449 Microprocessor System Design. Hardware-Software Communication. Texas A&M University ECEN 449 Microprocessor System Design Hardware-Software Communication 1 Objectives of this Lecture Unit Learn basics of Hardware-Software communication Memory Mapped I/O Polling/Interrupts 2 Motivation

More information

DEVELOPING A LINUX KERNEL MODULE USING RDMA FOR GPUDIRECT

DEVELOPING A LINUX KERNEL MODULE USING RDMA FOR GPUDIRECT DEVELOPING A LINUX KERNEL MODULE USING RDMA FOR GPUDIRECT TB-06712-001 _v8.0 September 2016 Application Guide TABLE OF CONTENTS Chapter 1. Overview... 1 1.1. How RDMA Works...2 1.2. Standard DMA Transfer...2

More information

The Power of Batching in the Click Modular Router

The Power of Batching in the Click Modular Router The Power of Batching in the Click Modular Router Joongi Kim, Seonggu Huh, Keon Jang, * KyoungSoo Park, Sue Moon Computer Science Dept., KAIST Microsoft Research Cambridge, UK * Electrical Engineering

More information

Implementation of the Pegasus Digital Plasma Control System

Implementation of the Pegasus Digital Plasma Control System Implementation of the Pegasus Digital Plasma Control System M.W. Bongard, D.J. Battaglia, R.J. Fonck, G.D. Garstka, B.T. Lewicki, B.J. Squires, E.A. Unterberg Abstract A primary goal of the Phase II PEGASUS

More information

PC-based data acquisition I

PC-based data acquisition I FYS3240 PC-based instrumentation and microcontrollers PC-based data acquisition I Spring 2016 Lecture #8 Bekkeng, 20.01.2016 General-purpose computer With a Personal Computer (PC) we mean a general-purpose

More information

SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs. Shai Bergman Tanya Brokhman Tzachi Cohen Mark Silberstein

SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs. Shai Bergman Tanya Brokhman Tzachi Cohen Mark Silberstein : Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and s Shai Bergman Tanya Brokhman Tzachi Cohen Mark Silberstein What do we do? Enable efficient file I/O for s Why? Support diverse

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Building systems with GPUs is hard. Why? 2 Goal of

More information

Spring 2009 Prof. Hyesoon Kim

Spring 2009 Prof. Hyesoon Kim Spring 2009 Prof. Hyesoon Kim Application Geometry Rasterizer CPU Each stage cane be also pipelined The slowest of the pipeline stage determines the rendering speed. Frames per second (fps) Executes on

More information

Best Practices for Deploying and Managing GPU Clusters

Best Practices for Deploying and Managing GPU Clusters Best Practices for Deploying and Managing GPU Clusters Dale Southard, NVIDIA dsouthard@nvidia.com About the Speaker and You [Dale] is a senior solution architect with NVIDIA (I fix things). I primarily

More information

[08] IO SUBSYSTEM 1. 1

[08] IO SUBSYSTEM 1. 1 [08] IO SUBSYSTEM 1. 1 OUTLINE Input/Output (IO) Hardware Device Classes OS Interfaces Performing IO Polled Mode Interrupt Driven Blocking vs Non-blocking Handling IO Buffering & Strategies Other Issues

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34 1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions

More information

Re-architecting Virtualization in Heterogeneous Multicore Systems

Re-architecting Virtualization in Heterogeneous Multicore Systems Re-architecting Virtualization in Heterogeneous Multicore Systems Himanshu Raj, Sanjay Kumar, Vishakha Gupta, Gregory Diamos, Nawaf Alamoosa, Ada Gavrilovska, Karsten Schwan, Sudhakar Yalamanchili College

More information

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) I/O Systems Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) I/O Systems 1393/9/15 1 / 57 Motivation Amir H. Payberah (Tehran

More information

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical

More information

Realtime Signal Processing on Embedded GPUs

Realtime Signal Processing on Embedded GPUs Realtime Signal Processing on Embedded s Dr. Matthias Rosenthal Armin Weiss Dr. Amin Mazloumian Institute of Embedded Systems Realtime Platforms Research Group Zurich University of Applied Sciences Motivation

More information

CDA3101 Recitation Section 13

CDA3101 Recitation Section 13 CDA3101 Recitation Section 13 Storage + Bus + Multicore and some exam tips Hard Disks Traditional disk performance is limited by the moving parts. Some disk terms Disk Performance Platters - the surfaces

More information

vs. GPU Performance Without the Answer University of Virginia Computer Engineering g Labs

vs. GPU Performance Without the Answer University of Virginia Computer Engineering g Labs Where is the Data? Why you Cannot Debate CPU vs. GPU Performance Without the Answer Chris Gregg and Kim Hazelwood University of Virginia Computer Engineering g Labs 1 GPUs and Data Transfer GPU computing

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Chapter 8 Virtual Memory What are common with paging and segmentation are that all memory addresses within a process are logical ones that can be dynamically translated into physical addresses at run time.

More information

Chapter 13: I/O Systems. Operating System Concepts 9 th Edition

Chapter 13: I/O Systems. Operating System Concepts 9 th Edition Chapter 13: I/O Systems Silberschatz, Galvin and Gagne 2013 Chapter 13: I/O Systems Overview I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations

More information

CSE 120. Overview. July 27, Day 8 Input/Output. Instructor: Neil Rhodes. Hardware. Hardware. Hardware

CSE 120. Overview. July 27, Day 8 Input/Output. Instructor: Neil Rhodes. Hardware. Hardware. Hardware CSE 120 July 27, 2006 Day 8 Input/Output Instructor: Neil Rhodes How hardware works Operating Systems Layer What the kernel does API What the programmer does Overview 2 Kinds Block devices: read/write

More information

Using Time Division Multiplexing to support Real-time Networking on Ethernet

Using Time Division Multiplexing to support Real-time Networking on Ethernet Using Time Division Multiplexing to support Real-time Networking on Ethernet Hariprasad Sampathkumar 25 th January 2005 Master s Thesis Defense Committee Dr. Douglas Niehaus, Chair Dr. Jeremiah James,

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

Studying GPU based RTC for TMT NFIRAOS

Studying GPU based RTC for TMT NFIRAOS Studying GPU based RTC for TMT NFIRAOS Lianqi Wang Thirty Meter Telescope Project RTC Workshop Dec 04, 2012 1 Outline Tomography with iterative algorithms on GPUs Matri vector multiply approach Assembling

More information

Data Storage and Query Answering. Data Storage and Disk Structure (2)

Data Storage and Query Answering. Data Storage and Disk Structure (2) Data Storage and Query Answering Data Storage and Disk Structure (2) Review: The Memory Hierarchy Swapping, Main-memory DBMS s Tertiary Storage: Tape, Network Backup 3,200 MB/s (DDR-SDRAM @200MHz) 6,400

More information

Windowing System on a 3D Pipeline. February 2005

Windowing System on a 3D Pipeline. February 2005 Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April

More information

CS/ECE 217. GPU Architecture and Parallel Programming. Lecture 16: GPU within a computing system
