Building Real-Time Professional Visualization Solutions on GPUs
Kristof Denolf, Samuel Maroy, Ronny Dewaele
3 Outline
Barco's professional visualization solutions
The need for performance portability
Real PCIe Data Rates to/from GPU
Transfer Only (e.g. the bandwidth test)
Transfers with parallel GPU Compute/Rendering
Comparison of CUDA, OpenCL, OpenGL and GPUdirect for video data rates
The cost of OpenGL/CL (or CUDA) interoperability
Towards partial transfers to reduce the latency
Conclusions
4 Company structure
Four core divisions, five wholly-owned ventures
Core divisions: Entertainment; Healthcare; Control rooms & Simulation; Defense & Aerospace
Ventures: Digital signage; Lighting; LED; ATM software; Design services
5 Healthcare
Supporting healthcare professionals a billion times a year
6 Control Rooms
Helping over 2.5 billion commuters get home safely every day
7 Media & Entertainment
Setting the scene for over 2,500 gigs and shows every year
8 Professional Visualization
High quality
High resolutions
Multiple sources
True colours
Low latency
Perfect calibration
Synchronization
9 OpenCL as Initial Answer for Portability
OpenCL for GPU and multi-core CPU programming of image processing chains
OpenCL for GPU-accelerated prototypes of new algorithms
10 Portability also Towards FPGA Design
[Desh Singh, presented at DATE 2011 and FPGA 2011 Pre-Conference Workshop]
[Altera news: San Jose, Calif., November 15, 2011]
12 Ideal Data Transfer has Highest Rate and Virtually no Compute Impact
Timeline without overlap: in (n), GPUproc (n), out (n)
Timeline with overlap: in (n+1), GPUproc (n), out (n-1)
CPU: asynchronous, pinned CPU memory
GPU: parallel transfer, highest rate, maximize GPU compute & transfer time
[Diagram: CPU and its DRAM connected over the PCIe bus to a Quadro graphics card with its own DRAM and two copy engines]
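The benefit of overlapping in (n+1), GPUproc (n) and out (n-1) can be estimated with a simple timing model. The sketch below is not taken from the slides: it assumes three stages with fixed durations, two copy engines that never stall each other, and no synchronization gaps. The function names and the example durations are hypothetical.

```python
def serial_frame_time(t_in, t_proc, t_out):
    """Per-frame time when upload, kernel and readback run back to back."""
    return t_in + t_proc + t_out


def overlapped_frame_time(t_in, t_proc, t_out):
    """Steady-state per-frame time when the three stages fully overlap.

    The slowest stage sets the pace of the pipeline.
    """
    return max(t_in, t_proc, t_out)


def pipeline_total_time(n_frames, t_in, t_proc, t_out):
    """Total time for n_frames through an ideal three-stage pipeline.

    The first frame pays the full latency; every later frame adds one
    period of the slowest stage.
    """
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    fill = serial_frame_time(t_in, t_proc, t_out)
    return fill + (n_frames - 1) * overlapped_frame_time(t_in, t_proc, t_out)


if __name__ == "__main__":
    # Hypothetical stage durations in milliseconds (transfer dominated).
    t_in, t_proc, t_out = 2, 1, 2
    print("serial per frame:    ", serial_frame_time(t_in, t_proc, t_out))
    print("overlapped per frame:", overlapped_frame_time(t_in, t_proc, t_out))
    print("60 frames, pipelined:", pipeline_total_time(60, t_in, t_proc, t_out))
```

In a transfer-dominated case like the one above, the model shows why the kernel time is effectively hidden behind the copies, and why any gap in the copy stream directly lowers the sustained frame rate.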
13 (Over) Peak Data Rates Highest for Direct Transfers from/to Pinned Host Memory
[Two plots, oclbandwidthtest and testtransferspeed (OpenCL): transfer rate (MBps) versus transfer size (MB). Series: Cpu2Gpu and Gpu2Cpu, each for pinned direct, pinned mapped and paged transfers.]
All tests done on Q3000M on PCIe x16 Gen2 (GPUdirect on Q4000)
14 Other Transfers Sustain a Similar Rate
[Plot, OpenCL: transfer rate (MBps) versus transfer size (MB). Series: Cpu2Gpu and Gpu2Cpu, each for buffer, image, buffergl and imagegl.]
CL buffers, images and GL interoperable variants are similar
Choose the most appropriate CL memory type
Efficiency: > 4 GBps from 480p (1.3 MB), > 4.8 GBps from 720p (3.5 MB)
All numbers for RGBA
Write to GPU: p60, Read from GPU: p60
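The transfer sizes quoted for 480p and 720p follow from the RGBA frame size. The sketch below reproduces them under stated assumptions that are not in the slides: 480p is taken as 720x480, 720p as 1280x720, RGBA as 4 bytes per pixel, the "MB" of the slides as 2^20 bytes, and "GBps" as 10^9 bytes per second. All function names are hypothetical.

```python
MIB = 1024 * 1024  # assumption: the slides' "MB" means 2**20 bytes


def frame_bytes(width, height, bytes_per_pixel=4):
    """Size of one uncompressed frame; RGBA is assumed to be 4 bytes/pixel."""
    return width * height * bytes_per_pixel


def frame_mib(width, height, bytes_per_pixel=4):
    """Frame size expressed in MiB."""
    return frame_bytes(width, height, bytes_per_pixel) / MIB


def transfer_time_ms(num_bytes, rate_bytes_per_s):
    """Time to move num_bytes at a sustained rate, in milliseconds."""
    return num_bytes / rate_bytes_per_s * 1000.0


if __name__ == "__main__":
    # Assumed resolutions for the two formats named on the slide.
    for name, (w, h) in {"480p": (720, 480), "720p": (1280, 720)}.items():
        print(name, round(frame_mib(w, h), 1), "MiB")
    # One 480p frame at the quoted 4 GBps (assumed decimal).
    print(transfer_time_ms(frame_bytes(720, 480), 4e9), "ms")
```

Under these assumptions the two frame sizes round to the 1.3 MB and 3.5 MB quoted on the slide.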
15 OpenCL/CUDA Transfers with Parallel Compute (Transfer Dominated)
in (n+1), GPUproc (n), out (n-1)
GPU dual copy engines working
GPU compute in parallel with data transfers
Still some gaps present
[Two timelines: OpenCL and CUDA]
16 Throughput Impact Related to Kernel Duration
[Plot, OpenCL: transfer rate (MBps) versus transfer size (MB). Series: peak transfer, GPU parallel.]
Efficiency (OpenCL): > 3.2 GBps from 480p (1.3 MB), > 3.4 GBps from 720p (3.5 MB)
All numbers for RGBA
Write to GPU: p60, Read from GPU: p60
Note that maximizing the GPU compute time is also hampered
17 CUDA and GPUdirect Achieve Highest Peak Rate
[Plot: transfer rate (MBps) versus transfer size (MB). Series: Cpu2Gpu and Gpu2Cpu, each for OpenCL, OpenGL, GPUdirect and CUDA.]
Efficiency:
CUDA transfers boost to 6 GBps
DVP read from GPU up to 7.5 GBps for very large transfers
Other: all around 5.2 GBps
How to get 6 GBps for all programming models?
19 CL/GL Interoperability Hampers Parallelism
[Plot, OpenCL: transfer rate (MBps) versus transfer size (MB). Series: peak transfer, GPU parallel.]
20 CUDA / GL Interoperability not Trivial
[Two panels: no GL rendering, with GL rendering]
21 Return to OpenGL, render on full HD screen (1/2)
[Plot, OpenGL: transfer rate (MBps) versus transfer size (MB). Series: peak transfer, GPU parallel.]
22 Return to OpenGL, Readback to CPU Memory (2/2)
[Plot, OpenGL: transfer rate (MBps) versus transfer size (MB). Series: peak transfer, GPU parallel.]
23 to Avoid Interoperability Issue
9 HD 1080p in at 60 fps: 4.5 GBps
Parallel rendering
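The 4.5 GBps figure is consistent with nine uncompressed 1080p streams at 60 fps. The check below is a sketch under assumptions not stated on this slide: each pixel is 4 bytes (RGBA, as on the earlier slides), HD 1080p means 1920x1080, and GBps means 10^9 bytes per second. The function name is hypothetical.

```python
def aggregate_rate_gbps(streams, width, height, fps, bytes_per_pixel=4):
    """Combined data rate of several uncompressed video streams.

    Returns decimal gigabytes per second (10**9 bytes/s).
    """
    bytes_per_second = streams * width * height * bytes_per_pixel * fps
    return bytes_per_second / 1e9


if __name__ == "__main__":
    # Nine 1080p60 RGBA inputs, as named on the slide.
    print(round(aggregate_rate_gbps(9, 1920, 1080, 60), 2), "GBps")
```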
25 Partial Image Transfers for Low Latency
1/8 HD (1 MB) transfer size has a reasonable rate (certainly for CUDA)
Concurrent partial update of the same image?
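The latency gain from partial transfers can be estimated by comparing when the first data becomes available on the GPU. The sketch below assumes a 1920x1080 RGBA frame split into equal horizontal slices and, optimistically, the same sustained rate for a slice as for a full frame. The rate used in the example is hypothetical, not a measured value from the slides, and so are the function names.

```python
def slice_bytes(width, height, slices, bytes_per_pixel=4):
    """Size of one horizontal slice when a frame is split evenly by rows."""
    if height % slices != 0:
        raise ValueError("height must be divisible by the number of slices")
    return width * (height // slices) * bytes_per_pixel


def first_data_latency_ms(width, height, slices, rate_bytes_per_s,
                          bytes_per_pixel=4):
    """Time until the first slice has fully arrived, in milliseconds.

    Assumes the slice is transferred at the given sustained rate.
    """
    nbytes = slice_bytes(width, height, slices, bytes_per_pixel)
    return nbytes / rate_bytes_per_s * 1000.0


if __name__ == "__main__":
    rate = 5e9  # hypothetical sustained rate in bytes per second
    print("slice size (bytes):", slice_bytes(1920, 1080, 8))
    print("full frame (ms):   ", first_data_latency_ms(1920, 1080, 1, rate))
    print("1/8 frame (ms):    ", first_data_latency_ms(1920, 1080, 8, rate))
```

With eight slices each transfer is about 1 MB, matching the slide, and processing could in principle start after one eighth of the full-frame transfer time. A lower achieved rate at small transfer sizes would reduce that gain.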
26 Conclusions
Barco's professional visualization requires high quality, high resolution and multiple sources
Barco's professional visualization desires portability
DMA-enabled and fully parallel data transfers are essential
Mind the gap: peak data rates cannot be achieved continuously
CL or CUDA / GL interoperability is difficult