CS/ECE 217. GPU Architecture and Parallel Programming. Lecture 16: GPU within a computing system


Objective
- To understand the major factors that dictate performance when using the GPU as a compute co-processor for the CPU
  - The speeds and feeds of the traditional CPU world
  - The speeds and feeds when employing a GPU
- To form a solid knowledge base for performance programming on modern GPUs

Review: Typical Structure of a CUDA Program
- Global variable declarations
- Function prototypes
  - __global__ void kernelOne(...)
- main()
  - Allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes)
  - Transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_GlblVarPtr, ...)
  - Execution configuration setup
  - Kernel call: kernelOne<<<execution configuration>>>(args...); repeat as needed
  - Transfer results from device to host: cudaMemcpy(h_GlblVarPtr, ...)
  - Optional: compare against a golden (host-computed) solution
- Kernel: __global__ void kernelOne(type args, ...)
  - Variable declarations: local, shared
  - Automatic variables are transparently assigned to registers or local memory
  - __syncthreads()
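For reference, a minimal, self-contained sketch of a program with this structure is shown below; the kernel name kernelOne follows the slide, while the array size and the simple element-wise increment it performs are illustrative assumptions, not part of the lecture.

#include <cuda_runtime.h>
#include <stdio.h>

// Kernel: each thread increments one element of the array (illustrative work).
__global__ void kernelOne(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] += 1.0f;
}

int main(void) {
    const int n = 1 << 20;                      // arbitrary problem size
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // Allocate memory space on the device
    float *d_data;
    cudaMalloc(&d_data, bytes);

    // Transfer data from host to device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Execution configuration setup and kernel call
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    kernelOne<<<grid, block>>>(d_data, n);

    // Transfer results from device to host
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    // Optional: compare against a golden (host-computed) solution here
    printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}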

Bandwidth Gravity of Modern Computer Systems
- The bandwidth between key components ultimately dictates system performance
  - Especially true for massively parallel systems processing massive amounts of data
  - Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
  - Ultimately, performance falls back to what the speeds and feeds dictate

Classic PC Architecture
- The northbridge connects the three components that must communicate at high speed: CPU, DRAM, and video
  - Video also needs first-class access to DRAM
  - Earlier NVIDIA cards connected through AGP, with up to 2 GB/s transfers
- The southbridge serves as a concentrator for slower I/O devices

(Original) PCI Bus Specification
- Connected to the southbridge
  - Originally 33 MHz, 32 bits wide, 132 MB/s peak transfer rate
  - More recently 66 MHz, 64 bits wide, 528 MB/s peak
  - Upstream bandwidth remained slow for devices (~256 MB/s peak)
- Shared bus with arbitration
  - The winner of arbitration becomes the bus master and can connect to the CPU or DRAM through the southbridge and northbridge

PCI as Memory-Mapped I/O
- PCI device registers are mapped into the CPU's physical address space
  - Accessed through loads/stores (kernel mode)
- Addresses are assigned to the PCI devices at boot time
  - All devices listen for their addresses
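As a rough sketch of what kernel-mode, memory-mapped register access looks like in C, the fragment below reads and writes 32-bit registers through a volatile pointer into an already-mapped MMIO window; the base pointer, register offsets, and names are hypothetical, not taken from any real device.

#include <stdint.h>

// Hypothetical register offsets inside an MMIO window that the kernel has
// already mapped into its address space; purely illustrative.
#define REG_STATUS  0x00
#define REG_COMMAND 0x04

static inline uint32_t mmio_read32(volatile uint8_t *base, uint32_t off) {
    // volatile forces an actual load from the device register,
    // rather than a cached or optimized-away access.
    return *(volatile uint32_t *)(base + off);
}

static inline void mmio_write32(volatile uint8_t *base, uint32_t off, uint32_t val) {
    *(volatile uint32_t *)(base + off) = val;  // store goes straight to the device
}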

PCI Express (PCIe)
- Switched, point-to-point connection
  - Each card has a dedicated link to the central switch; no bus arbitration
  - Packet-switched messages form virtual channels
  - Prioritized packets for QoS (e.g., real-time video streaming)

PCIe Links and Lanes
- Each link consists of one or more lanes
  - Each lane is 1 bit wide (4 wires; each 2-wire pair can transmit 2.5 Gb/s in one direction)
  - Upstream and downstream are now simultaneous and symmetric
- Each link can combine 1, 2, 4, 8, 12, or 16 lanes: x1, x2, etc.
- Each data byte is 8b/10b encoded into 10 bits with an equal number of 1s and 0s; the net data rate is 2 Gb/s per lane each way
- Thus, the net data rates are 250 MB/s (x1), 500 MB/s (x2), 1 GB/s (x4), 2 GB/s (x8), and 4 GB/s (x16), each way
- (These are PCIe 1.x rates; PCIe 2.0 doubles the per-lane signaling rate to 5 GT/s, i.e., 500 MB/s per lane each way)
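As a quick sanity check on these figures, the short sketch below recomputes the net per-direction bandwidth from the 2.5 Gb/s lane rate, the 8b/10b overhead, and the lane count; the function name is an arbitrary choice for illustration.

#include <stdio.h>

// Net PCIe 1.x bandwidth per direction, in MB/s:
// raw lane rate 2.5 Gb/s, 8b/10b encoding keeps 8 of every 10 bits,
// and 8 bits make one byte.
static double pcie1_net_MBps(int lanes) {
    double raw_Mbps = 2500.0;                 // 2.5 Gb/s per lane
    double net_Mbps = raw_Mbps * 8.0 / 10.0;  // 2 Gb/s after 8b/10b
    return lanes * net_Mbps / 8.0;            // bits -> bytes
}

int main(void) {
    int widths[] = {1, 2, 4, 8, 16};
    for (int i = 0; i < 5; ++i)
        printf("x%-2d : %.0f MB/s each way\n", widths[i], pcie1_net_MBps(widths[i]));
    return 0;  // prints 250, 500, 1000, 2000, 4000
}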

8b/10b Encoding
- The goal is to maintain DC balance while having sufficient state transitions for clock recovery
  - The difference between the number of 1s and 0s in a 20-bit stream should be at most 2
  - There should be no more than 5 consecutive 1s or 0s in any stream
  - 00000000, 00000111, 11000001: bad; 01010101, 11001100: good
- Find 256 good patterns among the 1024 total 10-bit patterns to encode each 8-bit value
- A 25% overhead
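To make the two constraints concrete, the sketch below checks whether a 10-bit codeword has a bounded disparity (the count of 1s minus the count of 0s is 0 or +/-2) and no run of more than five identical bits; this only illustrates the rules above and is not the actual 8b/10b code table.

#include <stdbool.h>
#include <stdio.h>

// Check the two 8b/10b-style constraints on a 10-bit codeword
// (bits 0..9 of 'code'): bounded disparity and limited run length.
static bool is_valid_codeword(unsigned code) {
    int ones = 0, run = 1, max_run = 1;
    int prev = code & 1;
    for (int i = 0; i < 10; ++i) {
        int bit = (code >> i) & 1;
        ones += bit;
        if (i > 0) {
            run = (bit == prev) ? run + 1 : 1;
            if (run > max_run) max_run = run;
            prev = bit;
        }
    }
    int disparity = ones - (10 - ones);  // #1s minus #0s
    bool balanced = (disparity == 0 || disparity == 2 || disparity == -2);
    return balanced && max_run <= 5;     // no run longer than 5 bits
}

int main(void) {
    printf("%d\n", is_valid_codeword(0x155));  // 0101010101 -> valid
    printf("%d\n", is_valid_codeword(0x000));  // 0000000000 -> invalid
    return 0;
}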

PCIe PC Architecture
- PCIe forms the interconnect backbone
  - The northbridge and southbridge are both PCIe switches
  - Some southbridge designs have a built-in PCI-to-PCIe bridge to support old PCI cards
  - Some PCIe I/O cards are PCI cards with a PCI-to-PCIe bridge
- Source: Jon Stokes, PCI Express: An Overview, http://arstechnica.com/articles/paedia/hardware/pcie.ars

GeForce 7800 GTX Board Details
- SLI connector
- Single-slot cooling
- S-Video TV out
- 2x DVI
- 16x PCI Express
- 256 MB / 256-bit GDDR3 at 600 MHz (8 pieces of 8M x 32)

HyperTransport Feeds and Speeds
- Primarily a low-latency, direct chip-to-chip interconnect; supports mapping to board-to-board interconnects such as PCIe
- HyperTransport 1.0 specification
  - 800 MHz max; 12.8 GB/s aggregate bandwidth (6.4 GB/s each way)
- HyperTransport 2.0 specification
  - Added PCIe mapping
  - 1.0-1.4 GHz clock; 22.4 GB/s aggregate bandwidth (11.2 GB/s each way)
- HyperTransport 3.0 specification
  - 1.8-2.6 GHz clock; 41.6 GB/s aggregate bandwidth (20.8 GB/s each way)
  - Added AC coupling to extend HyperTransport over longer distances as a system-to-system interconnect
- Courtesy of the HyperTransport Consortium. Source: White Paper: AMD HyperTransport Technology-Based System Architecture

PCIe 3
- A total of 8 gigatransfers per second (GT/s) per lane in each direction
- No more 8b/10b encoding; instead, a polynomial (scrambling) transformation at the transmitter and its inverse at the receiver achieve the same effect with far less overhead
- The effective bandwidth is therefore roughly double that of PCIe 2 (about 1 GB/s per lane each way)

PCIe Data Transfer Using DMA
- DMA (Direct Memory Access) is used to fully utilize the bandwidth of an I/O bus
  - DMA uses physical addresses for source and destination
  - Transfers the number of bytes requested by the OS
  - Needs pinned memory
- (Figure: the DMA engine on the GPU card, or other I/O card, transfers data between main memory (DRAM) on the CPU side and the GPU's global memory.)

Pinned Memory
- DMA uses physical addresses
  - The OS could accidentally page out data that is being read or written by a DMA and page another virtual page into the same physical location
  - Pinned memory cannot be paged out
- If the source or destination of a cudaMemcpy() in host memory is not pinned, it first needs to be copied to a pinned buffer: extra overhead
- cudaMemcpy is much faster when the host source or destination is pinned

Allocate/Free Pinned Memory (a.k.a. Page-Locked Memory)
- cudaHostAlloc(): three parameters
  - Address of the pointer to the allocated memory
  - Size of the allocated memory in bytes
  - Option: use cudaHostAllocDefault for now
- cudaFreeHost(): one parameter
  - Pointer to the memory to be freed
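A minimal sketch of the allocate/use/free pattern described above, assuming an arbitrary 64 MB buffer; the variable names are illustrative.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t bytes = 64 * 1024 * 1024;  // 64 MB, arbitrary size
    float *h_pinned = NULL;

    // Address of the pointer, size in bytes, option (cudaHostAllocDefault for now)
    cudaError_t err = cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        printf("cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... use h_pinned like memory returned by malloc(), e.g. as the
    // host side of cudaMemcpy() calls ...

    cudaFreeHost(h_pinned);  // one parameter: pointer to the memory to free
    return 0;
}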

Using Pinned Memory
- Use the allocated memory and its pointer the same way as those returned by malloc()
  - The only difference is that the allocated memory cannot be paged out by the OS
- The cudaMemcpy function should be about 2x faster with pinned memory
- Pinned memory is a limited resource; oversubscribing it can have serious consequences
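To see the speedup on a particular system, one can time the same host-to-device transfer from a pageable and a pinned buffer, as in the hedged sketch below; the 2x figure above is a rule of thumb, and the measured ratio will vary with the platform and transfer size.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Time one host-to-device copy from 'src' using CUDA events, in milliseconds.
static float time_h2d(void *d_dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(d_dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void) {
    const size_t bytes = 256 * 1024 * 1024;  // 256 MB, arbitrary
    void *d_buf, *h_pageable, *h_pinned;

    cudaMalloc(&d_buf, bytes);
    h_pageable = calloc(bytes, 1);                         // ordinary pageable memory
    cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault); // pinned (page-locked) memory

    printf("pageable: %.2f ms\n", time_h2d(d_buf, h_pageable, bytes));
    printf("pinned  : %.2f ms\n", time_h2d(d_buf, h_pinned, bytes));

    cudaFreeHost(h_pinned);
    free(h_pageable);
    cudaFree(d_buf);
    return 0;
}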

Important Trends
- Knowing yesterday, today, and tomorrow
  - The PC world is becoming flatter
  - CPU and GPU are being fused together
  - Outsourcing of computation is becoming easier

ANY MORE QUESTIONS?