Multi2sim Kepler: A Detailed Architectural GPU Simulator


Multi2sim Kepler: A Detailed Architectural GPU Simulator Xun Gong, Rafael Ubal, David Kaeli Northeastern University Computer Architecture Research Lab Department of Electrical and Computer Engineering Northeastern University Boston, MA

WHY USE SIMULATORS Designing and fabricating chips is expensive, and a significant portion of the cost of delivering a new chip involves design verification/validation. It may take many years to fully test a new microarchitecture, and it is challenging to predict performance and power prior to silicon. Simulators leverage software to evaluate models of proposed designs: they support design space exploration, allow validation before hardware becomes available, and allow software developers to evaluate and optimize performance.

BACKGROUND GPUs have become pervasive in high-performance and data-center environments. Simulation is one of the key toolsets computer architects use to evaluate future designs. Given the rapid growth in GPU computing, the research community requires accurate GPU simulation tools.

BACKGROUND Multi2Sim models AMD Evergreen/Southern Islands GPUs, and GPGPU-Sim models NVIDIA Fermi. But what about NVIDIA Kepler?

INTRODUCTION MULTI2SIM SIMULATION FRAMEWORK A simulator for CPU, GPU, and heterogeneous systems. Support for CPU architectures: x86, ARM, and MIPS. Support for GPU architectures: AMD Southern Islands, NVIDIA Kepler. Support for the HSA Intermediate Language. Based on C++11. Large user base and open-source developer community. Maintained through GitHub (https://github.com/multi2sim).

INTRODUCTION MULTI2SIM SIMULATION FRAMEWORK

                            Disasm.   Emulation     Timing Sim.   Visual tool
  ARM                       yes       in progress   -             -
  MIPS                      yes       in progress   -             -
  x86                       yes       yes           yes           yes
  AMD Southern Islands      yes       yes           yes           yes
  NVIDIA Kepler             yes       yes           yes           in progress
  HSA Intermediate Language yes       yes           in progress   in progress

Available in Multi2Sim 5.0: NVIDIA Kepler, Southern Islands, and x86 supported; three other CPU/GPU architectures in progress.

INTRODUCTION MULTI2SIM SIMULATION FRAMEWORK Modular implementation: four clearly separated software modules per architecture (x86, MIPS, Kepler, etc.). Each module provides a standard interface for stand-alone execution or for interaction with other modules.

Outline Introduction & Background CUDA Execution Kepler simulation Evaluation Conclusions

CUDA EXECUTION SIMULATION LEVEL SASS: NVIDIA Shader Assembly, the native GPU ISA. PTX: a higher-level intermediate language defined by NVIDIA. SASS code changes with each generation of NVIDIA GPU, while PTX code is architecture-independent. Multi2Sim Kepler is designed to support NVIDIA SASS.

CUDA EXECUTION SIMULATION LEVEL PTX execution is very different from SASS execution.

CUDA EXECUTION SIMULATION LEVEL It is important to run SASS: the number of registers is limited in SASS, while PTX uses virtual registers and is effectively unlimited. Schedulers face more restrictions when working at the SASS level, and more ISA-specific issues can be considered when running SASS. SASS simulation is much closer to actual execution on recent GPUs (e.g., Kepler GPUs).
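One concrete consequence of SASS's finite register file is occupancy: per-thread register use bounds how many warps an SMX can keep resident, which PTX's virtual registers cannot express. A minimal sketch of that calculation, using the published Kepler SMX limits (the helper name is ours):

```python
# Illustrative sketch (not Multi2Sim code): why SASS register limits matter.
# The SMX figures below are NVIDIA's published Kepler limits.

REGISTERS_PER_SMX = 65536   # 32-bit registers per Kepler SMX
MAX_WARPS_PER_SMX = 64      # hardware cap on resident warps
THREADS_PER_WARP = 32

def max_resident_warps(regs_per_thread: int) -> int:
    """Warps an SMX can keep resident given per-thread register use."""
    if regs_per_thread == 0:
        return MAX_WARPS_PER_SMX
    warps_by_regs = REGISTERS_PER_SMX // (regs_per_thread * THREADS_PER_WARP)
    return min(MAX_WARPS_PER_SMX, warps_by_regs)

# At 32 registers/thread the SMX can be fully occupied; at 128 it cannot.
print(max_resident_warps(32))   # 64
print(max_resident_warps(128))  # 16
```

A PTX-level simulator never sees this pressure, because register allocation only happens when PTX is lowered to SASS.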

CUDA EXECUTION CUDA SUPPORT ON MULTI2SIM The figure shows the modular organization of the CUDA execution framework, based on four software/hardware entities. In each case, we compare native execution with simulated execution.

CUDA EXECUTION SIMULATION CHALLENGES Driver & runtime APIs: implement our own CUDA driver & runtime APIs. ISA level: reverse engineer the whole Kepler ISA, since no public documentation is available. Microarchitecture: implement benchmarks to reverse engineer and test all hardware-related specifications.
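The first challenge above, supplying a simulator-side CUDA runtime, amounts to trapping guest API calls and servicing them against simulated device state. A hedged sketch of that idea; all class, method, and table names here are hypothetical, not Multi2Sim's actual ABI:

```python
# Toy model of a simulator-side CUDA runtime: guest API calls are trapped
# and dispatched to handlers that mutate simulated device memory.

class SimDevice:
    def __init__(self):
        self.mem = {}           # simulated device allocations
        self.next_ptr = 0x1000  # arbitrary base for fake device pointers

    def cuda_malloc(self, size):
        ptr = self.next_ptr
        self.mem[ptr] = bytearray(size)
        self.next_ptr += size
        return ptr

    def cuda_memcpy_h2d(self, dst, data):
        self.mem[dst][:len(data)] = data

# Dispatch table: guest-visible API name -> simulator handler
DISPATCH = {"cudaMalloc": SimDevice.cuda_malloc,
            "cudaMemcpyHtoD": SimDevice.cuda_memcpy_h2d}

dev = SimDevice()
ptr = DISPATCH["cudaMalloc"](dev, 4)
DISPATCH["cudaMemcpyHtoD"](dev, ptr, b"\x01\x02\x03\x04")
print(hex(ptr), bytes(dev.mem[ptr]))
```

The real framework must of course match the binary calling conventions of the CUDA runtime and driver libraries; this sketch only illustrates the dispatch structure.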

Outline Introduction & Background CUDA support on Multi2Sim Kepler simulation Evaluation Conclusions

KEPLER SIMULATION DISASSEMBLER & EMULATOR

KEPLER SIMULATION DISASSEMBLER & EMULATOR Disassembler: reads a CUDA binary file and dumps a text-based output of all fragments of GPU ISA code found in the file; outputs SASS (shader assembly) instructions one by one to the emulator. Emulator: reads instructions from the disassembler and reproduces the original behavior of a guest program; provides instruction information to the timing simulator. Supports the CUDA SDK 6.5 benchmark suite (21 benchmarks supported); other benchmark suites will be supported in the future.
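The disassembler-to-emulator flow above can be sketched as a fetch-decode-execute loop. The instruction tuples and register-file handling below are invented for illustration (the real simulator decodes binary Kepler SASS encodings), though MOV32I and IADD are genuine SASS opcode names:

```python
# Toy disassembler/emulator pair mirroring the pipeline described above.

def disassemble(binary):
    """Toy 'disassembler': yields decoded (op, dst, a, b) tuples one by one."""
    for word in binary:
        yield word

def emulate(instructions):
    """Toy emulator: applies each instruction to a register file."""
    regs = {}
    executed = 0
    for op, dst, a, b in instructions:
        if op == "MOV32I":          # load a 32-bit immediate into dst
            regs[dst] = a
        elif op == "IADD":          # integer add of two registers
            regs[dst] = regs[a] + regs[b]
        executed += 1               # statistic fed to the timing simulator
    return regs, executed

prog = [("MOV32I", "R0", 2, None),
        ("MOV32I", "R1", 3, None),
        ("IADD",   "R2", "R0", "R1")]
regs, n = emulate(disassemble(prog))
print(regs["R2"], n)  # 5 3
```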

KEPLER SIMULATION TIMING SIMULATOR

KEPLER SIMULATION TIMING SIMULATION Support for detailed architectural models of GPU hardware components: SMs, warp schedulers, execution units, memory, etc. Support for instruction pipeline exploration: pipelines for different kinds of instructions, such as integer, floating point, and control flow. Provides architecture-related statistics: cache misses/hits, instructions retired, occupancy, etc.

KEPLER SIMULATION EMULATOR Produces CUDA kernel results: emulates instructions and updates registers and memory. Produces execution statistics: number of executed grids and blocks, dynamic instruction mix of the kernel, etc. Produces an ISA-level trace: instruction emulation trace.

KEPLER SIMULATION ARCHITECTURAL SIMULATION Models SMs, memory hierarchy and other hardware details Maps thread blocks onto SMs and warp pools Emulates instructions and propagates state through the execution pipelines Models resource usage and contention
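The block-to-SM mapping described above can be sketched as a round-robin assignment in which blocks land on SMs that still have free warp-pool slots, and leftover blocks wait for resources. Capacities and function names below are illustrative, not Multi2Sim's:

```python
# Toy round-robin mapping of thread blocks onto SMs with bounded capacity.

def map_blocks(num_blocks, num_sms, max_blocks_per_sm):
    assignment = {sm: [] for sm in range(num_sms)}
    pending = list(range(num_blocks))
    sm = 0
    while pending:
        tried = 0
        # advance to the next SM with a free slot; stop if all are full
        while len(assignment[sm]) >= max_blocks_per_sm:
            sm = (sm + 1) % num_sms
            tried += 1
            if tried == num_sms:
                return assignment, pending  # leftover blocks wait
        assignment[sm].append(pending.pop(0))
        sm = (sm + 1) % num_sms
    return assignment, pending

assignment, waiting = map_blocks(num_blocks=5, num_sms=2, max_blocks_per_sm=2)
print(assignment, waiting)  # {0: [0, 2], 1: [1, 3]} [4]
```

In the real simulator the per-SM limit is itself derived from register, shared-memory, and warp-pool usage, which is where the contention modeling comes in.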

KEPLER SIMULATION MULTI2SIM KEPLER ADVANTAGES Support for CPU-GPU heterogeneous simulation. Support for NVIDIA Kepler native SASS execution. Support for detailed NVIDIA Kepler microarchitectural exploration.

Outline Introduction & Background CUDA support on Multi2Sim Kepler simulation Evaluation Conclusions

EVALUATION Emulator statistics: number of instructions executed, instruction classification, percentage of each kind of instruction.

EVALUATION Average execution time for different input sets on each benchmark. In general, there is good fidelity with the K20X. HM is an outlier, since it uses st.wt and ld.cv instructions, which change the cache policy.

EVALUATION Input sizes: From 1K to 128K

EVALUATION Input size: From 128x128, to 1024x1024

EVALUATION Input sizes: From 32K to 1M

EVALUATION Performance achieved by changing the number of lanes for each SPU per SMX. MatrixTranspose shows greater speedup than VectorAdd, because it is less memory sensitive.
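The scaling contrast above has a simple roofline-style explanation: adding lanes shrinks only the compute term, so a kernel whose time is dominated by memory stops improving. A sketch with arbitrary work units, not measured K20X numbers:

```python
# Illustrative bottleneck model: kernel time is bounded by whichever of
# compute or memory dominates. Workload values are made up.

def exec_time(compute_work, mem_time, lanes):
    return max(compute_work / lanes, mem_time)

def speedup(compute_work, mem_time, base_lanes, new_lanes):
    return (exec_time(compute_work, mem_time, base_lanes)
            / exec_time(compute_work, mem_time, new_lanes))

# A compute-dominated kernel keeps scaling when lanes double...
print(speedup(compute_work=64, mem_time=1, base_lanes=4, new_lanes=8))  # 2.0
# ...while a memory-bound one sees no benefit at all.
print(speedup(compute_work=8, mem_time=2, base_lanes=4, new_lanes=8))   # 1.0
```

This is only a first-order model; the simulator captures the intermediate regimes through its memory-hierarchy and pipeline contention modeling.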

Outline Introduction & Background CUDA support on Multi2Sim Kepler simulation Evaluation Conclusions

CONCLUSIONS Summary: Presented Multi2Sim Kepler, a detailed performance simulator supporting NVIDIA Kepler SASS execution. Provided example architectural studies exploring the Kepler GPU microarchitecture. Showed the benefits of the infrastructure by evaluating application characteristics. Future work: support more benchmarks, implement new CUDA runtime and driver APIs, and improve the accuracy of our simulator, focusing on the memory model.

Thank you! Questions? * This work is supported in part by NSF Grant CNS-1525412, and through generous donations from NVIDIA, AMD and the Heterogeneous Systems Foundation.