Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling


Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling
Larisa Stoltzfus*, Murali Emani, Pei-Hung Lin, Chunhua Liao
*University of Edinburgh (UK), Lawrence Livermore National Laboratory
MCHPC'18: Workshop on Memory Centric High Performance Computing
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Complex Memory Hierarchy on GPUs
GPUs can greatly improve the performance of HPC applications, but they can be difficult to optimize for because of their complex memory hierarchy. Memory hierarchies can change drastically from generation to generation, so code optimized for one platform may not retain optimal performance when ported to another.

Performance can vary widely depending on data placement as well as platform.
[Figure: speedup of memory-placement variants of matrix-matrix multiplication and sparse matrix-vector multiplication across Kepler, Maxwell, Pascal, and Volta GPUs.]

Challenges
Different memory variants (global, constant, texture, shared) can have a significant impact on program performance, but identifying the best-performing variant is a non-obvious and complex decision to make. Given a default global variant, can the best-performing memory variant be determined automatically?
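To make the variants concrete, here is a minimal CUDA sketch (ours, not from the talk; the kernel and all names are invented) of the same 3-point stencil written against global memory and against shared plus constant memory. The placement question is which version runs fastest on a given GPU.

// Hypothetical example: one computation, two data placements.
// TILE must equal blockDim.x.
#include <cstdio>
#include <cuda_runtime.h>

#define N    1024
#define TILE 256

__constant__ float c_coeff[3];   // constant-memory copy of the stencil weights

// Global-memory variant: input and weights both read from global memory.
__global__ void stencil_global(const float *in, float *out, const float *coeff) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < N - 1)
        out[i] = coeff[0] * in[i - 1] + coeff[1] * in[i] + coeff[2] * in[i + 1];
}

// Shared/constant variant: a tile of the input is staged in shared memory,
// and the weights are broadcast from the constant cache.
__global__ void stencil_shared(const float *in, float *out) {
    __shared__ float tile[TILE + 2];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;
    if (i < N) tile[t] = in[i];
    if (threadIdx.x == 0 && i > 0)                  tile[0]        = in[i - 1];
    if (threadIdx.x == blockDim.x - 1 && i < N - 1) tile[TILE + 1] = in[i + 1];
    __syncthreads();
    if (i > 0 && i < N - 1)
        out[i] = c_coeff[0] * tile[t - 1] + c_coeff[1] * tile[t] + c_coeff[2] * tile[t + 1];
}

int main() {
    float *in, *out, *d_coeff, h_coeff[3] = {0.25f, 0.5f, 0.25f};
    cudaMalloc(&in,  N * sizeof(float));
    cudaMalloc(&out, N * sizeof(float));
    cudaMalloc(&d_coeff, sizeof(h_coeff));
    cudaMemset(in, 0, N * sizeof(float));
    cudaMemcpy(d_coeff, h_coeff, sizeof(h_coeff), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff));

    stencil_global<<<N / TILE, TILE>>>(in, out, d_coeff);  // variant 1
    stencil_shared<<<N / TILE, TILE>>>(in, out);           // variant 2
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}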

Proposed Solution
Use machine learning to develop a predictive model that determines the best data placement for a given application on a particular platform, then use the model to predict the best placement at run time. The approach involves three stages: offline training, feature and model selection, and online inference.

Approach
The approach is split into an offline training phase and an online inference phase (see the CUPTI sketch after this slide):
Offline training: representative kernels are written in four program variants (global, shared, constant, texture), and training data is collected from nvprof metrics and events plus hardware information, with each sample labelled by its best-performing variant.
Model building: the best version, features, and model are determined from the training data, yielding a classifier.
Online inference: features are extracted at run time using CUPTI and fed to the classifier, which predicts the best placement.
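As a concrete illustration of the online-inference plumbing, below is a condensed CUDA/CUPTI sketch (ours, not the authors' code) that reads one nvprof-style metric, achieved_occupancy, around a launch of an invented probe_kernel. It assumes the metric fits in a single collection pass with at most 64 events, and omits error checking and the per-domain event normalization a production harness would need.

#include <cstdio>
#include <cuda.h>
#include <cupti.h>

__global__ void probe_kernel(float *x) {               // kernel being measured
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] *= 2.0f;
}

int main() {
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    // Look up the metric the model wants as an input feature.
    CUpti_MetricID metric;
    cuptiMetricGetIdFromName(dev, "achieved_occupancy", &metric);

    // Create and enable the event groups that feed this metric
    // (assumed to fit in one pass; real code loops over passes->numSets).
    CUpti_EventGroupSets *passes;
    cuptiMetricCreateEventGroupSets(ctx, sizeof(metric), &metric, &passes);
    cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_KERNEL);
    cuptiEventGroupSetEnable(&passes->sets[0]);

    float *x;
    cudaMalloc(&x, 1024 * sizeof(float));
    probe_kernel<<<4, 256>>>(x);
    cudaDeviceSynchronize();

    // Read back the raw event counts the metric is derived from.
    CUpti_EventGroup group = passes->sets[0].eventGroups[0];
    uint32_t numEvents = 0;
    size_t attrSize = sizeof(numEvents);
    cuptiEventGroupGetAttribute(group, CUPTI_EVENT_GROUP_ATTR_NUM_EVENTS,
                                &attrSize, &numEvents);

    CUpti_EventID ids[64];                              // assumes <= 64 events
    uint64_t vals[64];
    size_t idsBytes  = numEvents * sizeof(CUpti_EventID);
    size_t valsBytes = numEvents * sizeof(uint64_t);
    size_t numRead   = 0;
    cuptiEventGroupReadAllEvents(group, CUPTI_EVENT_READ_FLAG_NONE,
                                 &valsBytes, vals, &idsBytes, ids, &numRead);
    cuptiEventGroupSetDisable(&passes->sets[0]);

    // Combine the events into the metric value (duration placeholder: 1 ns).
    CUpti_MetricValue value;
    cuptiMetricGetValue(dev, metric, idsBytes, ids, valsBytes, vals, 1, &value);
    printf("achieved_occupancy = %f\n", value.metricValueDouble);
    return 0;
}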

Methodology
To build the model, four generations of NVIDIA GPUs were used: Kepler, Maxwell, Pascal, and Volta. The training set covered 8 programs × 3 input data sizes × 3 thread block sizes × 4 memory variants, i.e. 288 configurations per platform, drawn from benchmarks such as MD, SPMV, CFD, MM, ConvolutionSeparable, and ParticleFilter.

Offline Training
Metric and event data from nvprof for the global variant were collected along with hardware data, and the best-performing variant (the class label) for each version run was appended. Benchmarks were run 10 times on each platform, with 5 initial iterations to warm up the GPU.
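As an illustration of the labelling step (a minimal sketch under our own naming, not the authors' harness), each configuration's record pairs the profiled features of the global run with the argmin over the four variants' measured runtimes:

#include <cfloat>
#include <vector>

enum Placement { GLOBAL, SHARED, CONSTANT, TEXTURE };

struct TrainingSample {
    std::vector<double> features;  // nvprof metrics/events of the global run + HW info
    Placement label;               // best-performing variant for this configuration
};

// times[v] = runtime of variant v for one (program, input size, block size).
Placement best_variant(const double times[4]) {
    Placement best = GLOBAL;
    double t = DBL_MAX;
    for (int v = 0; v < 4; ++v)
        if (times[v] < t) { t = times[v]; best = (Placement)v; }
    return best;
}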

Feature Selection
The number of features was narrowed down from 241 to 16 using a correlation-based feature selection (CFS) algorithm. A partial list:
achieved_occupancy: ratio of average active warps to the maximum number of warps
l2_read_transactions, l2_write_transactions: memory read/write transactions at the L2 cache
gld_throughput: global memory load throughput
warp_execution_efficiency: ratio of average active threads to the maximum number of threads
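For reference (our addition; the slide names the algorithm but not the formula), CFS scores a candidate subset S of k features with Hall's merit heuristic:

Merit(S) = k * r_cf / sqrt(k + k(k-1) * r_ff)

where r_cf is the mean feature-class correlation over S and r_ff the mean feature-feature intercorrelation, so the search keeps features that are predictive of the best variant yet not redundant with one another.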

Model Selection
10-fold cross validation was used during evaluation. Overall, decision tree classifiers showed great promise (>95% prediction accuracy):
Classifier: Prediction Accuracy (%)
RandomForest: 95.7
LogitBoost: 95.5
IterativeClassifierOptimizer: 95.5
SimpleLogistic: 95.4
JRip: 95.0

Runtime Prediction
The classifier JRip was selected from the group of top five performing classifier models. JRip is a propositional rule learner, which yields a decision-tree-like set of rules. The model reads its input from CUPTI calls (the API underlying nvprof), which can access hardware counters in real time, and outputs its predicted class (see the hypothetical rule sketch below).
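For intuition, the decision procedure such a rule learner emits has roughly the following shape when compiled into host code. The feature names come from the selected set above, but every threshold here is invented for illustration, not taken from the trained model.

// Illustrative only: the *shape* of a JRip-style rule set as a placement
// decision. All thresholds are made up.
enum Placement { GLOBAL, SHARED, CONSTANT, TEXTURE };

struct Features {                     // subset of the 16 selected features
    double achieved_occupancy;        // gathered via CUPTI at run time
    double gld_throughput_gbps;
    double l2_read_transactions;
    double warp_execution_efficiency;
};

Placement predict_placement(const Features &f) {
    // Each branch stands in for one induced rule; the final rule is the
    // default class (global memory, a plausible default variant).
    if (f.gld_throughput_gbps > 120.0 && f.achieved_occupancy < 0.45)
        return TEXTURE;
    if (f.l2_read_transactions > 5.0e6 && f.warp_execution_efficiency > 0.90)
        return SHARED;
    if (f.gld_throughput_gbps < 20.0)
        return CONSTANT;
    return GLOBAL;
}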

Preliminary Results
[Figure: percentage of correct predictions per memory type (constant, global, shared, texture).]
Results from this initial exploration show that there is great potential for predictive modeling for data placement on GPUs. Overall, 95% accuracy is achievable, and accuracy is higher still for cases where global or texture memory is the best performer.

Runtime Validation
The JRip model was tested on a new benchmark, an acoustic application. The model correctly predicted the best-performing version on two platforms.

Limitations
Currently, all versions need to be pre-compiled for run-time prediction; ideally, the model would be built into a compiler. CUPTI calls are slow and require as many iterations as there are metrics and events to collect. This is acceptable for benchmarks with many iterations, but other kinds of applications would need a workaround.

Conclusion
Machine learning has shown great potential for data placement prediction on a range of applications. More work needs to be done to acquire hardware counters from applications in a timely manner. The approach could be reused for other optimizations, such as data layouts.
