Genetically Improved CUDA C++ Software W. B. Langdon Computer Science, UCL, London 26.4.2014

Genetically Improved CUDA C++ Software W. B. Langdon Centre for Research on Evolution, Search and Testing Computer Science, UCL, London GISMOE: Genetic Improvement of Software for Multiple Objectives

Real-time stereo vision: 200 frames at 30 fps = 6.7 sec. Video. W. B. Langdon, UCL 3

GP-improved Tesla K20c: 200 frames, 710 microseconds each. Video. W. B. Langdon, UCL 4

Genetic Programming to Improve Legacy Stereo Image Processing Software. Target software: StereoCamera. Target hardware: 6 graphics cards (GPUs). Manual and auto changes: tuning, GP code changes. Results: 1.05 to 6.8 times faster. CUDA is nvidia's C++ dialect for nvidia's hardware. 5

GP Automatic Coding. Target non-trivial open source system: stereo image processing GPU code. Tailor existing system for specific use: Microsoft I2I 320x240 grey stereo pairs, 6 different GPUs (16 to 2496 cores). Run existing code and GP back-to-back. Fitness: difference between the original code's answer and GP's, and speed of evolved code. Remove bloat. 6

Why StereoCamera? StereoCamera is high quality CUDA code. Since it was written (2007) there have been changes to both hardware and software. Target: stereokernel. Full kernel is 276 lines of CUDA. Expanded existing code by hand: XHALO, DPER. Tailor existing system for specific cases: Microsoft I2I image database. W. B. Langdon, UCL 7

CUDA GPU stereokernel C++ code. For each left pixel find the horizontal offset of the best corresponding right pixel. Not to scale. Minimise sum of squared differences over an 11x11 window.
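
As a rough illustration of what stereokernel computes, here is a minimal sketch of the block matching idea in CUDA. It is not Stam's StereoCamera code: the kernel name, the edge clamping and the plain global-memory accesses are assumptions for illustration only.

// Sketch: for each left pixel try every disparity 0..STEREO_MAXD and keep
// the one that minimises the sum of squared differences over an 11x11 window.
#include <limits.h>

#define STEREO_MAXD 50
#define RADIUS 5                               // 11x11 window

__global__ void ssd_stereo(const unsigned char *left,
                           const unsigned char *right,
                           int *disparity, int width, int height)
{
  int X = blockIdx.x * blockDim.x + threadIdx.x;
  int Y = blockIdx.y * blockDim.y + threadIdx.y;
  if (X < width && Y < height) {
    int best_ssd = INT_MAX, best_d = 0;
    for (int d = 0; d <= STEREO_MAXD; d++) {
      int ssd = 0;
      for (int j = -RADIUS; j <= RADIUS; j++)
        for (int i = -RADIUS; i <= RADIUS; i++) {
          int xl = min(max(X + i, 0), width - 1);       // clamp at image edges
          int xr = min(max(X + i - d, 0), width - 1);
          int y  = min(max(Y + j, 0), height - 1);
          int diff = left[y * width + xl] - right[y * width + xr];
          ssd += diff * diff;
        }
      if (ssd < best_ssd) { best_ssd = ssd; best_d = d; }
    }
    disparity[Y * width + X] = best_d;                  // offset of best match
  }
}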

Six Types of nvidia GPUs: Parallel Graphics Hardware
Name             Year  Compute capability  MP  Cores/MP  Total cores  Clock
Quadro NVS 290   2007  1.1                  2     8          16       0.92 GHz
GeForce GTX 295  2009  1.3                 30     8         240       1.24 GHz
Tesla T10        2009  1.3                 30     8         240       1.30 GHz
Tesla C2050      2010  2.0                 14    32         448       1.15 GHz
GeForce GTX 580  2010  2.0                 16    32         512       1.54 GHz
Tesla K20c       2012  3.5                 13   192        2496       0.71 GHz
W. B. Langdon, UCL 9

Evolving stereokernel. Convert code to grammar. Grammar used to control modifications to code. Genetic programming manipulates patches: copy/delete/insert lines of existing code. Patch is small. New kernel source is syntactically correct: essentially no compilation errors. W. B. Langdon, UCL 10

Before GP. As a result of initial GP runs we discovered: the 2 objectives, low error and fast, are too different; the existing code was badly tuned. Therefore: abandoned multi-objective GP (NSGA-II fails); one objective: go faster with zero error; pre- and post-tune 2 or 3 key parameters. GP now optimises both configuration parameters (12) and code (variable length). W. B. Langdon, UCL 11

Images split into tiles, each tile run in parallel. STEREO_MAXD = 50. BLOCK_W = 64 (5 columns). ROWSperTHREAD = 40 (6 rows), tuned to 5 (48 rows). 48x5 = 240 tiles (was 30).
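
For illustration only, the tile counts on this slide follow directly from the image size and the two parameters. This is a sketch assuming a simple grid launch; tile_grid is a hypothetical helper, not the original host code.

// Sketch: how BLOCK_W and ROWSperTHREAD tile a 320x240 image.
#include <cuda_runtime.h>

dim3 tile_grid(int width, int height, int BLOCK_W, int ROWSperTHREAD)
{
  int cols = (width  + BLOCK_W       - 1) / BLOCK_W;        // 320/64 = 5 columns
  int rows = (height + ROWSperTHREAD - 1) / ROWSperTHREAD;  // 240/40 = 6 rows, 240/5 = 48 rows
  return dim3(cols, rows);       // 5*6 = 30 tiles originally, 5*48 = 240 after tuning
}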

GP Evolving Patches to CUDA W. B. Langdon, UCL 13

BNF Grammar for code changes
Lines 326-328 Stereo_union.cuh:
if(X < width && Y < height)
{
dmin = d;
<KStereo.cuh_326> ::= " if" <IF_KStereo.cuh_326> " \n"   #if
<IF_KStereo.cuh_326> ::= "(X < width && Y < height)"
<KStereo.cuh_327> ::= "{\n"
<KStereo.cuh_328> ::= "" <_KStereo.cuh_328> "\n"   #other
<_KStereo.cuh_328> ::= "dmin = d;"
Fragment of Grammar (Total 424 rules) 14

Grammar Rule Types
Type indicated by rule name. Replace a rule only by another of the same type.
55 statement (e.g. assignment, not declaration)
27 IF
<KStereo.cuh_356> ::= " if" <IF_KStereo.cuh_356> " \n"
<IF_KStereo.cuh_356> ::= "(dblockidx==0)"
10 for1, for2, for3
<KStereo.cuh_221> ::= <pragma_KStereo.cuh_221> "for(" <for1_KStereo.cuh_221> "; "OK()&& <for2_KStereo.cuh_221> ";" <for3_KStereo.cuh_221> ") \n"
<pragma_KStereo.cuh_221> ::= ""
<for1_KStereo.cuh_221> ::= "i = 0"
<for2_KStereo.cuh_221> ::= "i<=(2*radius_h)"
<for3_KStereo.cuh_221> ::= "i++"
CUDA types 15

Representation: 12 fixed configuration parameters plus a variable length list of grammar patches. Uniform crossover on the 12 fixed parameters; tree-like 2-point crossover on the grammar changes. Mutation changes one of the 12 configuration params or adds one randomly chosen grammar change. 3 possible grammar changes: delete a line of source code (or replace it by "", 0); replace with another line of stereokernel (same type); insert a copy of another stereokernel line. W. B. Langdon, UCL 16
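
A minimal sketch of such a representation in C++ (illustrative only: the type names, the parameter tweak and the random choices are assumptions, not the actual GI framework code):

// An individual is 12 fixed configuration parameters plus a variable
// length list of grammar patches (delete / replace / insert).
#include <vector>
#include <string>
#include <cstdlib>

enum EditType { DELETE, REPLACE, INSERT };

struct GrammarEdit {
  EditType type;
  std::string target;   // rule to change, e.g. "<IF_KStereo.cuh_326>"
  std::string source;   // rule copied from, e.g. "<IF_KStereo.cuh_154>" (REPLACE/INSERT)
};

struct Individual {
  int config[12];                   // fixed-length configuration parameters
  std::vector<GrammarEdit> patch;   // variable-length list of code changes
};

// Mutation: either tweak one configuration parameter or append one
// randomly chosen grammar change.
void mutate(Individual &ind, const std::vector<std::string> &rules)
{
  if (rand() % 2) {
    ind.config[rand() % 12] ^= 1;              // illustrative tweak only
  } else {
    GrammarEdit e;
    e.type   = static_cast<EditType>(rand() % 3);
    e.target = rules[rand() % rules.size()];
    e.source = rules[rand() % rules.size()];   // same-type check omitted here
    ind.patch.push_back(e);
  }
}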

Example Mutating Grammar
2 lines from grammar:
<IF_KStereo.cuh_326> ::= "(X < width && Y < height)"
<IF_KStereo.cuh_154> ::= "(dblockidx==0)"
Fragment of list of mutations:
<IF_KStereo.cuh_326><IF_KStereo.cuh_154>
Says replace line 326 by line 154.
Original code: if(X < width && Y < height)
New code: if(dblockidx==0)
17
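
One way to picture applying such a mutation (a sketch, assuming the grammar is held as a map from rule name to expansion; apply_replace is a hypothetical helper, not the published implementation):

#include <map>
#include <string>

// Overwrite the target rule's expansion with the source rule's expansion
// before regenerating the kernel source from the grammar, so that
// "(X < width && Y < height)" becomes "(dblockidx==0)" in the new kernel.
void apply_replace(std::map<std::string, std::string> &grammar,
                   const std::string &target,   // "<IF_KStereo.cuh_326>"
                   const std::string &source)   // "<IF_KStereo.cuh_154>"
{
  grammar[target] = grammar[source];
}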

StereoKernel Fitness Function. Fitness run twice. First run with error checking: CUDA memcheck (GPU dependent); force termination of excessive for loop iterations. If ok, run again for accurate timing. 18
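
The grammar on slide 15 guards every evolved for-loop condition with OK()&&. Below is a minimal sketch of one way such a guard could force termination of runaway loops; the global counter, the atomic and the limit are assumptions, not the published implementation.

#define MAX_LOOP_ITERATIONS 100000

__device__ int loop_count;       // zeroed by the host (e.g. cudaMemcpyToSymbol)
                                 // before each fitness run

__device__ bool OK()
{
  // every guarded loop iteration spends from a shared budget; once the
  // budget is exhausted all guarded loops in the kernel terminate
  return atomicAdd(&loop_count, 1) < MAX_LOOP_ITERATIONS;
}

// evolved loops are generated as:  for(i = 0; OK()&& i<=(2*radius_h); i++) { ... }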

Fitness. Run patched stereokernel on 1 example image; 96% compile and run ok. Compare results with the original answer. Sort population by: error (actually only zero-error individuals were selected); kernel GPU clock ticks (minimise). Select top half of population. Mutate and crossover to give 2 children per parent. Repeat for 50 generations. Remove bloat. Automatically tune again. 19
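
A host-side sketch of this evaluation (illustrative only: launch_patched_kernel is a hypothetical wrapper, and CUDA event timing stands in for the GPU clock-tick measurement used on the slide):

#include <cuda_runtime.h>
#include <cstring>

extern void launch_patched_kernel(int *d_disparity);   // assumed wrapper

// Run the patched kernel, compare its disparity map with the answer saved
// from the original kernel and, only if the error is zero, time a second run.
float evaluate(const int *original_answer, int *d_disparity,
               int *h_disparity, size_t n, bool *zero_error)
{
  launch_patched_kernel(d_disparity);                    // functional run
  cudaMemcpy(h_disparity, d_disparity, n * sizeof(int), cudaMemcpyDeviceToHost);
  *zero_error = (memcmp(h_disparity, original_answer, n * sizeof(int)) == 0);
  if (!*zero_error) return 1e30f;                        // only zero error is selected

  cudaEvent_t start, stop;                               // timing run
  cudaEventCreate(&start); cudaEventCreate(&stop);
  cudaEventRecord(start);
  launch_patched_kernel(d_disparity);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start); cudaEventDestroy(stop);
  return ms;                                             // minimise kernel time
}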

Results. GP run failed on old hardware (memcheck problem?). Patched/tuned code run on 3010 images. New kernels work for all images and are always faster. Speed up depends on image size: all 320x240 images speed up as much as the training cases; other sizes speed up, sometimes less. 20

stereokernel Results
GPU name         Original  Pretuned  Ratio  GP      Speedup
Quadro NVS 290   27.0ms    26.0ms    1.05   -       -
GeForce GTX 295   5.4ms     1.5ms    3.6    -       -
Tesla T10         5.3ms     1.4ms    3.7    1.4ms   3.9
Tesla C2050       4.6ms     3.0ms    1.5    1.1ms   4.1
GeForce GTX 580   3.1ms     1.6ms    1.9    0.7ms   4.2
Tesla K20c        4.4ms     1.8ms    2.4    0.6ms   6.8
Milliseconds to process 320x240 stereo image pairs. Averages over 2516 pairs.
NVS 290 and GTX 295 hardware locked up in GP run.
W. B. Langdon, UCL 21

GP can Improve Software. Existing code provides: its own de facto specification; high quality starting code; a framework for both functional fitness (does the evolved code give the right answers? unlimited number of test cases) and performance (how fast, how much power, how reliable, ...). Evolution has tuned old code for six very different graphics cards and new software. W. B. Langdon, UCL 22

END http://www.cs.ucl.ac.uk/staff/w.langdon/ http://www.epsrc.ac.uk/ W. B. Langdon, UCL 23

Tesla K20c 6.8 times faster: Fixed parameters
Parameter              Non-default configuration value
ROWSperTHREAD          5
BLOCK_W                64
DPER                   enabled
dperblock              2
XHALO                  enabled
STORE_disparityMinSSD  SHARED
STORE_disparityPixel   SHARED
Evolution decided to enable DPER and XHALO and to move disparityMinSSD and disparityPixel from global memory to shared memory. ROWSperTHREAD, BLOCK_W and dperblock were set by the post-evolution auto tuner. 24

Tesla K20c 6.8 times faster: Code changes
Removed CUDA code:
int * __restrict__ disparityMinSSD,
volatile extern __attribute__((shared)) int col_ssd[];
volatile int* const reduce_ssd = &col_ssd[(64 )*2-64];
if(X < width && Y < height)
__syncthreads();
New CUDA code:
extern __attribute__((shared)) int col_ssd[];
int* const reduce_ssd = &col_ssd[(64 )*2-64];
#pragma unroll 11
if(dblockidx==0)
#pragma unroll 3
Parameter disparityMinSSD is no longer needed as it is made shared (i.e. not global). All volatile removed. Two #pragma inserted. if() replaced. __syncthreads() removed. 25