High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem
|
|
- Peregrine Horn
- 5 years ago
- Views:
Transcription
1 High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem Kamil Rocki, PhD Department of Computer Science Graduate School of Information Science and Technology The University of Tokyo
2 Why TSP? It s hard (NP-hard to be exact) Especially to parallelize But easy to understand A basic problem in CS TSPLIB - a set of standard problems Allows comparisons of algorithms It s simple to adjust the problem size Platform for the study of general methods that can be applied to a wide range of discrete optimization problems
3 Not only salesmen s headache Computer wiring Vehicle routing Crystallography Robot control pla85900.tsp It took 15 years to solve this problem (solved in 2006)
4
5 Local/Global Optimization Usually the space is huge There are many valleys and local minima Local search might terminate in one of the local minima Global search should find the best solution
6 Iterated Global Search - Global Optimization (Approximate algorithm) Local search is followed by some sort of tour randomization and it s repeated (until there s time left) perturbation Most of the time is spent in the Local Search part cost s % % % % % % % % % % % % % % s* s* % % solution space S % % % % % kroe100 ch130 ch150 kroa200 ts225 pr299 pr439 rat783 vm1084 pr2392 Thomas Stützle, Iterated Local Search & Variable Neighborhood Search MN Summerschool, Tenerife, 2003 p.9
7 Iterated Global Search - Global Optimization (Approximate algorithm) 1: procedure I TERATED L OCAL S EARCH 2: s 0 := GenerateInitialSolution() 3: s * := 2optLocalSearch (s 0 ) GPU 4: while (termination condition not met) 5: s := P erturbation (s * ) 6: s * := 2optLocalSearch (s ) GPU 7: s * := AcceptanceCriterion (s *, s * ) 8: end while 9: end procedure
8 TSP - Big valley theory (Why does ILS work?) Local minima form clusters The better the solution, the closer it is to the global optimum x Cost Cost Global minimum is somewhere in the middle x Mean distance to other solutions (a) Distance to optimal (b) ILS worsens the tour only slightly Local Search is repeated close to the previous minimum S. Baluja, A.G. Barto, K. D. Boese, J. Boyan, W. Buntine, T. Carson, R. Caruana, D. J. Cook, S. Davies, T. Dean, T. G. Dietterich, P. J. Gmytrasiewicz, S. Hazlehurst, R. Impagliazzo, A.K. Jagota, K. E. Kim, A. Mcgovern, R. Moll, A. W. Moore, L. Sanchis, L. Su, C. Tseng, K. Tumer, X. Wang, D. H. Wolpert, Justin Boyan, Wray Buntine, Arun Jagota: Statistical Machine Learning for Large-Scale Optimization, 2000
9 2-opt local optimization Exchange a pair of edges to obtain a better solution if the following condition is true distance(b,f) + distance(g,d) > distance(b,d) + distance(g,f) A B A B C C D D E E F F G G
10 2-opt local optimization In order to find such a pair need to perform: n 2 for (int i = 1; i < points - 2; i++) for (int j = i + 1; j < points - 1; j++) 2 =(n 2) (n 3)/2 checks { if (distance(i, i-1) + distance(j+1, j) > There are some pruning techniques in more complex algorithms I analyze all possible pairs distance(i, j+1) + distance(i-1, j)) update best i and j; city problem -> 49.9 Million pairs city problem -> 5 Billion pairs } remove edges (best i, best i-1) and (best j+1, best j) add edges (best i,best j+1) and (best i-1, best j)
11 Obtaining a distance Naive approach - calculate it each time it s needed Smart way - reuse the results For a 100-city problem it is approximately 5 times faster (CPU) Problem Number Memory Memory (TSPLIB) of cities needed needed (points) for LUT for coordinates (MB) (kb) kroe ch ch kroa ts pr pr rat vm pr pcb fl fnl n 2 n
12 Obtaining a distance Naive approach - calculate it each time it s needed Smart way - reuse the results For a 100-city problem it is approximately 5 times faster (CPU) For a city problem it can be slower Cache Problem Number Memory Memory (TSPLIB) of cities needed needed (points) for LUT for coordinates (MB) (kb) kroe ch ch kroa ts pr pr rat vm pr pcb fl fnl n 2 n
13 Obtaining a distance Naive approach - calculate it each time it s needed Smart way - reuse the results For a 100-city problem it is approximately 5 times faster (CPU) For a city problem it can be slower Cache With multiple cores, it becomes slower Cache/memory Problem Number Memory Memory (TSPLIB) of cities needed needed (points) for LUT for coordinates (MB) (kb) kroe ch ch kroa ts pr pr rat vm pr pcb fl fnl n 2 n
14 LUT (Look Up Table) vs recalculation Not-so-clever algorithm Time Clever algorithm Number of cores
15 A change in algorithm design - sequential Memory is free, reuse computed results Use sophisticated algorithms, prune the search space
16 A change in algorithm design - parallel Memory is free, reuse computed results Limited memory (size, throughput), computing is free (almost) Use sophisticated algorithms, prune the search space Sometimes brute-force search might be faster (especially on GPUs) Avoiding divergence Same amount of time to process 10, 100 even elements in parallel But -> empty flops
17 GPU implementation GPU uses this simple function for EUC_2D problems device int calculatedistance2d (unsigned int i, unsigned int j, city_coords* coords) { //registers + shared memory register float dx, dy; dx = coords[i].x - coords[j].x; dy = coords[i].y - coords[j].y; } return (int)(sqrtf(dx * dx + dy * dy) + 0.5f); //6 FLOPs + sqrt
18 GPU implementation dist(i,j) = dist (j,i), dist (i,i) = 0 Example for 32 threads Matrix of city pairs to be checked for a possible swap, each pair corresponds to one GPU job N 0,1 0 0,2 1 1,2 2 0,3 3 1,3 4 2,3 5 0,4 6 1,4 7 2,4 8 3,4 9 0,5 10 1,5 11 2,5 12 3,5 13 4,5 14 0,6 15 1,6 16 2,6 17 3,6 18 4,6 19 5,6 20 0,7 21 1,7 22 2,7 23 3,7 24 4,7 25 5,7 26 6,7 27 Iteration 1 Iteration 2 0,8 28 1,8 29 2,8 30 3,8 31 4,8 32 5,8 33 6,8 34 7,8 35 0,9 37 1,9 38 2,9 39 3,9 40 4,9 41 5,9 42 6,9 43 7,9 44 8,9 45 Iteration 3 N N Iteration 4 Each thread has to check: Iteration 5 if (distance(i, i-1) + distance(j+1, j) > Iteration 6 distance(i, j+1) + distance(i-1, j)) update best i and j; N Iteration 7
19 Optimizations Preorder the coordinates in the route s order on CPU The reads from the global memory are no longer scattered - much faster 8B needed instead of 10B per city Route Array of coordinates Get coordinates of point at position n x = coordinates[route[n]].x y = coordinates[route[n]].y Total size: n * sizeof(route data type) + 2 * n * sizeof(coordinate data type) No unnecessary address calculations The problem can be split if it s too big for the shared memory s size Route + coordinates = Array of ordered coordinates Get coordinates of point at position n x = ordered_coordinates[n].x y = ordered_coordinates[n].y Total size: 2 * n * sizeof(coordinate data type)
20 GPU implementation An overview of the algorithm: 1. Copy the ordered coordinates to the GPU global memory (CPU) 2. Execute the kernel 3. Copy the ordered coordinates to the shared memory 4. Calculate swap effect of one pair 5. Find the best value and store it in the global memory 6. Read the result (CPU)
21 What if coordinates size exceeds the size of shared memory? This part requires 2*m float2 elements to be calculated Let s assume a big problem N too big for the shared memory This part requires 2*m float2 elements to be calculated This part requires 2*m float2 elements to be calculated N coordinates range (N-2m+1,N-m) coordinates range (N-m+1,N) N coordinates range (0,m) coordinates range (m+1,2m)
22 What if coordinates size exceeds the size of shared memory? Problem of size N Assuming local memory of limited size m coordinates Needed data: two arrays coordinates range (m/2+1,m) 0 N coordinates range (N-m/2+1,N) 0 N coordinates range (N-m/2+1,N) Total size: 2 * 2 * n * sizeof(coordinate data type) coordinates range (m/2+1,m) It is possible to calculate distances in arbitrary segments
23 What if coordinates size exceeds the size of shared memory? Arbitrarily big problems can be solved 2 sets of coordinates needed Work can be split and run on multiple devices device int calculatedistance2d( unsigned int i, unsigned int j, city_coords* coordsa, city_coords* coordsb) { register float dx, dy; dx = coordsa[i].x - coordsb[j].x; dy = coordsa[i].y - coordsb[j].y; } return (int)(sqrtf(dx * dx + dy * dy) + 0.5f);
24 Results 2-opt step processing time 10,000s Approximately 22 billion comparisons/s 1,000s (3960X) A distance(b,f) + distance(g,d) > distance(b,d) + distance(g,f) B A B 100s C C D D 10s E E F F 1s G G 0.1s 0.01s For a city problem 0.001s s pair need to be checked s kroe100 kroa200 pr439 pr2392 fnl4461 d15112 pla85900 ara tsp
25 Results Time needed for a single local optimization run GTX 680 Compared to Intel i7-3960x CPU Up to 300x speedup (single core, no SSE) Up to 40x speedup (Intel OpenCL) Table II 2-OPT - TIME NEEDED FOR A SINGLE RUN, CUDA, GPU: GTX680, CPU: I NTEL I X Problem GPU Host to Device GPU 2-opt Time to Initial Optimized (TSPLIB) kernel device to host total checks/s first Length Length time copy copy time (Millions) minimum full 2-opt time time 2-opt from MF berlin52 20 µs 50 µs 11 µs 81 µs µs kroe µs 50 µs 11 µs 82 µs µs ch µs 50 µs 11 µs 82 µs µs ch µs 50 µs 11 µs 84 µs µs kroa µs 50 µs 11 µs 85 µs ms ts µs 50 µs 11 µs 85µs ms pr µs 50 µs 11 µs 87 µs ms pr µs 50 µs 11 µs 93 µs ms rat µs 51 µs 11 µs 115 µs ms vm µs 51 µs 11 µs 142 µs ms pr µs 53 µs 11 µs 363 µs ms pcb µs 55 µs 11 µs 547 µs ms fl µs 54 µs 11 µs 788 µs ms fnl µs 58 µs 11 µs 815 µs ms rl µs 59 µs 11 µs 1079 µs ms pla µs 58 µs 11 µs 1616 µs ms usa µs 66 µs 11 µs 4805 µs s d µs 69 µs 11 µs 6043 µs s d µs 75 µs 11 µs 9014 µs s sw ms 85 µs 11 µs ms s pla ms 96 µs 11 µs ms s pla ms 205 µs 11 µs ms s sra ms 249 µs 11 µs ms s usa ms 287 µs 11 µs ms s ara s 740 µs 11 µs s s lra s 1774 µs 11 µs s m lrb s 2833 µs 11 µs s h
26 Results 1 Kepler GPUs 2 Kepler GPUs 4 Kepler GPUs 8 Kepler GPUs
27 I have implemented and published LOGO TSP Solver LOGO - LOcal GPU Optimization CUDA + OpenCL Linux/Mac: Source code, Windows: Binary with GUI GPL License + More resources about LOGO (slides, papers)
28 Summary Big TSP problems can be solved very quickly on GPU! (The fastest GPU TSP Solver?) Sometimes naive algorithms might run faster overall in parallel Large and very large problems typically can be decomposed into smaller ones and fit fast memory Avoid memory accesses (true for both CPUs and GPUs now) Memory is a scarce resource, FLOPs are abundant Exploit locality by manually reordering data (additional 0.1s of preprocessing may save 1s somewhere else)
High Performance GPU Accelerated Local Optimization in TSP
2013 IEEE 27th International Symposium on Parallel & Distributed Processing Workshops and PhD Forum High Performance GPU Accelerated Local Optimization in TSP Kamil Rocki, Reiji Suda The University of
More informationThe Traveling Salesman Problem: State of the Art
The Traveling Salesman Problem: State of the Art Thomas Stützle stuetzle@informatik.tu-darmstadt.de http://www.intellektik.informatik.tu-darmstadt.de/ tom. Darmstadt University of Technology Department
More informationa local optimum is encountered in such a way that further improvement steps become possible.
Dynamic Local Search I Key Idea: Modify the evaluation function whenever a local optimum is encountered in such a way that further improvement steps become possible. I Associate penalty weights (penalties)
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationunderlying iterative best improvement procedure based on tabu attributes. Heuristic Optimization
Tabu Search Key idea: Use aspects of search history (memory) to escape from local minima. Simple Tabu Search: I Associate tabu attributes with candidate solutions or solution components. I Forbid steps
More informationCSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591: GPU Programming Using CUDA in Practice Klaus Mueller Computer Science Department Stony Brook University Code examples from Shane Cook CUDA Programming Related to: score boarding load and store
More informationHEURISTICS optimization algorithms like 2-opt or 3-opt
Parallel 2-Opt Local Search on GPU Wen-Bao Qiao, Jean-Charles Créput Abstract To accelerate the solution for large scale traveling salesman problems (TSP), a parallel 2-opt local search algorithm with
More informationWarped parallel nearest neighbor searches using kd-trees
Warped parallel nearest neighbor searches using kd-trees Roman Sokolov, Andrei Tchouprakov D4D Technologies Kd-trees Binary space partitioning tree Used for nearest-neighbor search, range search Application:
More informationA NEW HEURISTIC ALGORITHM FOR MULTIPLE TRAVELING SALESMAN PROBLEM
TWMS J. App. Eng. Math. V.7, N.1, 2017, pp. 101-109 A NEW HEURISTIC ALGORITHM FOR MULTIPLE TRAVELING SALESMAN PROBLEM F. NURIYEVA 1, G. KIZILATES 2, Abstract. The Multiple Traveling Salesman Problem (mtsp)
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationPerformance potential for simulating spin models on GPU
Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational
More informationComplete Local Search with Memory
Complete Local Search with Memory Diptesh Ghosh Gerard Sierksma SOM-theme A Primary Processes within Firms Abstract Neighborhood search heuristics like local search and its variants are some of the most
More informationMonte Carlo Simplification Model for Traveling Salesman Problem
Appl. Math. Inf. Sci. 9, No. 2, 721-727 (215) 721 Applied Mathematics & Information Sciences An International Journal http://dx.doi.org/1.12785/amis/922 Monte Carlo Simplification Model for Traveling Salesman
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationDi Zhao Ohio State University MVAPICH User Group (MUG) Meeting, August , Columbus Ohio
Di Zhao zhao.1029@osu.edu Ohio State University MVAPICH User Group (MUG) Meeting, August 26-27 2013, Columbus Ohio Nvidia Kepler K20X Intel Xeon Phi 7120 Launch Date November 2012 Q2 2013 Processor Per-processor
More informationAccelerating image registration on GPUs
Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining
More informationACCELERATING THE ANT COLONY OPTIMIZATION
ACCELERATING THE ANT COLONY OPTIMIZATION BY SMART ANTS, USING GENETIC OPERATOR Hassan Ismkhan Department of Computer Engineering, University of Bonab, Bonab, East Azerbaijan, Iran H.Ismkhan@bonabu.ac.ir
More informationSLS Methods: An Overview
HEURSTC OPTMZATON SLS Methods: An Overview adapted from slides for SLS:FA, Chapter 2 Outline 1. Constructive Heuristics (Revisited) 2. terative mprovement (Revisited) 3. Simple SLS Methods 4. Hybrid SLS
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationHigh-Performance and Parallel Computing
9 High-Performance and Parallel Computing 9.1 Code optimization To use resources efficiently, the time saved through optimizing code has to be weighed against the human resources required to implement
More informationTwo new variants of Christofides heuristic for the Static TSP and a computational study of a nearest neighbor approach for the Dynamic TSP
Two new variants of Christofides heuristic for the Static TSP and a computational study of a nearest neighbor approach for the Dynamic TSP Orlis Christos Kartsiotis George Samaras Nikolaos Margaritis Konstantinos
More informationCS 179: GPU Programming. Lecture 7
CS 179: GPU Programming Lecture 7 Week 3 Goals: More involved GPU-accelerable algorithms Relevant hardware quirks CUDA libraries Outline GPU-accelerated: Reduction Prefix sum Stream compaction Sorting(quicksort)
More informationOutline. TABU search and Iterated Local Search classical OR methods. Traveling Salesman Problem (TSP) 2-opt
TABU search and Iterated Local Search classical OR methods Outline TSP optimization problem Tabu Search (TS) (most important) Iterated Local Search (ILS) tks@imm.dtu.dk Informatics and Mathematical Modeling
More informationTABU search and Iterated Local Search classical OR methods
TABU search and Iterated Local Search classical OR methods tks@imm.dtu.dk Informatics and Mathematical Modeling Technical University of Denmark 1 Outline TSP optimization problem Tabu Search (TS) (most
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationA Tabu Search Heuristic for the Generalized Traveling Salesman Problem
A Tabu Search Heuristic for the Generalized Traveling Salesman Problem Jacques Renaud 1,2 Frédéric Semet 3,4 1. Université Laval 2. Centre de Recherche sur les Technologies de l Organisation Réseau 3.
More informationSparse Training Data Tutorial of Parameter Server
Carnegie Mellon University Sparse Training Data Tutorial of Parameter Server Mu Li! CSD@CMU & IDL@Baidu! muli@cs.cmu.edu High-dimensional data are sparse Why high dimension?! make the classifier s job
More informationFrom Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133)
From Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133) Dr Paul Richmond Research Fellow University of Sheffield (NVIDIA CUDA Research Centre) Overview Complex
More informationJCudaMP: OpenMP/Java on CUDA
JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationNVIDIA s Compute Unified Device Architecture (CUDA)
NVIDIA s Compute Unified Device Architecture (CUDA) Mike Bailey mjb@cs.oregonstate.edu Reaching the Promised Land NVIDIA GPUs CUDA Knights Corner Speed Intel CPUs General Programmability 1 History of GPU
More informationNVIDIA s Compute Unified Device Architecture (CUDA)
NVIDIA s Compute Unified Device Architecture (CUDA) Mike Bailey mjb@cs.oregonstate.edu Reaching the Promised Land NVIDIA GPUs CUDA Knights Corner Speed Intel CPUs General Programmability History of GPU
More informationParallel Systems. Project topics
Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationGPU Acceleration for Seismic Interpretation Algorithms
GPU Acceleration for Seismic Interpretation Algorithms Jon Marbach, Ph.D. Pate Motter TerraSpark Geosciences 2012 TerraSpark Geosciences, LLC. All Rights Reserved. Talk overview Who am I? (~1min) Who is
More informationAccelerating local search algorithms for the travelling salesman problem through the effective use of GPU
Available online at www.sciencedirect.com ScienceDirect Transportation Research Procedia 22 (2017) 409 418 www.elsevier.com/locate/procedia 19th EURO Working Group on Transportation Meeting, EWGT2016,
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationOpenCL implementation of PSO: a comparison between multi-core CPU and GPU performances
OpenCL implementation of PSO: a comparison between multi-core CPU and GPU performances Stefano Cagnoni 1, Alessandro Bacchini 1,2, Luca Mussi 1 1 Dept. of Information Engineering, University of Parma,
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationScalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009
Scalable Multi Agent Simulation on the GPU Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Reasoning Explicit State machine, serial Implicit Compute intensive Fits SIMT well Collision avoidance Motivation
More informationL20: Putting it together: Tree Search (Ch. 6)!
Administrative L20: Putting it together: Tree Search (Ch. 6)! November 29, 2011! Next homework, CUDA, MPI (Ch. 3) and Apps (Ch. 6) - Goal is to prepare you for final - We ll discuss it in class on Thursday
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationLarge Scale Parallel Monte Carlo Tree Search on GPU
Large Scale Parallel Monte Carlo Tree Search on Kamil Rocki The University of Tokyo Graduate School of Information Science and Technology Department of Computer Science 1 Tree search Finding a solution
More informationParallelism. CS6787 Lecture 8 Fall 2017
Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does
More informationUsing GPUs to compute the multilevel summation of electrostatic forces
Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationCUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD
More informationAdministrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve
Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard
More informationA Memetic Algorithm for the Generalized Traveling Salesman Problem
A Memetic Algorithm for the Generalized Traveling Salesman Problem Gregory Gutin Daniel Karapetyan Abstract The generalized traveling salesman problem (GTSP) is an extension of the well-known traveling
More informationCandidate set parallelization strategies for Ant Colony Optimization on the GPU
Candidate set parallelization strategies for Ant Colony Optimization on the GPU Laurence Dawson and Iain A. Stewart School of Engineering and Computing Sciences, Durham University, Durham, United Kingdom,
More informationROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015
ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationA Parallel Architecture for the Generalized Traveling Salesman Problem
A Parallel Architecture for the Generalized Traveling Salesman Problem Max Scharrenbroich AMSC 663 Project Proposal Advisor: Dr. Bruce L. Golden R. H. Smith School of Business 1 Background and Introduction
More informationParallel Implementation of Travelling Salesman Problem using Ant Colony Optimization
Parallel Implementation of Travelling Salesman Problem using Ant Colony Optimization Gaurav Bhardwaj Department of Computer Science and Engineering Maulana Azad National Institute of Technology Bhopal,
More informationHYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE
HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S
More informationCS 179 Lecture 4. GPU Compute Architecture
CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at
More informationEvaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi
Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi National Center for Supercomputing Applications University of Illinois at Urbana-Champaign
More informationParallelization Strategies for Local Search Algorithms on Graphics Processing Units
Parallelization Strategies for Local Search Algorithms on Graphics Processing Units Audrey Delévacq, Pierre Delisle, and Michaël Krajecki CReSTIC, Université de Reims Champagne-Ardenne, Reims, France Abstract
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationOptimizing Data Locality for Iterative Matrix Solvers on CUDA
Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,
More informationIntroduction to Algorithms: Brute-Force Algorithms
Introduction to Algorithms: Brute-Force Algorithms Introduction to Algorithms Brute Force Powering a Number Selection Sort Exhaustive Search 0/1 Knapsack Problem Assignment Problem CS 421 - Analysis of
More informationFMM implementation on CPU and GPU. Nail A. Gumerov (Lecture for CMSC 828E)
FMM implementation on CPU and GPU Nail A. Gumerov (Lecture for CMSC 828E) Outline Two parts of the FMM Data Structure Flow Chart of the Run Algorithm FMM Cost/Optimization on CPU Programming on GPU Fast
More informationLocality-Aware Software Throttling for Sparse Matrix Operation on GPUs
Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Yanhao Chen 1, Ari B. Hayes 1, Chi Zhang 2, Timothy Salmon 1, Eddy Z. Zhang 1 1. Rutgers University 2. University of Pittsburgh Sparse
More informationData parallel algorithms, algorithmic building blocks, precision vs. accuracy
Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel
More informationTowards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers
Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More informationEfficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More informationEfficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular
More informationMulti Agent Navigation on GPU. Avi Bleiweiss
Multi Agent Navigation on GPU Avi Bleiweiss Reasoning Explicit Implicit Script, storytelling State machine, serial Compute intensive Fits SIMT architecture well Navigation planning Collision avoidance
More informationAccelerated Machine Learning Algorithms in Python
Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals
More informationCMSC 858M/AMSC 698R. Fast Multipole Methods. Nail A. Gumerov & Ramani Duraiswami. Lecture 20. Outline
CMSC 858M/AMSC 698R Fast Multipole Methods Nail A. Gumerov & Ramani Duraiswami Lecture 20 Outline Two parts of the FMM Data Structures FMM Cost/Optimization on CPU Fine Grain Parallelization for Multicore
More informationDesign and Analysis of Algorithms CS404/504. Razvan Bunescu School of EECS.
Welcome Design and Analysis of Algorithms Razvan Bunescu School of EECS bunescu@ohio.edu 1 Course Description Course Description: This course provides an introduction to the modern study of computer algorithms.
More informationFundamentals of Computer Science II
Fundamentals of Computer Science II Keith Vertanen Museum 102 kvertanen@mtech.edu http://katie.mtech.edu/classes/csci136 CSCI 136: Fundamentals of Computer Science II Keith Vertanen Textbook Resources
More informationCache Performance II 1
Cache Performance II 1 cache operation (associative) 111001 index offset valid tag valid tag data data 1 10 1 00 00 11 AA BB tag 1 11 1 01 B4 B5 33 44 = data (B5) AND = AND OR is hit? (1) 2 cache operation
More informationA memetic algorithm for symmetric traveling salesman problem
ISSN 1750-9653, England, UK International Journal of Management Science and Engineering Management Vol. 3 (2008) No. 4, pp. 275-283 A memetic algorithm for symmetric traveling salesman problem Keivan Ghoseiri
More informationAN 831: Intel FPGA SDK for OpenCL
AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1
More informationRESEARCH ARTICLE. Accelerating Ant Colony Optimization for the Traveling Salesman Problem on the GPU
The International Journal of Parallel, Emergent and Distributed Systems Vol. 00, No. 00, Month 2011, 1 21 RESEARCH ARTICLE Accelerating Ant Colony Optimization for the Traveling Salesman Problem on the
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION
April 4-7, 2016 Silicon Valley CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION CHRISTOPH ANGERER, NVIDIA JAKOB PROGSCH, NVIDIA 1 WHAT YOU WILL LEARN An iterative method to optimize your GPU
More informationAccelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors
Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte
More informationParallel Implementation of the Max_Min Ant System for the Travelling Salesman Problem on GPU
Parallel Implementation of the Max_Min Ant System for the Travelling Salesman Problem on GPU Gaurav Bhardwaj Department of Computer Science and Engineering Maulana Azad National Institute of Technology
More informationHigh-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs
High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationNote: In physical process (e.g., annealing of metals), perfect ground states are achieved by very slow lowering of temperature.
Simulated Annealing Key idea: Vary temperature parameter, i.e., probability of accepting worsening moves, in Probabilistic Iterative Improvement according to annealing schedule (aka cooling schedule).
More informationAccelerating high-end compositing with CUDA in NUKE. Jon Wadelton NUKE Product Manager
Accelerating high-end compositing with CUDA in NUKE Jon Wadelton NUKE Product Manager 2 Overview What is NUKE? Image processing - exploiting the GPU The Foundry Approach Simple examples in NUKEX Real world
More informationTowards a Performance- Portable FFT Library for Heterogeneous Computing
Towards a Performance- Portable FFT Library for Heterogeneous Computing Carlo C. del Mundo*, Wu- chun Feng* *Dept. of ECE, Dept. of CS Virginia Tech Slides Updated: 5/19/2014 Forecast (Problem) AMD Radeon
More informationExploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement
Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement Tim Davis (Texas A&M University) with Sanjay Ranka, Mohamed Gadou (University of Florida) Nuri Yeralan (Microsoft) NVIDIA
More informationA TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE
A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationA Parallel GPU Version of the Traveling Salesman Problem
A Parallel GPU Version of the Traveling Salesman Problem Molly A. O Neil, Dan Tamir, and Martin Burtscher Department of Computer Science, Texas State University, San Marcos, TX Abstract - This paper describes
More informationAnt Colony Optimization Parallel Algorithm for GPU
Ant Colony Optimization Parallel Algorithm for GPU Honours Project - COMP 4905 Carleton University Karim Tantawy 100710608 Supervisor: Dr. Tony White, School of Computer Science April 10 th 2011 Abstract
More informationGraph Applications, Class Notes, CS 3137 1 Traveling Salesperson Problem Web References: http://www.tsp.gatech.edu/index.html http://www-e.uni-magdeburg.de/mertens/tsp/tsp.html TSP applets A Hamiltonian
More informationOptimisation Myths and Facts as Seen in Statistical Physics
Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY
More informationTravelling salesman problem using reduced algorithmic Branch and bound approach P. Ranjana Hindustan Institute of Technology and Science
Volume 118 No. 20 2018, 419-424 ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Travelling salesman problem using reduced algorithmic Branch and bound approach P. Ranjana Hindustan
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationAccelerating Dynamic Binary Translation with GPUs
Accelerating Dynamic Binary Translation with GPUs Chung Hwan Kim, Srikanth Manikarnike, Vaibhav Sharma, Eric Eide, Robert Ricci School of Computing, University of Utah {chunghwn,smanikar,vaibhavs,eeide,ricci}@utah.edu
More information