PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters
1 PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters. IEEE CLUSTER 2015, Chicago, IL, USA. Luís Sant'Ana 1, Daniel Cordeiro 2, Raphael Camargo 1. 1 Federal University of ABC, Brazil; 2 University of São Paulo, Brazil. September 9, 2015. Raphael Camargo (UFABC): Profile-based Load-Balancing, September 9, 2015 (30 slides)
2 Outline 1 Introduction 2 Proposed Algorithm 3 Implementation 4 Experimental Results
3 GPU Clusters. The use of GPUs (Graphics Processing Units) is popular for HPC applications. GPUs are composed of thousands of simple cores and are a cost-effective solution for data-parallel and other highly parallelizable applications. For increased parallelism, multiple GPUs can be allocated in GPU clusters. In homogeneous clusters, tasks and/or data can be divided equally among the GPUs.
4 Heterogeneous Clusters. Homogeneity is common on supercomputers and custom HPC clusters, but for clusters built from commodity machines it is hard to maintain: a new generation of hardware is launched every couple of years, and researchers want to keep using all existing machines to increase the availability of resources. This is a common scenario in university labs. These machines also have powerful multi-core CPUs.
5 Load Balancing. (Figure: original data is partitioned and distributed among machines.) The problem with heterogeneous machines is how to distribute the computation among the GPUs so that no machine remains idle. A load-balancing mechanism for all kinds of applications is not feasible, so we focus on data-parallel applications. Data can be divided among the GPUs using a domain decomposition model, with each CPU/GPU responsible for part of the data.
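The domain decomposition idea can be sketched as a proportional split of a 1-D index range. This is only an illustration: the `partition` function and the speed weights are hypothetical, not part of the paper.

```python
# Sketch of domain decomposition: split a 1-D workload into contiguous
# blocks whose sizes are proportional to each processing unit's relative
# speed (function name and weights are illustrative, not from the paper).

def partition(n_items, weights):
    """Return (start, end) index pairs, one per processing unit."""
    total = sum(weights)
    bounds, start = [], 0
    for i, w in enumerate(weights):
        # the last unit absorbs any rounding leftovers
        size = n_items - start if i == len(weights) - 1 else round(n_items * w / total)
        bounds.append((start, start + size))
        start += size
    return bounds

# e.g. one fast GPU, one slower GPU, one CPU
print(partition(1000, [4.0, 2.0, 1.0]))  # [(0, 571), (571, 857), (857, 1000)]
```

The faster a unit, the larger its contiguous slice; the blocks tile the full index range with no gaps.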
6 Load Balancing in Heterogeneous Clusters. A division of the load based on simple heuristics, such as the number of cores in the GPU, may be ineffective. One solution is to use simple algorithms for task dispatching, such as greedy and work stealing: they are simple to implement and fast to execute, but may result in suboptimal distributions. More elaborate load-balancing algorithms cause a higher overhead, but compensate with a better task distribution. Performance profiles of tasks on each GPU type, obtained from execution measurements, are used to determine the amount of work given to each GPU.
7 Related Work. StarPU provides some general scheduling algorithms: greedy, work stealing, HEFT, etc. Another approach (2012) iteratively searches for a good distribution of work among the available GPUs: at the end of each synchronized iteration, it checks the relative finish times to define the RP (relative power) of each processor and, if the difference is above a threshold, performs a rebalancing. Problem: slow convergence. Belviranli et al. (2013) proposed the Heterogeneous Dynamic Self-Scheduler (HDSS). Adaptive phase: define a weight for each processing unit by fitting a logarithmic curve to performance measurement results. Completion phase: divide the remaining data among the GPUs based on their relative weights.
8 Outline 1 Introduction 2 Proposed Algorithm 3 Implementation 4 Experimental Results
9 Overview. Domain decomposition in data-parallel applications: the data is divided into smaller blocks, each block is processed independently in parallel, and the results are merged at the end. The task of the load-balancing algorithm is to determine the data block size for each GPU and CPU, in three phases: (i) processing unit performance modeling; (ii) block size selection; (iii) execution and rebalancing.
10 Performance Modeling. (Figure: measured execution time versus block size (KB), with fitted curves, for Black-Scholes on (a) CPU and (b) GPU.) Devise a performance model for each processing unit based on execution measurements.
11 Performance Modeling. (Figure: relative block sizes for 4 processing units.) Devise a performance model for each processing unit based on execution measurements. First step: send a block of size size_0 to each unit. Next steps: double the block size for the fastest unit; the other units receive blocks of size proportional to their speed, so that all units complete execution at approximately the same time. After four steps, the measurements cover a large range of block sizes.
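The doubling scheme above can be sketched as follows. This is a hedged illustration: `measure` stands in for a real timed execution, and the items-per-second speed heuristic is an assumption, not the paper's exact procedure.

```python
# Sketch of the profiling phase: start every unit with block size size0,
# then repeatedly double the block given to the fastest unit and scale the
# others by their measured relative speed, so finish times stay aligned.

def profile(measure, n_units, size0, steps=4):
    sizes = [size0] * n_units
    samples = [[] for _ in range(n_units)]
    for _ in range(steps):
        times = [measure(u, sizes[u]) for u in range(n_units)]
        for u in range(n_units):
            samples[u].append((sizes[u], times[u]))
        speeds = [sizes[u] / times[u] for u in range(n_units)]  # items per second
        fastest = max(range(n_units), key=speeds.__getitem__)
        sizes[fastest] *= 2
        for u in range(n_units):
            if u != fastest:
                # proportional to relative speed, so all units finish together
                sizes[u] = max(1, int(sizes[fastest] * speeds[u] / speeds[fastest]))
    return samples  # per-unit (block size, time) pairs, the input to curve fitting
```

With a unit twice as fast as the other, the fast unit's sampled sizes double each step (100, 200, 400, 800), giving the wide size range the model fit needs.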
12 Performance Modeling. After four steps, fit a curve that models the execution behavior of the task on each processing unit. Use least squares to find the best fit of F_p[x] = a_1·f_1(x) + a_2·f_2(x) + ... + a_n·f_n(x), where each f_i(x) is one function from the set {ln x, x, x², x³, eˣ, √x} together with the combinations x·eˣ and x·ln x. To model the time spent transmitting the data, we use G_p[x] = a_1·x + a_2. The total execution time is given by E_p[x] = F_p[x] + G_p[x].
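For illustration, such a least-squares fit can be done with NumPy. The three-function basis (constant, linear, quadratic) and the synthetic timings below are assumptions for the example, not the paper's full function set.

```python
import numpy as np

# Least-squares fit of F_p[x] = a1*f1(x) + ... + an*fn(x) over a small
# illustrative basis; the timings are synthetic, generated from a linear model.

def fit(xs, ts, basis):
    # design matrix: one column per basis function, one row per sample
    A = np.column_stack([[f(x) for x in xs] for f in basis])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(ts), rcond=None)
    return lambda x: sum(a * f(x) for a, f in zip(coeffs, basis))

basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
xs = [1.0, 2.0, 4.0, 8.0]              # profiled block sizes (millions of items)
ts = [0.01 + 0.02 * x for x in xs]     # synthetic timings from a linear model
F = fit(xs, ts, basis)                 # the fit recovers the model: F(4.0) ≈ 0.09
```

The returned closure is the fitted model E_p-style predictor: it can be evaluated at block sizes that were never profiled.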
13 Examples of Curve Fitting. (Figure: measured execution time versus block size (KB), with fitted curves: (c) Black-Scholes on CPU, linear fit f(x) = 1.49e-8·x + 0.01; (d) Black-Scholes on GPU, quadratic fit; (e) matrix multiplication on CPU, quadratic fit; (f) matrix multiplication on GPU, quadratic fit.)
14 Load-balancing Algorithm. (Diagram: from the per-unit performance models, the scheduler finds the block sizes X_1, X_2, ..., X_n.)
15 Block Size Selection. Determine the set X of block sizes x_g for each processor: X = {x_g ∈ [0, 1] : Σ_{g=1}^{n} x_g = 1}, with the same execution time E_k(x_k) on each processor k: E_1(x_1) = E_2(x_2) = ... = E_n(x_n). To find X, we solve the following system of equations: E_1(x_1) = F_1(x_1) + G_1(x_1); E_2(x_2) = F_2(x_2) + G_2(x_2); ...; E_n(x_n) = F_n(x_n) + G_n(x_n). (1) We apply an interior-point line search filter method.
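A minimal sketch of this selection without an NLP solver: since each E_k(x) increases with x, one can bisect on the common completion time T, numerically invert each model, and pick T so the fractions sum to 1. The paper uses an interior-point line search filter method; this is a simpler stand-in under that monotonicity assumption.

```python
# Solve E_1(x_1) = ... = E_n(x_n) subject to sum(x_g) = 1, assuming each
# model E is increasing on [0, 1] (a simplified stand-in for the paper's
# interior-point method).

def solve_sizes(models, iters=60):
    def invert(E, T):                        # largest x in [0, 1] with E(x) <= T
        if E(1.0) <= T:
            return 1.0
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2.0
            lo, hi = (mid, hi) if E(mid) <= T else (lo, mid)
        return lo

    lo, hi = 0.0, max(E(1.0) for E in models)
    for _ in range(iters):                   # bisect on the shared finish time T
        T = (lo + hi) / 2.0
        if sum(invert(E, T) for E in models) >= 1.0:
            hi = T
        else:
            lo = T
    xs = [invert(E, hi) for E in models]
    total = sum(xs)
    return [x / total for x in xs]           # normalize so the fractions sum to 1

# two units, the first twice as slow per item: it should get 1/3 of the data
sizes = solve_sizes([lambda x: 2.0 * x, lambda x: 1.0 * x])
```

With the linear models above the exact answer is x = (1/3, 2/3), and the bisection recovers it to high precision.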
16 Execution Phase. (Diagram: the scheduler finds the block sizes X_1, X_2, ..., X_n and dispatches blocks of those sizes to the processing units.)
17 Execution Phase. The scheduler sends a block of the selected size x_g to each processing unit g. x_g is a floating-point number, which is rounded to the closest valid block size. When a processing unit finishes executing a task, it requests another task of the same size. The processing units continue until all tasks are completed.
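The request loop can be simulated as follows. This is only a sketch: the sequential round-robin stands in for units asynchronously requesting work as they finish, and `dispatch` is an illustrative name, not part of the implementation.

```python
# Sketch of the execution phase: each unit repeatedly takes a block of its
# assigned size from the remaining work until everything has been dispatched.

def dispatch(total_items, block_sizes):
    remaining, log = total_items, []
    while remaining > 0:
        for unit, size in enumerate(block_sizes):
            if remaining == 0:
                break
            take = min(size, remaining)      # the last block may be truncated
            log.append((unit, take))
            remaining -= take
    return log

print(dispatch(10, [4, 3]))  # [(0, 4), (1, 3), (0, 3)]
```

Every unit keeps its profiled block size for the whole phase; only the final block is truncated to fit the remaining work.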
18 Rebalancing. (Diagram: local computation interleaved with rebalancing when a threshold is exceeded.) The scheduler also monitors the finish time of each task. If the difference in finishing times |t_i − t_j| exceeds a threshold for some i, j, it recalculates the block sizes, including the newly generated measurement points.
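A minimal version of this trigger, assuming a threshold relative to the mean finish time (the exact threshold policy is an assumption for the example):

```python
# Rebalancing trigger sketch: fire when the spread between the earliest and
# latest finish times of a round exceeds a fraction of the mean finish time.

def needs_rebalance(finish_times, threshold=0.1):
    spread = max(finish_times) - min(finish_times)
    mean = sum(finish_times) / len(finish_times)
    return spread > threshold * mean

print(needs_rebalance([1.00, 1.02, 1.05]))  # small spread: False
print(needs_rebalance([1.00, 1.60, 2.10]))  # large spread: True
```

When the trigger fires, the newly measured (block size, time) points are added to the profiles and the block-size selection step is re-run.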
19 Load-balancing Algorithm. (Diagram: the scheduler finds the block sizes X_1, X_2, ..., X_n, executes, and upon rebalancing finds new block sizes X_1, X_2, ..., X_n.)
20 Outline 1 Introduction 2 Proposed Algorithm 3 Implementation 4 Experimental Results
21 StarPU. StarPU is a task-based programming library for hybrid architectures. Its runtime layer manages the execution of tasks and the data transfers between processing units. It supports CPU, GPU, and Xeon Phi implementations using codelets, and offers an API that allows the implementation of new scheduling policies; the default scheduling strategy is the greedy one. PLB-HeC was implemented over the StarPU framework. We also implemented two algorithms for comparison: a more complex one and a simpler one.
22 Applications. We ported three applications to the StarPU framework, with GPU and CPU implementations using CUDA and OpenMP. Matrix multiplication: we used the optimized version from the CUBLAS 4.0 library; computational complexity O(n³) for an n×n matrix. Gene Regulatory Network (GRN) inference: exhaustive search for the gene subset with a given cardinality k that best predicts a target gene; computational complexity O(n^k), where n is the number of genes. Black-Scholes financial analysis: estimates the future values of options using a stochastic differential equation; computational complexity O(n), where n is the number of options.
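As an illustration of why Black-Scholes is data-parallel, each option is priced independently by a closed-form kernel. The scalar Python version below only mirrors the per-element work of the CUDA/OpenMP kernels; the parameters are example values, not from the experiments.

```python
import math

# Illustrative Black-Scholes European call pricing: the per-option kernel
# that makes the application embarrassingly data-parallel.

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

# O(n): each option is priced independently, so the options array can be
# split into blocks of any size across CPUs and GPUs.
prices = [bs_call(S, 100.0, 1.0, 0.05, 0.2) for S in (90.0, 100.0, 110.0)]
# at the money (S = K = 100), the price is ≈ 10.45
```

Because there is no coupling between options, the block-size selection above applies directly: any contiguous slice of the options array is a valid task.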
23 Outline 1 Introduction 2 Proposed Algorithm 3 Implementation 4 Experimental Results
24 Machine Configurations
Table: machine configurations
Machine A: Xeon E5-2690V CPU, 25 MB cache, 256 GB RAM; Tesla K20c GPU, 2496 cores / 13 SMs, 205 GB/s, 6 GB
Machine B: Intel i7 CPU, 8 MB cache, 8 GB RAM; GTX GPU, 2 × 240 cores / 30 SMs, 896 MB
Machine C: Intel i7 4930K CPU, 12 MB cache, 32 GB RAM; GTX GPU, 1536 cores / 8 SMs, 2 GB
Machine D: Intel i7 3930K CPU, 12 MB cache, 32 GB RAM; GTX Titan GPU, 2688 cores / 14 SMs, 6 GB
Setups: 1 machine [A]; 2 machines [A, B]; 3 machines [A, B, C]; 4 machines [A, B, C, D]
25 Execution Time and Speedup: Matrix Multiplication. (Figure: execution time and speedup, compared to the greedy algorithm, for the matrix multiplication application, using different numbers of machines (1 to 4) and input sizes.)
26 Execution Time and Speedup: GRN Inference. (Figure: execution time and speedup, compared to the greedy algorithm, for the Gene Regulatory Network (GRN) inference application, using different numbers of machines (1 to 4) and input sizes.)
27 Block Size Distribution. (Figure: block size distribution among the processing units (CPU and GPU) of the four machines, for the matrix multiplication and GRN applications, using two different input sizes for each.)
28 Processing Unit Idle Times. (Figure: processing unit idle time relative to total execution time, for the matrix multiplication and GRN applications, using two different input sizes for each.)
29 Conclusions. PLB-HeC: dynamic load balancing for heterogeneous clusters, combining profile-based online performance modeling with precise block size selection by solving a non-linear system of equations. It improved execution times, especially for more heterogeneous clusters and larger problem sizes. Future work: shared clusters or clouds, where the rebalancing mechanism would compensate for quality-of-service changes during execution; and fault tolerance, where execution could continue with a new block distribution derived from the performance models.
30 Questions? Work financed by:
More informationA PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers Maxime Martinasso, Grzegorz Kwasniewski, Sadaf R. Alam, Thomas C. Schulthess, Torsten Hoefler Swiss National Supercomputing
More informationDebugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc.
Debugging CUDA Applications with Allinea DDT Ian Lumb Sr. Systems Engineer, Allinea Software Inc. ilumb@allinea.com GTC 2013, San Jose, March 20, 2013 Embracing GPUs GPUs a rival to traditional processors
More informationHigh Performance Computing and GPU Programming
High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz
More informationA Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms
A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms Shuoxin Lin, Yanzhou Liu, William Plishker, Shuvra Bhattacharyya Maryland DSPCAD Research Group Department of
More informationIntroduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29
Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationA MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS. A Thesis. presented to. the Faculty of California Polytechnic State University
A MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS A Thesis presented to the Faculty of California Polytechnic State University San Luis Obispo In Partial Fulfillment of the Requirements
More informationHPC future trends from a science perspective
HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively
More informationScheduling in Heterogeneous Computing Environments for Proximity Queries
1 Scheduling in Heterogeneous Computing Environments for Proximity Queries Duksu Kim, Member, IEEE, Jinkyu Lee, Member, IEEE, Junghwan Lee, Member, IEEE, Insik Shin, Member, IEEE, John Kim, Member, IEEE,
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationExploiting Task-Parallelism on GPU Clusters via OmpSs and rcuda Virtualization
Exploiting Task-Parallelism on Clusters via Adrián Castelló, Rafael Mayo, Judit Planas, Enrique S. Quintana-Ortí RePara 2015, August Helsinki, Finland Exploiting Task-Parallelism on Clusters via Power/energy/utilization
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationPostprint. This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden.
http://www.diva-portal.org Postprint This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden. Citation for the original published paper: Ceballos, G., Black-Schaffer,
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationRed Fox: An Execution Environment for Relational Query Processing on GPUs
Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia
More informationAdaptive Power Profiling for Many-Core HPC Architectures
Adaptive Power Profiling for Many-Core HPC Architectures Jaimie Kelley, Christopher Stewart The Ohio State University Devesh Tiwari, Saurabh Gupta Oak Ridge National Laboratory State-of-the-Art Schedulers
More informationHigh Performance Computing with Accelerators
High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing
More informationPORTING PARALLEL APPLICATIONS TO HETEROGENEOUS SUPERCOMPUTERS: LIBRARIES AND TOOLS CAN MAKE IT TRANSPARENT
PORTING PARALLEL APPLICATIONS TO HETEROGENEOUS SUPERCOMPUTERS: LIBRARIES AND TOOLS CAN MAKE IT TRANSPARENT Jean-Yves VET, DDN Storage Patrick CARRIBAULT, CEA Albert COHEN, INRIA CEA, DAM, DIF, F-91297
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationAutomatic Intra-Application Load Balancing for Heterogeneous Systems
Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena
More informationParallel Systems. Project topics
Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationCMAQ PARALLEL PERFORMANCE WITH MPI AND OPENMP**
CMAQ 5.2.1 PARALLEL PERFORMANCE WITH MPI AND OPENMP** George Delic* HiPERiSM Consulting, LLC, P.O. Box 569, Chapel Hill, NC 27514, USA 1. INTRODUCTION This presentation reports on implementation of the
More informationEfficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular
More informationChapter 7. Multicores, Multiprocessors, and Clusters. Goal: connecting multiple computers to get higher performance
Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationHigh Performance Computing. Introduction to Parallel Computing
High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials
More informationExperiences Using Tegra K1 and X1 for Highly Energy Efficient Computing
Experiences Using Tegra K1 and X1 for Highly Energy Efficient Computing Gaurav Mitra Andrew Haigh Luke Angove Anish Varghese Eric McCreath Alistair P. Rendell Research School of Computer Science Australian
More informationVIAF: Verification-based Integrity Assurance Framework for MapReduce. YongzhiWang, JinpengWei
VIAF: Verification-based Integrity Assurance Framework for MapReduce YongzhiWang, JinpengWei MapReduce in Brief Satisfying the demand for large scale data processing It is a parallel programming model
More information