Predictive Runtime Code Scheduling for Heterogeneous Architectures

1 Predictive Runtime Code Scheduling for Heterogeneous Architectures
Víctor Jiménez, Lluís Vilanova, Isaac Gelado, Marisa Gil, Grigori Fursin, Nacho Navarro
HiPEAC 2009, January 26th

2 Outline
- Motivation
- Multiversioning for homogeneous architectures
- Heterogeneous architectures (CPU + GPU)
- Predictive code scheduling
- Experiments and results
- Conclusions and future work

3 Multiversioning for Homogeneous Architectures
[Figure: func is compiled into several versions (func1, func2, func3); the runtime selects among them]
Code multiversioning has proven useful for homogeneous architectures:
- Dynamic adaptation to system workload changes
- Best performance across different microarchitectures
- Reduced search time when evaluating program optimizations

4 Multiversioning for Heterogeneous Architectures
- Heterogeneous architectures add a new dimension
  - Processing elements (PEs) with different characteristics
  - Applications behave differently on each PE
  - Different ISAs may coexist
- Opportunity for dynamic code scheduling
  - Adaptability: allows optimizing system parameters (performance, power, etc.)
  - Requires system profiling information
  - Benefits: adaptive applications and libraries

5 Current Limitations
- Traditional view for heterogeneous architectures
  - Accelerators (intensive computation): GPU
  - Control processor (accelerator management): CPU
  - Waste of processing power
  - Benefit from dynamic scheduling using all the PEs. Example: IBM Roadrunner (~7k Opteron, ~13k Cell)
- GPUs are still running a single application
  - Traditional usage for games and visualization
  - Open to new applications thanks to CUDA
  - Multiprogramming becomes a need

6 Objectives
- Improve system throughput using dynamic code scheduling on a CPU/GPU system
  - Fully use all its computation power (CPU and GPU)
- Allow multiprogramming in a CPU/GPU architecture
- Study scheduling algorithms for such a system

7 Dynamic Predictive Scheduling
- Why do we need dynamic predictive scheduling?
  - The best-fitting PE depends on the code, the input, and the current system workload
- Example (matrix multiplication on a single CPU/GPU):
[Figure: Matrix multiplication]

8 Scheduler Design
- Framework for a multiprogrammed environment
- User-level
- Shared library
- Simplified development
[Figure: two processes on top of the scheduling runtime, which sits above the OS and the HW]

9 Scheduler Design
- Granularity
  - Function-level
- Function versioning
  - Orthogonal to the scheduling problem
- Task queue
- History table
- Explicit data transfer
[Figure: processes holding funccpu and funcgpu versions call into the scheduling runtime, which keeps a task queue (T1, T2) and a history table (func, time: f1, t1; f2, t2) above the OS and the HW (CPU, GPU)]

10 Usage Example
1. Obtain both the CPU and the GPU versions from a current implementation (versioning: CUDA, MCUDA)

    void matmul(float* A, float* B, float* C, int N);

    void matmul_cpu(float* A, float* B, float* C, int N);
    void matmul_gpu(float* A, float* B, float* C, int N);

2. Replace calls to that function with a request to the scheduler (+ wrapper)

    Original code:

    matmul(A, B, C, N);

    Resulting code:

    CallScheduler* cs = CallScheduler::getInstance();
    MatmulFunc* f = new MatmulFunc;
    MatmulArgs* a = new MatmulArgs(A, B, C, N);
    cs->schedule(f, a);

    Compiler-generated wrapper:

    class MatmulFunc : public Func {
    public:
        // Called by the scheduler with the PE it selected for this task.
        void run(PE::Type petype, Args* args) {
            MatmulArgs* a = (MatmulArgs*) args;
            switch (petype) {
                case PE::CPU: matmul_cpu(a->a, a->b, a->c, a->n); break;
                case PE::GPU: matmul_gpu(a->a, a->b, a->c, a->n); break;
            }
        }
    };
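The wrapper pattern on this slide can be tried in isolation with a minimal sketch like the one below. The slides do not show the runtime itself, so `PE`, `Args`, `Func`, `MatmulArgs` and `CallScheduler` here are hypothetical stand-ins, `matmul_gpu` is a host-side placeholder rather than a CUDA kernel, and the scheduler simply runs the task synchronously on a fixed `target` PE (an assumption made only to keep the example self-contained).

```cpp
#include <cassert>

// Hypothetical stand-ins for the runtime types named on the slide.
struct PE { enum Type { CPU, GPU }; };

struct Args { virtual ~Args() = default; };

struct Func {
    virtual ~Func() = default;
    virtual void run(PE::Type petype, Args* args) = 0;
};

// Host-only reference kernel: C = A * B for row-major N x N matrices.
void matmul_cpu(const float* A, const float* B, float* C, int N) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k) sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

// Placeholder: a real GPU version would transfer data and launch a CUDA kernel.
void matmul_gpu(const float* A, const float* B, float* C, int N) {
    matmul_cpu(A, B, C, N);
}

// Bundles the call arguments so the scheduler can pass them opaquely.
struct MatmulArgs : Args {
    MatmulArgs(const float* a_, const float* b_, float* c_, int n_)
        : a(a_), b(b_), c(c_), n(n_) { assert(n_ > 0); }
    const float* a;
    const float* b;
    float* c;
    int n;
};

// Wrapper that dispatches to the version matching the selected PE.
class MatmulFunc : public Func {
public:
    void run(PE::Type petype, Args* args) override {
        MatmulArgs* a = static_cast<MatmulArgs*>(args);
        switch (petype) {
            case PE::CPU: matmul_cpu(a->a, a->b, a->c, a->n); break;
            case PE::GPU: matmul_gpu(a->a, a->b, a->c, a->n); break;
        }
    }
};

// Singleton scheduler with a trivial, assumed policy: always use `target`.
class CallScheduler {
public:
    static CallScheduler* getInstance() {
        static CallScheduler instance;
        return &instance;
    }
    void schedule(Func* f, Args* a) { f->run(target, a); }
    PE::Type target = PE::GPU;
private:
    CallScheduler() = default;
};
```

The application only ever calls `schedule`, so the policy inside `CallScheduler` can be swapped without touching call sites, which is what makes function versioning orthogonal to the scheduling problem.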

11 Scheduling Algorithms
1. First-free algorithms
   - Select the first free PE in the system
2. Predictive algorithms
   - CPU/GPU performance prediction
   - Waiting time prediction (load balancing)

12 First-free Algorithms
1. First-free FIFO GPU (ff-fifo-gpu)
   - Select the first free PE (if all are busy, select the GPU)
2. First-free FIFO round-robin (ff-fifo-rrk:1,k)
   - Select one PE k times for each time the other is selected
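The two first-free policies can be sketched as follows. This is an illustration under stated assumptions, not the authors' runtime: the names `PE`, `pick_ff_fifo_gpu` and `RoundRobinK` are hypothetical, the GPU is probed first when both PEs are free, and the round-robin ratio is applied only when both PEs are busy.

```cpp
#include <cassert>

// Hypothetical names for illustration; not the authors' runtime API.
enum class PE { CPU, GPU };

// ff-fifo-gpu: take the first free PE; if both are busy, fall back to the GPU.
// Assumption: the GPU is probed first when both PEs are free.
inline PE pick_ff_fifo_gpu(bool cpu_free, bool gpu_free) {
    if (gpu_free) return PE::GPU;
    if (cpu_free) return PE::CPU;
    return PE::GPU;  // all busy
}

// ff-fifo-rr: when both PEs are busy, pick the favored PE k times
// for each single time the other PE is picked.
class RoundRobinK {
public:
    RoundRobinK(PE favored, unsigned k) : favored_(favored), k_(k) {
        assert(k >= 1);
    }

    PE pick(bool cpu_free, bool gpu_free) {
        if (gpu_free) return PE::GPU;
        if (cpu_free) return PE::CPU;
        // Both busy: positions 0..k-1 of each (k+1)-long cycle go to the
        // favored PE, position k goes to the other one.
        PE choice = (count_ < k_) ? favored_ : other();
        count_ = (count_ + 1) % (k_ + 1);
        return choice;
    }

private:
    PE other() const { return favored_ == PE::CPU ? PE::GPU : PE::CPU; }

    PE favored_;
    unsigned k_;
    unsigned count_ = 0;
};
```

Neither policy looks at how well a function runs on a given PE, which is the gap the predictive algorithms address.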

13 Predictive Algorithms
History GPU (history-gpu)
- Maintain a performance history table for every pair <function, PE>
- Compute the set of PEs where the task is allowed to execute
- Select the first free PE from that set; if all are busy, select the GPU
- Avoids choosing a poor <function, PE> combination

$$\mathrm{allowedPE}_f = \{\, pe_i \mid \forall j : time_{f,pe_i} \le k \cdot time_{f,pe_j} \,\}$$

History tables (one for the CPU, one for the GPU):

func     time
matmul   time_matmul
fft      time_fft
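A minimal sketch of the history-gpu selection, assuming the allowed set keeps every PE whose recorded time for the function is at most k times the recorded time on any other PE. The names `History`, `allowed_pes` and `pick_history_gpu` are hypothetical, and treating a PE with no recorded time as allowed (so that it gets tried at least once) is an assumption of this sketch, not something the slide states.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class PE { CPU, GPU };

// History table: last observed execution time per <function, PE> pair.
using History = std::map<std::string, std::map<PE, double>>;

// PEs on which function f may run: a PE is excluded when its recorded time
// is more than k times the recorded time of some other PE.
std::vector<PE> allowed_pes(const History& h, const std::string& f, double k) {
    assert(k >= 1.0);
    const std::vector<PE> all = {PE::CPU, PE::GPU};
    auto fit = h.find(f);
    if (fit == h.end()) return all;  // no history yet: allow everything
    const std::map<PE, double>& times = fit->second;

    std::vector<PE> allowed;
    for (PE pi : all) {
        auto ti = times.find(pi);
        bool ok = true;
        if (ti != times.end()) {
            for (const auto& entry : times) {
                if (ti->second > k * entry.second) { ok = false; break; }
            }
        }
        if (ok) allowed.push_back(pi);
    }
    return allowed;
}

// history-gpu: first free PE from the allowed set, else the GPU.
PE pick_history_gpu(const History& h, const std::string& f, double k,
                    bool cpu_free, bool gpu_free) {
    for (PE pe : allowed_pes(h, f, k)) {
        bool is_free = (pe == PE::CPU) ? cpu_free : gpu_free;
        if (is_free) return pe;
    }
    return PE::GPU;
}
```

With k = 2, a function that takes 10 time units on the CPU and 1 on the GPU is never sent to the CPU, even when the CPU is idle and the GPU is busy.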

14 Predictive Algorithms
Estimate history (estimate-hist)
1. Maintain a performance history table for every pair <function, PE>
2. Predict the waiting time for a new task in each PE's queue
3. Assign the task to the queue with the minimum waiting time
- Tries to reduce load imbalance
[Figure: CPU task queue (T1, T2) and GPU task queue (T1, T2, T3), each PE with its own history table (func, time: matmul, time_matmul; fft, time_fft)]
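The three steps above can be sketched as follows. This is an illustration only: `History`, `Queue`, `predicted_time`, `waiting_time` and `pick_estimate_hist` are hypothetical names, the waiting time of a queue is taken to be the sum of the recorded times of its pending tasks, functions without history contribute a caller-supplied `fallback` value, and ties go to the GPU. All of these are assumptions of the sketch.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class PE { CPU, GPU };

// History table: last observed execution time per <function, PE> pair.
using History = std::map<std::string, std::map<PE, double>>;

// Names of the functions pending in one PE's task queue.
using Queue = std::vector<std::string>;

// Recorded time of f on pe, or `fallback` when no history exists (assumption).
double predicted_time(const History& h, const std::string& f, PE pe,
                      double fallback) {
    auto fit = h.find(f);
    if (fit == h.end()) return fallback;
    auto pit = fit->second.find(pe);
    return pit == fit->second.end() ? fallback : pit->second;
}

// Predicted waiting time for a new task: sum over the tasks already queued.
double waiting_time(const History& h, const Queue& q, PE pe, double fallback) {
    double total = 0.0;
    for (const std::string& f : q) total += predicted_time(h, f, pe, fallback);
    return total;
}

// estimate-hist: send the new task to the queue with the smaller wait.
PE pick_estimate_hist(const History& h, const Queue& cpu_q, const Queue& gpu_q,
                      double fallback = 0.0) {
    assert(fallback >= 0.0);
    double cpu_wait = waiting_time(h, cpu_q, PE::CPU, fallback);
    double gpu_wait = waiting_time(h, gpu_q, PE::GPU, fallback);
    return cpu_wait < gpu_wait ? PE::CPU : PE::GPU;  // ties go to the GPU
}
```

Following the slide's wording, the decision compares waiting times only; the new task's own execution time on each PE is not added to the estimate.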

15 Methodology
Benchmarks
1. Parboil Benchmark Suite (UIUC)
   - cp (VMD)
   - sad (H264)
2. FTDock (protein alignment)
   - Uses Fast Fourier Transform
3. Matrix multiplication
Benchmark shortnames: Cp, Sad, Ftdock, Matmul
System
- Intel Core 2, 2.4 GHz
- NVIDIA GeForce 8600 GTS / 8800 GTX (PCIe x16)
- Red Hat Enterprise Linux 5 (2.6.18)

16 Methodology
- Concurrent execution of benchmark combinations (NMP: multiprogramming level)
  - Example: mmfs (NMP = 4)
  - Representative subset from the whole set
- Measures
  - Execution time (wall-clock time, speedup over the GPU in batch execution mode)
  - Traces
  - Scheduler overhead

17 CPU-GPU multiprogrammed performance
- Benchmarks typically perform better on the GPU
- Some combinations perform similarly on the CPU and the GPU

18 Performance Speedup for NMP=4
- First-free algorithms have non-constant performance
- Predictive algorithms achieve up to 20-30% speedup over batch mode execution on the GPU
[Figure: baseline is GPU batch execution]

19 Performance Speedup for NMP=6
- Predictive algorithms increase the speedup up to 30-40%
- estimate-hist does not outperform history-gpu
[Figure: baseline is GPU batch execution]

20 PE's Queue Usage
- PE usage increases with more concurrent tasks
[Figure: number of tasks waiting in the PEs' queues upon the arrival of a new task]

21 Task Distribution
- Prediction algorithms achieve a better CPU/GPU balance
[Figure: task execution distribution between both PEs (CPU-GPU)]

22 Conclusions
- CPU/GPU multiprogrammed environment
  - Multiversioning + dynamic scheduling
- Dynamic scheduling improves system throughput
  - Up to 40% speedup over batch execution on the GPU
  - Throughput increases as NMP does
- Predictive algorithms outperform first-free ones
- Scheduler overhead is less than 0.1%

23 Future Work
- Analyze the influence of data inputs on scheduling and performance
- Add data input characteristics and run-time performance information (HW counters) to train the predictive model
- Collective profiling from multiple users
  - Increase prediction accuracy
- Integrate our predictive scheduler with the Linux kernel for greater flexibility

25 Backup Slides

26 Heterogeneous Architectures
CPU + NVIDIA G80
- Processing elements with different purposes and/or characteristics
- NVIDIA G80 (CUDA)
  - General-purpose GPU
  - Up to 128 processors
  - Programming model
    - Kernel-calling
    - Explicit DMA transfer
  - Widely available

27 CUDA Execution Model
- Grid-like execution
  - Grid made of blocks
  - Block made of threads
- Concurrent execution of thousands of threads
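The grid/block/thread hierarchy can be illustrated with a host-side C++ emulation of a one-dimensional launch. This is not CUDA code and runs sequentially; `emulate_grid` and `scale` are hypothetical helpers. The index computation mirrors the usual CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of a 1D grid: calls body(global_index) once for each
// (block, thread) pair. A real GPU would run these concurrently.
template <typename Body>
void emulate_grid(unsigned num_blocks, unsigned threads_per_block, Body body) {
    for (unsigned block = 0; block < num_blocks; ++block)
        for (unsigned thread = 0; thread < threads_per_block; ++thread)
            body(block * threads_per_block + thread);
}

// Example "kernel": each logical thread scales one element of the input.
inline std::vector<float> scale(const std::vector<float>& in, float factor,
                                unsigned threads_per_block) {
    assert(threads_per_block > 0);
    std::vector<float> out(in.size());
    const unsigned n = static_cast<unsigned>(in.size());
    // Round up so that every element is covered by some thread.
    const unsigned blocks = (n + threads_per_block - 1) / threads_per_block;
    emulate_grid(blocks, threads_per_block, [&](unsigned i) {
        if (i < n) out[i] = in[i] * factor;  // guard threads past the end
    });
    return out;
}
```

With five elements and four threads per block, two blocks are launched and the last three logical threads do nothing, which is why the bounds check is needed.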

28 Prediction Accuracy
- As the number of concurrent processes increases, the accuracy decreases
  - Especially for the CPU
  - estimate-hist suffers from this loss of accuracy
- Keep some cores off the scheduling system
  - Give them to other parts of the application and/or other processes
- Use prediction accuracy in the scheduling decision


More information

A Cross-Input Adaptive Framework for GPU Program Optimizations

A Cross-Input Adaptive Framework for GPU Program Optimizations A Cross-Input Adaptive Framework for GPU Program Optimizations Yixun Liu, Eddy Z. Zhang, Xipeng Shen Computer Science Department The College of William & Mary Outline GPU overview G-Adapt Framework Evaluation

More information

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy Jan Verschelde joint with Genady Yoffe and Xiangcheng Yu University of Illinois at Chicago Department of Mathematics, Statistics,

More information

EECS 583 Class 20 Research Topic 2: Stream Compilation, GPU Compilation

EECS 583 Class 20 Research Topic 2: Stream Compilation, GPU Compilation EECS 583 Class 20 Research Topic 2: Stream Compilation, GPU Compilation University of Michigan December 3, 2012 Guest Speakers Today: Daya Khudia and Mehrzad Samadi nnouncements & Reading Material Exams

More information

CPU Scheduling. Daniel Mosse. (Most slides are from Sherif Khattab and Silberschatz, Galvin and Gagne 2013)

CPU Scheduling. Daniel Mosse. (Most slides are from Sherif Khattab and Silberschatz, Galvin and Gagne 2013) CPU Scheduling Daniel Mosse (Most slides are from Sherif Khattab and Silberschatz, Galvin and Gagne 2013) Basic Concepts Maximum CPU utilization obtained with multiprogramming CPU I/O Burst Cycle Process

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Hardware Sizing Guide OV

Hardware Sizing Guide OV Hardware Sizing Guide OV3600 6.3 www.alcatel-lucent.com/enterprise Part Number: 0510620-01 Table of Contents Table of Contents... 2 Overview... 3 Properly Sizing Processing and for your OV3600 Server...

More information

Announcements. Reading. Project #1 due in 1 week at 5:00 pm Scheduling Chapter 6 (6 th ed) or Chapter 5 (8 th ed) CMSC 412 S14 (lect 5)

Announcements. Reading. Project #1 due in 1 week at 5:00 pm Scheduling Chapter 6 (6 th ed) or Chapter 5 (8 th ed) CMSC 412 S14 (lect 5) Announcements Reading Project #1 due in 1 week at 5:00 pm Scheduling Chapter 6 (6 th ed) or Chapter 5 (8 th ed) 1 Relationship between Kernel mod and User Mode User Process Kernel System Calls User Process

More information

Multi2sim Kepler: A Detailed Architectural GPU Simulator

Multi2sim Kepler: A Detailed Architectural GPU Simulator Multi2sim Kepler: A Detailed Architectural GPU Simulator Xun Gong, Rafael Ubal, David Kaeli Northeastern University Computer Architecture Research Lab Department of Electrical and Computer Engineering

More information

OPERATING SYSTEMS CS3502 Spring Processor Scheduling. Chapter 5

OPERATING SYSTEMS CS3502 Spring Processor Scheduling. Chapter 5 OPERATING SYSTEMS CS3502 Spring 2018 Processor Scheduling Chapter 5 Goals of Processor Scheduling Scheduling is the sharing of the CPU among the processes in the ready queue The critical activities are:

More information

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix

More information

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2

More information

Offloading Java to Graphics Processors

Offloading Java to Graphics Processors Offloading Java to Graphics Processors Peter Calvert (prc33@cam.ac.uk) University of Cambridge, Computer Laboratory Abstract Massively-parallel graphics processors have the potential to offer high performance

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi

Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

More information

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University (SDSU) Posted:

More information

Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems

Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems 1 Presented by Hadeel Alabandi Introduction and Motivation 2 A serious issue to the effective utilization

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA

Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei H. Hwu

More information

Chris Rossbach, Jon Currey, Microsoft Research Mark Silberstein, Technion Baishakhi Ray, Emmett Witchel, UT Austin SOSP October 25, 2011

Chris Rossbach, Jon Currey, Microsoft Research Mark Silberstein, Technion Baishakhi Ray, Emmett Witchel, UT Austin SOSP October 25, 2011 Chris Rossbach, Jon Currey, Microsoft Research Mark Silberstein, Technion Baishakhi Ray, Emmett Witchel, UT Austin SOSP October 25, 2011 There are lots of GPUs 3 of top 5 supercomputers use GPUs In all

More information

Many-core Computing. Can compilers and tools do the heavy lifting? Wen-mei Hwu

Many-core Computing. Can compilers and tools do the heavy lifting? Wen-mei Hwu Many-core Computing Can compilers and tools do the heavy lifting? Wen-mei Hwu FCRP GSRC, Illinois UPCRC, Illinois CUDA CoE, IACAT, IMPACT University of Illinois, Urbana-Champaign Outline Parallel application

More information

EE , GPU Programming

EE , GPU Programming EE 4702-1, GPU Programming When / Where Here (1218 Patrick F. Taylor Hall), MWF 11:30-12:20 Fall 2017 http://www.ece.lsu.edu/koppel/gpup/ Offered By David M. Koppelman Room 3316R Patrick F. Taylor Hall

More information

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors Kaixi Hou, Wu-chun Feng {kaixihou, wfeng}@vt.edu Shuai Che Shuai.Che@amd.com Sparse

More information

PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort

PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort rob@cs.vu.nl Schedule 2 1. Introduction, performance metrics & analysis 2. Many-core hardware 3. Cuda class 1: basics 4. Cuda class

More information

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:

More information