Parallel Systems. Project topics


Parallel Systems Project topics 2016-2017

1. Scheduling Scheduling is a common problem that is, in general, NP-complete, so we can never be sure a solution is optimal. Parallelisation is a natural way to search for better solutions. How the search space is traversed will be discussed in the forthcoming chapter on Discrete Optimization Problems (slides can be found on the website). Parallelisation can happen at several levels (see slide). We will provide a real problem and discuss with the student at which level to parallelize the code.

Automated warehouses: scheduling

Optimization of container barge routing scheduling

Parallelization: algorithms, levels of parallelism, parallel hardware
- Alternative algorithms: Branch & Bound, Iterated Local Search
- Alternative scenarios (due to uncertainty): multi-search, on a cluster or multicore
- Tree-level parallelism: on an accelerator
- Node-level parallelism (neighbor exploration): parallel calculation of the score or bound
Suitability is problem-dependent!
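As a concrete illustration of what branch & bound looks like before any parallelization, here is a minimal sequential sketch for a hypothetical toy scheduling problem (assigning jobs to two machines to minimize the makespan); the instance, names, and bound are illustrative, not the real problem we will provide. Tree-level parallelism would explore the two recursive branches concurrently; node-level parallelism would compute the bound in parallel.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Branch & bound for a toy 2-machine scheduling problem: each job goes to
// machine A or B; minimize the makespan max(loadA, loadB).
static int best;

void bb(const std::vector<int>& jobs, size_t i, int loadA, int loadB, int remaining) {
    int makespan = std::max(loadA, loadB);
    if (makespan >= best) return;                       // bound: cannot improve
    if (i == jobs.size()) { best = makespan; return; }  // leaf: record solution
    // Lower bound: even a perfect split of the remaining work cannot beat this.
    int lb = std::max(makespan, (loadA + loadB + remaining + 1) / 2);
    if (lb >= best) return;                             // prune the subtree
    bb(jobs, i + 1, loadA + jobs[i], loadB, remaining - jobs[i]);
    bb(jobs, i + 1, loadA, loadB + jobs[i], remaining - jobs[i]);
}

int solve(const std::vector<int>& jobs) {
    best = std::accumulate(jobs.begin(), jobs.end(), 0); // all on one machine
    bb(jobs, 0, 0, 0, best);
    return best;
}
```

Because the pruning bound is shared state, a parallel version must update `best` atomically, which is exactly the kind of design decision the project will address.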

2. Image processing Image processing requires a lot of computational power (and also memory bandwidth). These algorithms are therefore first candidates for parallelization.

Convolutions (Burg, chapter 3): a neighborhood operation.

Example of convolution: edge detection with the Sobel filter.
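A minimal sequential sketch of the Sobel neighborhood operation (illustrative, not the course's exact listing): each output pixel combines a 3x3 neighborhood with the Gx and Gy kernels, so on a GPU one thread per pixel can compute its result independently.

```cpp
#include <cstdlib>
#include <vector>

using Image = std::vector<std::vector<int>>;

// Sobel edge detection on a grayscale image; border pixels are left at 0.
Image sobel(const Image& img) {
    const int kx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};   // horizontal gradient
    const int ky[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};   // vertical gradient
    size_t h = img.size(), w = img[0].size();
    Image mag(h, std::vector<int>(w, 0));
    for (size_t y = 1; y + 1 < h; ++y)
        for (size_t x = 1; x + 1 < w; ++x) {
            int gx = 0, gy = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    gx += kx[dy + 1][dx + 1] * img[y + dy][x + dx];
                    gy += ky[dy + 1][dx + 1] * img[y + dy][x + dx];
                }
            mag[y][x] = std::abs(gx) + std::abs(gy);  // cheap magnitude approximation
        }
    return mag;
}
```

On a vertical black-to-white edge the horizontal kernel responds strongly while the vertical one stays at zero.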

Histogram Histogram calculations are an essential part of many algorithms. It is not completely straightforward to do this on a GPU, but with some tricks a good speedup can be achieved. The company On Semi is interested in a GPU implementation to calibrate and test the image sensors they fabricate.
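One of those tricks is privatization: each work-group fills its own sub-histogram and the partial histograms are merged at the end, avoiding contention on shared bins. Below is a sequential sketch of that strategy (chunks stand in for work-groups; the names are illustrative, not On Semi's actual code).

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Histogram of 8-bit pixel values using per-group private sub-histograms.
std::array<uint32_t, 256> histogram(const std::vector<uint8_t>& pixels, int groups) {
    std::vector<std::array<uint32_t, 256>> partial(groups);
    for (auto& h : partial) h.fill(0);
    size_t chunk = (pixels.size() + groups - 1) / groups;
    for (int g = 0; g < groups; ++g)              // on a GPU: one work-group each
        for (size_t i = g * chunk; i < std::min(pixels.size(), (g + 1) * chunk); ++i)
            ++partial[g][pixels[i]];              // no contention: private bins
    std::array<uint32_t, 256> hist;
    hist.fill(0);
    for (int g = 0; g < groups; ++g)              // final reduction over partials
        for (int b = 0; b < 256; ++b)
            hist[b] += partial[g][b];
    return hist;
}
```

In OpenCL the private sub-histograms would live in local memory and the merge would use atomic adds to global memory.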

Image processing for dentists Nobel Biocare develops software for dentists and surgeons. They need to speed up their code to ensure a pleasant user experience. At the moment they are stuck on algorithms based on convolutions (see earlier slides). Convolutions are perfect for GPUs! The following two slides show previous work we did with Nobel Biocare.

5.6 Marching cubes Requirements: - Speed up the algorithm (MACH: from seconds to real-time). - Heterogeneous CPU/GPU portability (Win/Mac). (Figure: CT volume -> extraction -> isosurface.)

5.6 Distance map (distance field) Requirements: - Speed up the algorithm (MACH: from minutes to seconds). - Heterogeneous CPU/GPU portability (Win/Mac). (Figure: 3D model, image volume, ray vs. triangle tests.)

The fastest determinant calculator in the world Calculating determinants takes time. A student last year successfully built a GPU implementation (and also a Python version). The determinant of a matrix can be computed from the determinants of its sub-matrices. His algorithm starts by calculating the determinants of all 2x2 sub-matrices, then all 3x3, and so on. It works! There are still some optimizations to be tested. We also want to make it available for use from Matlab. One application is security: they are looking for matrices (see next slide) in which no determinant of any sub-matrix is zero. Our GPU implementation will help, because their Matlab code is awfully slow!

(Slide 14: a 4-dimensional matrix with entries a_ij = a and b_ij = b.)

Test the Intel Xeon Phi A few years ago, Intel brought the Xeon Phi coprocessor to the market to compete with the huge processing power of GPUs. We have one in our department. A student could test its capabilities and compare it with GPUs. The main difference from GPUs is that you need to vectorize your code. This can be done manually or with compiler pragmas (see the following slides).

Intel's Xeon Phi coprocessor: Intel's response to GPUs. 60 cores, RAM, ring network. (Accelerator technology slides, 10/17/2016)

Intel's Xeon Phi core: thread scheduler, 4 hardware threads, 512-bit vector unit (SIMD).

Usage of the coprocessor: as an MPI node, as an off-load target for the main processor, or as a standalone processor. Programmed with common C programming: Pthreads, OpenMP, Intel Threading Building Blocks.

Vector processors (SIMD) 128-bit vector registers, e.g. [7 8 2 -1] + [3 -3 5 -7] = [10 5 7 -8]. An instruction is performed at once on all elements of the vector registers. This has long been viewed as the solution for high-performance computing: why keep repeating the same instruction on different data, when you can apply the instruction to all the data at once? However, it is difficult to program. Is SIMT (OpenCL) a better alternative?

Vectorization is needed for peak performance: __m512 registers, _mm512_add-style intrinsic operations.

Auto-vectorization

SIMD pragma to indicate parallelism
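An illustrative use of the OpenMP SIMD pragma (not the slide's exact listing): the pragma asserts to the compiler that the loop iterations are independent, so it may emit vector instructions, e.g. the 512-bit adds of the Phi. Without OpenMP support the pragma is simply ignored and the loop runs scalar, so the code stays portable.

```cpp
#include <cstddef>
#include <vector>

// Element-wise vector addition with a SIMD vectorization hint.
std::vector<float> vec_add(const std::vector<float>& a, const std::vector<float>& b) {
    std::vector<float> c(a.size());
    #pragma omp simd
    for (size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] + b[i];
    return c;
}
```

With icc or gcc, compiling with -fopenmp-simd (or -fopenmp) activates the hint; a vectorization report (e.g. icc's -qopt-report) shows whether the loop was actually vectorized.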

Successful vectorization

Test new features of OpenCL 2.0

Explore the new features. You need an OpenCL 2.0-enabled GPU (we can buy one; check online). Explore the differences with OpenCL 1.0. Test the features and play with them: write small programs that show how they work. Features include shared virtual memory (one address space for CPU and GPU), streams, ...

Microbenchmarks Our GPU research consists of developing small programs that each test a specific feature of GPUs (e.g. memory bandwidth, double-precision performance, the overhead of launching a kernel, etc.). They are called microbenchmarks and can be found at www.gpuperformance.org. We invite students to develop new microbenchmarks that we can add to our microbenchmark suite.
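As a flavor of what a microbenchmark looks like, here is a hypothetical CPU analogue of a bandwidth test (the function name and instance size are illustrative): time one large copy and derive GB/s. The GPU versions on gpuperformance.org measure the same quantity with device buffers and event-based kernel timing.

```cpp
#include <chrono>
#include <cstring>
#include <vector>

// Time a large memcpy and report the achieved bandwidth in GB/s.
double copy_bandwidth_gbs(size_t bytes) {
    std::vector<char> src(bytes, 1), dst(bytes, 0);
    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), bytes);
    auto t1 = std::chrono::steady_clock::now();
    if (dst[bytes / 2] != 1) return -1.0;       // use the result so the copy is not elided
    double sec = std::chrono::duration<double>(t1 - t0).count();
    return (2.0 * bytes / 1e9) / sec;            // count both read and write traffic
}
```

A real microbenchmark would repeat the measurement, discard warm-up runs, and report the best or median time; a single copy, as here, only gives a rough number.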