Development Tools for Parallel Computing. David Lecomber CTO, Allinea Software


1 Development Tools for Parallel Computing David Lecomber CTO, Allinea Software

2 Agenda
- Introduction
- What is HPC
- Bugs and debugging
- Debugging parallel applications
- Challenges for the future

3 About Allinea
- Development tools company for HPC
- Flagship product: Allinea DDT
  - The most scalable debugger
  - Now the leading debugger in parallel computing
  - Record holder for debugging software on the largest machines
  - Production use at extreme scale and on the desktop
- Wide customer base
  - Blue-chip engineering, government and academic research
  - Strong collaborative relationships with customers and partners

4 What is HPC
- High Performance Computing
  - Simulation of some natural process/thing
  - Intense number crunching: CPUs work flat out
  - Historically distinct from data crunching
  - Very large number of usually interrelated calculations: too big/slow for a single machine
  - Rarely real time today (but soon?)
- Common aliases
  - Supercomputing
  - Scientific computing
  - Parallel computing
- Examples
  - Engineering: aerospace, automotive
  - Sciences: nuclear physics, molecular modelling, astrophysics
  - Oil and gas: reservoir modelling
  - Medical: modelling of the human heart, neurology
  - Climate modelling and weather forecasting

5 Parallel programming in HPC
- A world of pragmatists
  - Scientists, academics, grad students, engineers
  - Fortran, C++
  - Many legacy codebases
  - Distributed development
  - Difficult to test: scale, platforms, ...
- One dominant standard library: MPI
  - Job launch and data transfer between machines
  - Point-to-point communication (send, receive) and collective operations
  - Single program, multiple data (SPMD): multiple processes with separate memory
- Other models: OpenMP, PGAS languages
- Decades of parallel computing
  - Problems are naturally parallel, although sometimes complex to partition

6 Parallel Programming Models
- Shared memory: OpenMP
  - Pragmas added to existing code
  - Can be straightforward: "#pragma omp parallel for" placed before "for (i = 0; i < n; i++) { ... }"
  - Data race conditions are a potential problem
  - Shared memory required
  - Try it with gcc -fopenmp
- Distributed memory: MPI
  - Distributed memory communication library
  - MPI_Send: send bytes to process N
  - MPI_Recv: receive bytes from a process
  - MPI_Bcast: broadcast from one process to all
  - Around 200 functions; many codes only use ~10
  - Free implementations, e.g. Open MPI and MPICH, do not require a cluster/supercomputer
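The OpenMP side of this slide can be tried in a few lines of C. The following is a minimal sketch, not code from the talk: the function name sum_squares and the choice of a sum as the workload are illustrative assumptions. It shows the parallel for pragma, and it uses a reduction clause so that the shared accumulator does not turn into the data race the slide warns about.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squares of x[0..n-1] (hypothetical example workload).
 * The iterations are independent except for the accumulator `total`.
 * Updating `total` from several threads at once would be a data race,
 * so the reduction clause gives each thread a private copy and adds
 * the copies together when the loop finishes. */
double sum_squares(const double *x, int n)
{
    assert(n >= 0);
    double total = 0.0;
    int i;
#pragma omp parallel for reduction(+ : total)
    for (i = 0; i < n; i++) {
        total += x[i] * x[i];
    }
    return total;
}
```

Built with gcc -fopenmp the loop is shared between threads; built without that flag the pragma is ignored and the same code runs serially, which is one reason adding pragmas to existing code can be straightforward.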

7 Example Code

8 The impact of multicore
- Cannot wait for faster processors to arrive
  - Performance/capability leaps only via more parallelism
  - 8 Petaflops but near 10 Megawatts: efficiency is important
- Reluctant adopters of multicore, but why?
  - Existing codes are parallel
  - Scalability (performance) often tails off as process counts rise ("weak scaling")
  - Two strong oxen or 1,024 chickens?
- One survey of 188 supercomputing centres (IDC):
  - 52% of HPC applications run above 1 node
  - 12% of HPC applications scale above 1,000 cores
  - 1% of applications scale above 10,000 cores
- Software development is required to use more parallel resource efficiently

9 How extreme is it?
[Chart: Growth in HPC core counts. Core count against year; series: average, largest, smallest]
- Machine sizes are exploding
  - Skewed by largest machines but a common trend
  - Largest system (Nov 2011): Japan, 10 Petaflops
  - UK's largest: 90,000 cores and two thirds of a Petaflop
- Easier to build a machine than it is to program it

10 HPC's current challenge
- GPUs: a rival to traditional processors
  - AMD and NVIDIA
  - OpenCL, CUDA
  - Great bang-for-buck ratios
- A big challenge for HPC developers
  - Data transfer
  - Several memory levels
  - Grid/block layout and thread scheduling
  - Synchronization
  - Tiny granularity: often one thread per single calculation (SIMD)
- New languages, compilers, potential standards

11 Example GPU algorithm
- Matrix-matrix multiplication
- For C
  - Nested loops, ~4 lines of code in C
- For CUDA
  - Transfer whole matrix to device memory
  - Read lines of A for block to shared memory
  - Read columns of B for block to shared memory
  - Synchronise
  - Calculate output (loop): one output cell of C per GPU thread
  - End kernel
  - Write array back to host memory
- Recognizably C but...
  - More complex
  - More concurrent
  - More buggy
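The contrast on this slide can be sketched in plain C. The first function below is the "nested loops" version. The second is a CPU-only imitation of the CUDA steps, with small local arrays standing in for shared memory; it is an illustration of the structure, not CUDA code and not code from the talk. The names matmul_naive, matmul_tiled and TILE are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 2  /* illustrative tile edge; stands in for a CUDA thread block */

/* Plain C version: C = A * B for row-major n x n matrices. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Blocked version that mirrors the CUDA steps on the CPU.
 * Requires n to be a multiple of TILE in this sketch. */
void matmul_tiled(size_t n, const double *a, const double *b, double *c)
{
    assert(n % TILE == 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t bi = 0; bi < n; bi += TILE)          /* block row of C */
        for (size_t bj = 0; bj < n; bj += TILE)      /* block column of C */
            for (size_t bk = 0; bk < n; bk += TILE) {
                double ta[TILE][TILE], tb[TILE][TILE];

                /* Copy one tile of A and one tile of B into local storage
                 * (the role shared memory plays on a GPU). */
                for (size_t r = 0; r < TILE; r++)
                    for (size_t s = 0; s < TILE; s++) {
                        ta[r][s] = a[(bi + r) * n + (bk + s)];
                        tb[r][s] = b[(bk + r) * n + (bj + s)];
                    }

                /* A GPU kernel would synchronise here. Each (r, s) pair
                 * below corresponds to one thread and one output cell. */
                for (size_t r = 0; r < TILE; r++)
                    for (size_t s = 0; s < TILE; s++) {
                        double sum = 0.0;
                        for (size_t k = 0; k < TILE; k++)
                            sum += ta[r][k] * tb[k][s];
                        c[(bi + r) * n + (bj + s)] += sum;
                    }
            }
}
```

Even without any real concurrency, the blocked version has more index arithmetic and more places to get an offset wrong, which is the "more complex, more buggy" point in miniature.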

12 A parallel hybrid world
- Hardware is determining the software
  - To exploit concurrency within a multicore node: shared memory via OpenMP, pthreads, ...
  - To exploit GPUs: CUDA, OpenCL, ...
  - For multiple nodes: MPI
  - Result: mixtures of paradigms (message passing, shared memory and GPU)
- Very large GPU systems now in service:
  - Oak Ridge National Laboratory, Tennessee: Titan (Cray XK6)
    - 20,000 nodes: 299,008 CPU cores and 960 NVIDIA Tesla GPUs (and growing)
  - NUDT, China: Tianhe-1A
    - 86,038 CPU cores and 7,168 NVIDIA Tesla GPUs
    - $88M to build, $20M to run
- Many software rewrites are in progress because of GPUs
  - Cost/performance vs codebase complexity, longevity and development investment
- ...but what do we do when software fails?

13 So how do we fix software?
- With
  - Thousands of threads
  - Millions of variables
  - Terabytes of data
- How do you figure out what's going on with your code?
  - Old tricks long dead: multiple terminals, print statements, ...
- Different from (e.g.) web farms
  - Everything is inter-related, not independent
  - We need to see all threads and processes together
- Different from most other fields? From embedded multicore?
  - Only in scale (sometimes in terminology)
  - Does it look like your problem?
  - Does it look like your next problem?

14 Bugs in Practice

15 Some types of bug
- Some terminology
  - Bohr bug: steady, dependable bug
  - Heisenbug: vanishes when you try to debug (observe)
  - Mandelbug: complexity and obscurity of the cause is so great that it appears chaotic
  - Schroedinbug: first occurs after someone reads the source file and deduces that it never worked, after which the program ceases to work

16 How do we debug?
- A scientific process?
  - Identify and reproduce the bug
  - Hypothesis, trial and observation: understand how the code is behaving...
- Printf
- Command line debuggers
- Graphical debuggers
- Other options
  - Static analysis
  - Race detection with automated tools
  - Valgrind
  - Manual source code review
- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." Brian W. Kernighan and P. J. Plauger, The Elements of Programming Style

17 The oldest debugger in the world
- All developers know printf
- As part of more general debugging messages
  - Run binary in a debug mode to log behaviour
  - Default for many web applications, e.g. HTTP access logs
  - A form of post-mortem debugging
- In response to a specific problem
  - Insert instructions into the code to test a hypothesis
    - "At line 53, x becomes 7 and the if statement is executed"
  - Recompile, run, examine output
  - Hypothesis incorrect? Loop back and try again
  - Hypothesis correct? Remove debug output from binary, and fix the bug
- At scale?
  - Interference with timing
  - Interleaving of output can be misleading
  - Flushing of output can be misleading
  - Too much output
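For readers who still reach for printf in a parallel code, two of the "at scale" problems (interleaving and flushing) can be reduced, though not removed, by tagging every line with the process rank and flushing after each message. The sketch below is an illustration only: format_debug_line, debug_log and DEBUG_LOG are hypothetical names, and the rank is assumed to be supplied by the caller (for instance from MPI_Comm_rank). It does nothing about timing interference or output volume.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format one debug line as "[rank R] file:line: message".
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the buffer was too small. */
int format_debug_line(char *buf, size_t size, int rank,
                      const char *file, int line, const char *message)
{
    assert(buf != NULL && size > 0);
    int n = snprintf(buf, size, "[rank %d] %s:%d: %s", rank, file, line, message);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Write the whole line with one call and flush straight away, so the
 * message is not left sitting in a buffer if the process later crashes. */
void debug_log(FILE *out, int rank, const char *file, int line, const char *message)
{
    char buf[256];
    if (format_debug_line(buf, sizeof buf, rank, file, line, message) < 0)
        return;  /* this sketch simply drops oversized messages */
    fprintf(out, "%s\n", buf);
    fflush(out);
}

/* Convenience wrapper that records the call site. */
#define DEBUG_LOG(rank, message) \
    debug_log(stderr, (rank), __FILE__, __LINE__, (message))
```

A call such as DEBUG_LOG(rank, "x becomes 7") then records which process reached which line, which is the minimum needed to make sense of output collected from many processes.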

18 Real debuggers...
- Inspect the insides of an application whilst it is alive
  - Inspect process state
    - Process registers and memory
    - Variables and stacktraces (nesting of function calls)
  - Control/observe execution
    - Step line by line, function by function through an execution
    - Stop at a line or function (breakpoint)
    - Stop if a memory location changes
- Ideal to watch how a program is executed
  - Less intrusive on the code than printf
  - See exact line of crash, unlike printf
  - Test more hypotheses at a time
- Most well-known examples cater for single process debugging
  - GDB, Visual Studio, ...

19 Debugging Parallel Applications
- The same needs: observation, control, ...
- More complex environment
  - No command prompt
  - Printf unreliable
  - No core files
- More complex problems
  - More processes
  - More data
  - More Heisenbugs
  - Threading and communication introduce non-determinism

20 Allinea DDT in a nutshell
- Graphical source level debugger for
  - Parallel, multi-threaded, scalar or hybrid code
  - C, C++, F90, Co-Array Fortran, UPC
- Strong feature set
  - Memory debugging
  - Data analysis
- Managing concurrency
  - Emphasizing differences
  - Collective control
- Make as simple as possible, no more

21 Fixing everyday crashes
- Typical crash scenario:
  - Threads/processes can be anywhere
  - Too many to manually examine individually
  - A good overview is important
- Allinea DDT merges stacks from processes and threads into a tree
  - Leap to source for crashes
  - Information presented scalably, without overload
  - Common fault patterns evident instantly: divergence, deadlock
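The idea of merging stacks into a tree can be sketched as a prefix tree keyed on function names, where each node counts how many processes have that frame on their stack. This is an illustration of the general technique under simple assumptions (frames are plain strings, at most MAX_CHILDREN distinct callees per node), not a description of how DDT implements it; all names below are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHILDREN 8  /* enough for this sketch */

typedef struct StackNode {
    const char *function;                     /* frame name (not copied) */
    int process_count;                        /* stacks passing through this frame */
    int child_count;
    struct StackNode *children[MAX_CHILDREN];
} StackNode;

StackNode *stack_node_new(const char *function)
{
    StackNode *node = calloc(1, sizeof *node);
    assert(node != NULL);
    node->function = function;
    return node;
}

/* Find the child of `node` for a given function name, or NULL. */
StackNode *stack_tree_child(const StackNode *node, const char *function)
{
    for (int c = 0; c < node->child_count; c++)
        if (strcmp(node->children[c]->function, function) == 0)
            return node->children[c];
    return NULL;
}

/* Merge one process's stack (outermost frame first) into the tree.
 * Shared prefixes reuse existing nodes; a new branch appears only
 * where this stack diverges from those already merged. */
void stack_tree_merge(StackNode *root, const char *const *frames, int depth)
{
    StackNode *node = root;
    node->process_count++;
    for (int d = 0; d < depth; d++) {
        StackNode *next = stack_tree_child(node, frames[d]);
        if (next == NULL) {
            assert(node->child_count < MAX_CHILDREN);
            next = stack_node_new(frames[d]);
            node->children[node->child_count++] = next;
        }
        next->process_count++;
        node = next;
    }
}

void stack_tree_free(StackNode *node)
{
    for (int c = 0; c < node->child_count; c++)
        stack_tree_free(node->children[c]);
    free(node);
}
```

With thousands of processes the tree stays small as long as most of them are in the same few places, and a branch with a count of one or two is exactly the kind of divergence that stands out.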

22 Process Control
- Interacting with application progress is easy with DDT
  - Step, breakpoint, play, or set data watchpoints based on groups
  - Change interleaving order by stepping/playing selectively
- Group creation is easy
  - Integrated throughout Allinea DDT, e.g. stack and data views

23 Simplifying data divergence
- Clear need to see data
  - Too many variables to trawl manually
- Allinea DDT compares data automatically
- Smart highlighting
  - Subtle hints for differences and changes
  - New: now with sparklines!
- More detailed analysis
  - Full cross process comparison
  - Historical values via tracepoints

24 Large Array Support
- Browse arrays
  - 1, 2, 3, ... dimensions
  - Table view
- Filtering
  - Look for an outlier
- Export
  - Save to a spreadsheet
- View arrays from multiple processes
- Search terabytes for rogue data in parallel

25 A simple parallel debugger
- A basic parallel debugger
  - Aggregate scalar debuggers
    - They work: good starting point
  - Implement support for many platforms and MPI implementations
  - Develop user interface
    - Control asynchronously
    - Simplify control and state display
[Diagram: User Interface, Controllers, Debuggers, Processes]
- Initial architecture
  - Scalar debuggers connect to user interface
  - Direct connections: linear performance
- Any per-process item is an eventual bottleneck
  - Operating system limitations
    - File handles on the GUI
    - Threads, processes
  - I/O limitations
    - Linear access counts on the best networked file systems are still linear
  - Memory and computation limitations
- Machines still getting bigger...

26 Bug fixing at scale
- Can we reproduce at a smaller scale?
  - Attempt to make problem happen on fewer nodes
  - Often requires reduced data set: the large one may not fit
  - Smaller data set may not trigger the problem
- Does the bug even exist on smaller problems?
  - Didn't you already try the code at small scale?
  - Is it a system issue, e.g. an MPI problem?
- Is probability stacking up against you?
  - Unlikely to spot on smaller runs without many, many runs
  - But near guaranteed to see it on a many-thousand core run
- Debugging at extreme scale is a necessity
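The "probability stacking up" point can be made concrete with a rough model. Assume, purely for illustration, that each process independently triggers the bug with the same small probability p in any one run. Real bugs are rarely this well behaved, and the value of p below is invented, not measured.

```latex
% Probability that at least one of N processes hits the bug in a run
P_N = 1 - (1 - p)^{N}

% Illustrative values for p = 10^{-4}
P_{16} = 1 - (1 - 10^{-4})^{16} \approx 0.0016
P_{100000} = 1 - (1 - 10^{-4})^{100000} \approx 1 - e^{-10} \approx 0.99995
```

Under these assumptions a 16-process test run shows the bug about once in 600 runs, while a 100,000-process run shows it almost every time, which is why the failure appears to "only happen at scale".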

27 How to make a Petascale debugger
- A control tree is the solution
  - Ability to send bulk commands and merge responses
  - 100,000 processes in a depth 3 tree
- Compact data type to represent sets of processes
  - e.g. for message envelopes
  - An ordered tree of intervals? Or a bitmap?
- Develop aggregations
  - Merge operations are key
  - Not everything can merge losslessly
  - Maintain the essence of the information, e.g. min, max, distribution
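Two of the ideas on this slide, a compact process set and a merge operation, can be sketched together. The example below stores a set of ranks as a sorted list of intervals (a simpler stand-in for the "ordered tree of intervals" the slide mentions) and merges per-process values into a min/max summary as responses move up the control tree. All type and function names are hypothetical, and the fixed MAX_RANGES capacity is an assumption made to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANGES 16  /* capacity chosen for this sketch only */

/* A set of ranks as sorted, non-overlapping, non-adjacent [first, last] ranges. */
typedef struct { int first, last; } RankRange;
typedef struct { RankRange ranges[MAX_RANGES]; int count; } RankSet;

/* What a tree node reports upward about one variable. */
typedef struct { double min, max; RankSet who; } Summary;

/* Append a range whose `first` is >= every `first` already in the set,
 * coalescing with the previous range when they overlap or touch. */
static void rank_set_append(RankSet *set, RankRange r)
{
    if (set->count > 0 && r.first <= set->ranges[set->count - 1].last + 1) {
        if (r.last > set->ranges[set->count - 1].last)
            set->ranges[set->count - 1].last = r.last;
    } else {
        assert(set->count < MAX_RANGES);
        set->ranges[set->count++] = r;
    }
}

/* Union of two sets: walk both sorted range lists in order of `first`. */
RankSet rank_set_union(const RankSet *a, const RankSet *b)
{
    RankSet out = { .count = 0 };
    int i = 0, j = 0;
    while (i < a->count || j < b->count) {
        if (j >= b->count ||
            (i < a->count && a->ranges[i].first <= b->ranges[j].first))
            rank_set_append(&out, a->ranges[i++]);
        else
            rank_set_append(&out, b->ranges[j++]);
    }
    return out;
}

/* Summary for a single process. */
Summary summary_leaf(int rank, double value)
{
    Summary s = {0};
    s.min = s.max = value;
    s.who.count = 1;
    s.who.ranges[0].first = s.who.ranges[0].last = rank;
    return s;
}

/* Merge two summaries on their way up the tree. The process set merges
 * losslessly; the individual values do not, only min and max survive. */
Summary summary_merge(const Summary *a, const Summary *b)
{
    Summary out;
    out.min = a->min < b->min ? a->min : b->min;
    out.max = a->max > b->max ? a->max : b->max;
    out.who = rank_set_union(&a->who, &b->who);
    return out;
}
```

When the ranks below a tree node are contiguous, as they often are, the whole subtree collapses to a single interval, so the message sent upward stays the same size however many processes it describes. A bitmap would instead cost one bit per process regardless of layout.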

28 For Petascale and beyond
[Chart: DDT 3.0 Performance Figures. Time (seconds) against MPI processes, axis to 200,000; series: All Step, All Breakpoint]
- Partnership with largest users
  - DoE Oak Ridge National Laboratory, LLNL, ANL, CEA and others
- High performance debugging, even at 220,000 cores
  - Step all and display stacks: 0.1 seconds
  - Logarithmic
- Usability is a Big Thing
  - Scalable interface and features
- One million cores? Waiting for the machine!
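The "logarithmic" behaviour follows from the shape of a control tree: a command or a merged response crosses one level per hop, and the number of levels grows with the logarithm of the process count. The helpers below are a small arithmetic sketch with hypothetical names. The talk does not state the fan-out DDT uses, so the figures are only what the arithmetic implies for a uniform tree.

```c
#include <assert.h>

/* Number of levels a tree with branching factor `fanout` needs
 * to reach at least `processes` leaves. */
int tree_depth(unsigned long long processes, unsigned long long fanout)
{
    assert(processes >= 1 && fanout >= 2);
    int depth = 0;
    unsigned long long reach = 1;
    while (reach < processes) {
        reach *= fanout;
        depth++;
    }
    return depth;
}

/* Smallest branching factor that reaches `processes` leaves
 * within `depth` levels. */
unsigned long long min_fanout(unsigned long long processes, int depth)
{
    assert(processes >= 1 && depth >= 1);
    unsigned long long fanout = 1;
    for (;;) {
        unsigned long long reach = 1;
        for (int d = 0; d < depth && reach < processes; d++)
            reach *= fanout;
        if (reach >= processes)
            return fanout;
        fanout++;
    }
}
```

For example, min_fanout(100000, 3) returns 47, since 46 cubed is 97,336 and 47 cubed is 103,823. A depth-3 tree over 100,000 processes therefore needs each node to handle only a few dozen children, rather than one front end handling 100,000 direct connections.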

29 The Future
- Concurrency will increase
  - 2012 or early 2013: DDT will debug a million core system
  - International and national groups are preparing for Exascale: 100x more powerful than today's most powerful system
  - Expected to be multi-level parallel (hybrid)
- Continued adoption of multicore and hybrid programming in consumer arena:
  - Laptops, tablets, mobile phones


More information

Simplifying the Development and Debug of 8572-Based SMP Embedded Systems. Wind River Workbench Development Tools

Simplifying the Development and Debug of 8572-Based SMP Embedded Systems. Wind River Workbench Development Tools Simplifying the Development and Debug of 8572-Based SMP Embedded Systems Wind River Workbench Development Tools Agenda Introducing multicore systems Debugging challenges of multicore systems Development

More information

Introduction to Parallel Computing!

Introduction to Parallel Computing! Introduction to Parallel Computing! SDSC Summer Institute! August 6-10, 2012 San Diego, CA! Rick Wagner! HPC Systems Manager! Purpose, Goals, Outline, etc.! Introduce broad concepts " Define terms " Explore

More information

Debugging and Optimizing Programs Accelerated with Intel Xeon Phi Coprocessors

Debugging and Optimizing Programs Accelerated with Intel Xeon Phi Coprocessors Debugging and Optimizing Programs Accelerated with Intel Xeon Phi Coprocessors Chris Gottbrath Rogue Wave Software Boulder, CO Chris.Gottbrath@roguewave.com Abstract Intel Xeon Phi coprocessors present

More information

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

High performance computing and numerical modeling

High performance computing and numerical modeling High performance computing and numerical modeling Volker Springel Plan for my lectures Lecture 1: Collisional and collisionless N-body dynamics Lecture 2: Gravitational force calculation Lecture 3: Basic

More information

How to Write Code that Will Survive the Many-Core Revolution

How to Write Code that Will Survive the Many-Core Revolution How to Write Code that Will Survive the Many-Core Revolution Write Once, Deploy Many(-Cores) Guillaume BARAT, EMEA Sales Manager CAPS worldwide ecosystem Customers Business Partners Involved in many European

More information

Good Practices in Parallel and Scientific Software. Gabriel Pedraza Ferreira

Good Practices in Parallel and Scientific Software. Gabriel Pedraza Ferreira Good Practices in Parallel and Scientific Software Gabriel Pedraza Ferreira gpedraza@uis.edu.co Parallel Software Development Research Universities Supercomputing Centers Oil & Gas 2004 Time Present CAE

More information

Debugging with GDB and DDT

Debugging with GDB and DDT Debugging with GDB and DDT Ramses van Zon SciNet HPC Consortium University of Toronto June 13, 2014 1/41 Ontario HPC Summerschool 2014 Central Edition: Toronto Outline Debugging Basics Debugging with the

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming

More information

Scalable Debugging with TotalView on Blue Gene. John DelSignore, CTO TotalView Technologies

Scalable Debugging with TotalView on Blue Gene. John DelSignore, CTO TotalView Technologies Scalable Debugging with TotalView on Blue Gene John DelSignore, CTO TotalView Technologies Agenda TotalView on Blue Gene A little history Current status Recent TotalView improvements ReplayEngine (reverse

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Waiting for Moore s Law to save your serial code start getting bleak in 2004 Source: published SPECInt data Moore s Law is not at all

More information

HPC with GPU and its applications from Inspur. Haibo Xie, Ph.D

HPC with GPU and its applications from Inspur. Haibo Xie, Ph.D HPC with GPU and its applications from Inspur Haibo Xie, Ph.D xiehb@inspur.com 2 Agenda I. HPC with GPU II. YITIAN solution and application 3 New Moore s Law 4 HPC? HPC stands for High Heterogeneous Performance

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

Hybrid Model Parallel Programs

Hybrid Model Parallel Programs Hybrid Model Parallel Programs Charlie Peck Intermediate Parallel Programming and Cluster Computing Workshop University of Oklahoma/OSCER, August, 2010 1 Well, How Did We Get Here? Almost all of the clusters

More information

High Performance Computing (HPC) Introduction

High Performance Computing (HPC) Introduction High Performance Computing (HPC) Introduction Ontario Summer School on High Performance Computing Scott Northrup SciNet HPC Consortium Compute Canada June 25th, 2012 Outline 1 HPC Overview 2 Parallel Computing

More information

Introduction to tuning on many core platforms. Gilles Gouaillardet RIST

Introduction to tuning on many core platforms. Gilles Gouaillardet RIST Introduction to tuning on many core platforms Gilles Gouaillardet RIST gilles@rist.or.jp Agenda Why do we need many core platforms? Single-thread optimization Parallelization Conclusions Why do we need

More information

Introduction to Concurrent Software Systems. CSCI 5828: Foundations of Software Engineering Lecture 08 09/17/2015

Introduction to Concurrent Software Systems. CSCI 5828: Foundations of Software Engineering Lecture 08 09/17/2015 Introduction to Concurrent Software Systems CSCI 5828: Foundations of Software Engineering Lecture 08 09/17/2015 1 Goals Present an overview of concurrency in software systems Review the benefits and challenges

More information

PRACE Autumn School Basic Programming Models

PRACE Autumn School Basic Programming Models PRACE Autumn School 2010 Basic Programming Models Basic Programming Models - Outline Introduction Key concepts Architectures Programming models Programming languages Compilers Operating system & libraries

More information

AutoTune Workshop. Michael Gerndt Technische Universität München

AutoTune Workshop. Michael Gerndt Technische Universität München AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy

More information

CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Overview Parallel programming allows the user to use multiple cpus concurrently Reasons for parallel execution: shorten execution time by spreading the computational

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

CSE 374 Programming Concepts & Tools

CSE 374 Programming Concepts & Tools CSE 374 Programming Concepts & Tools Hal Perkins Fall 2017 Lecture 11 gdb and Debugging 1 Administrivia HW4 out now, due next Thursday, Oct. 26, 11 pm: C code and libraries. Some tools: gdb (debugger)

More information

The Stampede is Coming: A New Petascale Resource for the Open Science Community

The Stampede is Coming: A New Petascale Resource for the Open Science Community The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation

More information

The Art of Parallel Processing

The Art of Parallel Processing The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a

More information

Overview of research activities Toward portability of performance

Overview of research activities Toward portability of performance Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

NightStar. NightView Source Level Debugger. Real-Time Linux Debugging and Analysis Tools BROCHURE

NightStar. NightView Source Level Debugger. Real-Time Linux Debugging and Analysis Tools BROCHURE NightStar Real-Time Linux Debugging and Analysis Tools Concurrent s NightStar is a powerful, integrated tool set for debugging and analyzing time-critical Linux applications. NightStar tools run with minimal

More information

Introduction to Concurrent Software Systems. CSCI 5828: Foundations of Software Engineering Lecture 12 09/29/2016

Introduction to Concurrent Software Systems. CSCI 5828: Foundations of Software Engineering Lecture 12 09/29/2016 Introduction to Concurrent Software Systems CSCI 5828: Foundations of Software Engineering Lecture 12 09/29/2016 1 Goals Present an overview of concurrency in software systems Review the benefits and challenges

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded

More information

Parallelism. Parallel Hardware. Introduction to Computer Systems

Parallelism. Parallel Hardware. Introduction to Computer Systems Parallelism We have been discussing the abstractions and implementations that make up an individual computer system in considerable detail up to this point. Our model has been a largely sequential one,

More information

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS CIS 601 - Graduate Seminar Presentation 1 GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS PRESENTED BY HARINATH AMASA CSU ID: 2697292 What we will talk about.. Current problems GPU What are GPU Databases GPU

More information

Debugging Your CUDA Applications With CUDA-GDB

Debugging Your CUDA Applications With CUDA-GDB Debugging Your CUDA Applications With CUDA-GDB Outline Introduction Installation & Usage Program Execution Control Thread Focus Program State Inspection Run-Time Error Detection Tips & Miscellaneous Notes

More information

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4 OpenACC Course Class #1 Q&A Contents OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4 OpenACC/CUDA/OpenMP Q: Is OpenACC an NVIDIA standard or is it accepted

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model Bulk Synchronous and SPMD Programming The Bulk Synchronous Model CS315B Lecture 2 Prof. Aiken CS 315B Lecture 2 1 Prof. Aiken CS 315B Lecture 2 2 Bulk Synchronous Model The Machine A model An idealized

More information

Typical Bugs in parallel Programs

Typical Bugs in parallel Programs Center for Information Services and High Performance Computing (ZIH) Typical Bugs in parallel Programs Parallel Programming Course, Dresden, 8.- 12. February 2016 Joachim Protze (protze@rz.rwth-aachen.de)

More information