QUDA Programming For Staggered Quarks


1 QUDA Programming For Staggered Quarks
Steven Gottlieb, Guochun Shi, Aaron Torok, Volodymyr Kindratenko
National Center for Supercomputing Applications & Indiana University

2 Outline
- Background
- New staggered routines
- Benchmarks
- Production experience
- Future

3 Background
- NCSA's Innovative Systems Laboratory worked on a Cell B.E. port of MILC (arXiv:)
- Boston University: QUDA for Wilson-type quarks (arXiv: , arXiv: )
- 2009: sabbatical at NCSA
- We extend QUDA to staggered quarks

4 New Staggered Code
- First effort was a staggered dslash for a single GPU
- Extended to CG and then multimass CG (see the sketch below)
- Fat link computation
- Gauge force
- Fermion force
- Next step was to merge into the BU QUDA code, which had evolved from the initial release
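To make the inverter structure concrete, here is a minimal host-side sketch of the conjugate-gradient loop that the GPU inverter builds around the staggered operator. It is illustrative only: apply_matrix is a hypothetical stand-in for the even-odd preconditioned operator D†D + m² that the GPU kernels actually implement, and the multimass solver follows the same pattern while amortizing the single matrix-vector product per iteration over all masses.

```cpp
// Minimal CG sketch (host side, illustrative). apply_matrix() is a
// hypothetical stand-in for the even-odd preconditioned staggered
// operator (D^dagger D + m^2) implemented by the GPU kernels.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Stand-in operator: a simple SPD matrix (diagonal plus mass term).
static void apply_matrix(const Vec& in, Vec& out, double mass) {
    for (size_t i = 0; i < in.size(); ++i)
        out[i] = (2.0 + (i % 3)) * in[i] + mass * mass * in[i];
}

// Solve (A + m^2) x = b by conjugate gradient; returns iteration count.
static int cg_solve(Vec& x, const Vec& b, double mass, double tol, int maxit) {
    Vec r = b, p = b, Ap(b.size());
    std::fill(x.begin(), x.end(), 0.0);
    double rr = dot(r, r);
    for (int k = 0; k < maxit; ++k) {
        apply_matrix(p, Ap, mass);               // one mat-vec per iteration
        double alpha = rr / dot(p, Ap);
        for (size_t i = 0; i < x.size(); ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        if (std::sqrt(rr_new) < tol) return k + 1;
        double beta = rr_new / rr;
        for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return maxit;
}

int main() {
    Vec b(16, 1.0), x(16);
    int iters = cg_solve(x, b, /*mass=*/0.05, /*tol=*/1e-10, /*maxit=*/100);
    std::printf("converged in %d iterations\n", iters);
    return 0;
}
```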

5 MultiGPU Code
- MultiGPU inverter is now working
- Can overlap communication with computation by using interior and exterior kernels (see the sketch below), or use a single kernel that is launched after the message from the neighbor has been transferred to the GPU
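The overlap pattern is the standard CUDA one: launch the interior kernel on one stream while the boundary exchange proceeds on another, then run the exterior kernel on the boundary sites once the ghost data has arrived and the interior result is available. The sketch below uses placeholder kernel and buffer names (not the actual QUDA kernels), and the MPI exchange between the two copies is elided.

```cpp
// Illustrative interior/exterior overlap sketch (placeholder kernels,
// not the actual QUDA code). Compile with nvcc.
#include <cuda_runtime.h>

__global__ void interior_dslash(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];          // stand-in for bulk sites
}

__global__ void exterior_dslash(float* out, const float* ghost, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] += ghost[i];             // stand-in for boundary sites
}

int main() {
    const int n = 1 << 20, nb = 1 << 12;       // bulk and boundary sizes (illustrative)
    float *d_in, *d_out, *d_ghost, *h_ghost;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_ghost, nb * sizeof(float));
    cudaMallocHost(&h_ghost, nb * sizeof(float));   // pinned for async copies

    cudaStream_t compute, comms;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comms);
    cudaEvent_t interior_done;
    cudaEventCreate(&interior_done);

    // 1. Interior kernel runs on the compute stream; it needs no ghost data.
    interior_dslash<<<(n + 255) / 256, 256, 0, compute>>>(d_out, d_in, n);
    cudaEventRecord(interior_done, compute);

    // 2. Meanwhile the boundary data arrives (MPI exchange elided) and is
    //    copied to the GPU on the communication stream.
    cudaMemcpyAsync(d_ghost, h_ghost, nb * sizeof(float),
                    cudaMemcpyHostToDevice, comms);

    // 3. Exterior kernel waits for both the ghost-zone copy and the interior
    //    result, then accumulates the neighbor contributions on boundary sites.
    cudaStreamWaitEvent(comms, interior_done, 0);
    exterior_dslash<<<(nb + 255) / 256, 256, 0, comms>>>(d_out, d_ghost, nb);

    cudaDeviceSynchronize();
    cudaEventDestroy(interior_done);
    cudaStreamDestroy(compute); cudaStreamDestroy(comms);
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_ghost); cudaFreeHost(h_ghost);
    return 0;
}
```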

6 Differences Between Wilson and Staggered Code
- Improved staggered quarks need both fat links and long links (more memory)
- Fat links are not unitary, so reconstruction is not used (more memory, more time); see the sketch below
- For multiGPU, need to fetch from three planes of the neighboring node (the long links connect sites three apart)
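For ordinary SU(3) gauge links the third row is redundant: it is the complex conjugate of the cross product of the first two rows, so only 12 (or even 8) real numbers per link need to be stored and the rest can be rebuilt in registers. Fat links are general 3x3 complex matrices, so all 18 numbers must be stored. A minimal sketch of 12-number reconstruction:

```cpp
// 12-reconstruct sketch: rebuild the third row of an SU(3) link from the
// first two rows, row3 = conj(row1 x row2). Fat links are not unitary,
// so this trick does not apply to them and all 18 reals are stored.
#include <complex>
#include <cstdio>

using cplx = std::complex<double>;

void reconstruct_third_row(const cplx r1[3], const cplx r2[3], cplx r3[3]) {
    r3[0] = std::conj(r1[1] * r2[2] - r1[2] * r2[1]);
    r3[1] = std::conj(r1[2] * r2[0] - r1[0] * r2[2]);
    r3[2] = std::conj(r1[0] * r2[1] - r1[1] * r2[0]);
}

int main() {
    // Identity matrix as a trivial SU(3) example: reconstruction gives (0,0,1).
    cplx r1[3] = {1, 0, 0}, r2[3] = {0, 1, 0}, r3[3];
    reconstruct_third_row(r1, r2, r3);
    std::printf("(%g,%g) (%g,%g) (%g,%g)\n",
                r3[0].real(), r3[0].imag(), r3[1].real(), r3[1].imag(),
                r3[2].real(), r3[2].imag());
    return 0;
}
```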

7 Benchmarks (Guochun, please check!)
Several systems are available for benchmarking:
- GTX 280: AC cluster at NCSA
- Tesla: FNAL, NERSC, JLab
- Fermi: NERSC

8 [Table: reconstruct, CG (GF/s), and multimass CG (GF/s) for DP, SP, and HP]
^3 × 32 lattice on GTX 280. Four masses for the last column.

9 [Table: standalone and goal performance (goal assumes 100 GB/s with PCIe overhead) for CG, MM-CG, fat link, gauge force, and fermion force]
All values in single-precision GF/s. CG speeds measured for 500 iterations. 24^3 × 32 lattice. 12-reconstruct used when possible. (A simple bandwidth-bound model behind the goal numbers is sketched below.)
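The goal numbers follow from a bandwidth-bound model: the staggered dslash is memory-bound, so the sustainable rate is roughly the arithmetic intensity (flops per byte of gauge-field and vector traffic) times the effective bandwidth. A minimal sketch, with per-site flop and byte counts that are my own rough estimates for a single-precision improved staggered dslash without reconstruction, not figures taken from the slides:

```cpp
// Bandwidth-bound performance model sketch (illustrative numbers, not
// taken from the slides). GF/s ~ (flops per site / bytes per site) * GB/s.
#include <cstdio>

int main() {
    // Rough single-precision estimates for improved staggered dslash with
    // no link reconstruction: 8 fat + 8 long links (18 floats each) plus
    // 16 neighbor vectors (6 floats each) read, 1 vector written.
    const double bytes_per_site = (16 * 18 + 16 * 6 + 6) * 4.0;   // ~1560 B
    const double flops_per_site = 16 * 66 + 15 * 6;               // ~1146 flops
    const double bandwidth_GBs  = 100.0;  // effective, including PCIe overhead

    const double intensity = flops_per_site / bytes_per_site;     // flops/byte
    const double gflops    = intensity * bandwidth_GBs;
    std::printf("arithmetic intensity: %.2f flops/byte\n", intensity);
    std::printf("expected rate at %.0f GB/s: %.0f GF/s\n", bandwidth_GBs, gflops);
    return 0;
}
```

With these rough counts the model predicts on the order of 70 GF/s at 100 GB/s, the same ballpark as the single-precision dslash rate reported on the Current Performance slide.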

10 Production Experience
- Have been using GPUs for electromagnetic effects, i.e., SU(3) × U(1) (see the sketch below)
- So far only using ensembles that fit on a single GPU
- Have analyzed several thousand configurations, from 20^3 × 64 to 24^3 × 96
- Running at AC (NCSA), FNAL, Dirac (NERSC), JLab
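In these electromagnetic-effects runs the quarks see a combined SU(3) × U(1) gauge field: each SU(3) link is multiplied by a U(1) phase built from the photon field before the staggered inverter is applied. A hedged sketch of that link dressing follows; the function and field names and the charge convention are illustrative, not the actual MILC interface.

```cpp
// Sketch of dressing SU(3) links with U(1) phases for electromagnetic
// effects. Names and conventions are illustrative only.
#include <complex>
#include <cstdio>

using cplx = std::complex<double>;

// Multiply one 3x3 SU(3) link by the U(1) phase exp(i * q * A_mu(x)).
void dress_link(cplx link[3][3], double charge, double photon_field) {
    const cplx phase = std::exp(cplx(0.0, charge * photon_field));
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            link[i][j] *= phase;
}

int main() {
    cplx U[3][3] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};   // trivial link
    dress_link(U, /*charge=*/2.0 / 3.0, /*photon_field=*/0.1);
    std::printf("U[0][0] = (%g, %g)\n", U[0][0].real(), U[0][0].imag());
    return 0;
}
```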

11 Future
- Study of heavy-light mesons might use GPUs; need to combine the clover and staggered inverters.
- Although we have asqtad code modules for gauge configuration generation, we are now generating HISQ configurations, so new code must be added.
- Investigate strong scaling as supercomputers reach for petaflop/s performance.
- Essential to decide what parts of production running can be profitably shifted to GPUs.

12 Backup slides after this point
The following slides were shown in Tucson.

13
- 8/17: I arrive at NCSA for sabbatical
- 8/19: Guochun Shi and I attend the JLab GPU workshop
- 8/31: Guochun compiles the BU code on NCSA's AC cluster; I compile on my laptop
- early Sept.: start weekly GPU meeting
- 9/30: Guochun has staggered dslash

14
- 10/18: Guochun is playing with BU options like half precision and different memory arrangements
- 10/20: using texture memory helps performance
- 11/2: struggling with getting phases right in fat link reconstruction
- 11/10: Guochun reports that CG works for asqtad
- 11/17: multimass inverter working

15
- 11/19: join the JLab GPU mailing list; goal is a multiGPU API
- 12/7: Guochun working on fat link computation
- 1/11/10: fat link running at 156 GF
- 1/12/10: start discussion of gauge force calculation at weekly meeting

16 Current Performance
- Single-precision dslash: 85 GF/s without any texture reads
- Multimass inverter (SP): 75 GF/s
- Fat link: 156 GF/s

17 Next Steps
- Guochun is working on the gauge force
- We are waiting for JLab progress on the multiGPU API
- I want to try to use the GPU inverters and fat-link routines for superscript

18 Additional Information
- NCSA GPU code is available via Subversion from kfs3.ncsa.uiuc.edu/projects/ncsa_projects/
- Look in quda/quda, quda/staggered_quda, and milc_qcd for the BU code
