Automatic Tuning of the High Performance Linpack Benchmark

1 Automatic Tuning of the High Performance Linpack Benchmark
Ruowei Chen
Supervisor: Dr. Peter Strazdins
The Australian National University

2 What is the HPL Benchmark? It is the benchmark used to rank the world's Top 500 supercomputers.

Table 1. Top 5 in November 2010

Rank  Country        Machine                   Processor                        Cores
1     China          Tianhe-1A (NUDT TH MPP)   Intel X5670 6C 2.93 GHz          186,368
2     United States  Jaguar (Cray XT5-HE)      AMD Opteron 6C 2.6 GHz           224,162
3     China          Dawning TC3600            Intel X5650 + NVIDIA Tesla GPU   120,640
4     Japan          TSUBAME 2.0               Intel Xeon 6C 2.93 GHz           73,278
5     United States  Hopper (Cray XE6)         12-core 2.1 GHz                  153,408

3 HPL is doing simple things: solving a dense linear system A x = b with n unknowns (the slide's example uses n = 3; in practice n is around 1,000 to 100,000+), and using the run to determine the performance of the computer system:

  Performance = N_cal / Time

where
  N_cal = number of effective floating-point operations
  Time  = overall time = T_calculation + T_communication
  (the communication overhead may be overlapped by calculations)
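
A two-line sketch makes the formula concrete. The flop count used here, 2/3 n^3 + 3/2 n^2, is my assumption insofar as this slide does not spell it out; it is the full-factorization case (i = n) of the early-termination formula on slide 13:

```python
# A minimal sketch of the performance formula above.
def hpl_gflops(n, seconds):
    n_cal = (2.0 / 3.0) * n**3 + 1.5 * n**2   # effective flop count N_cal
    return n_cal / seconds / 1e9              # Gflop/s

# e.g. a hypothetical N=5000 run that took 4.3 s -> ~19.4 Gflop/s
print(hpl_gflops(5000, 4.3))
```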

4 ... in a complex manner. HPL's algorithm:
- Two-dimensional block-cyclic data distribution
- Right-looking variant of the LU factorization with row partial pivoting, featuring multiple lookahead depths
- Recursive panel factorization with pivot search and column broadcast combined
- Various virtual panel broadcast topologies
- Bandwidth-reducing swap-broadcast algorithm
- Backward substitution with lookahead of depth 1
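
As a concrete illustration of the first bullet, here is a tiny sketch (an illustrative helper, not HPL's own code) of how a two-dimensional block-cyclic distribution maps matrix entries onto a P x Q process grid:

```python
# Which process in a P x Q grid owns global matrix entry (i, j) under a
# two-dimensional block-cyclic distribution with block size nb: blocks of
# nb consecutive rows/columns are dealt out cyclically across the grid.
def owner(i, j, nb, p, q):
    return ((i // nb) % p, (j // nb) % q)   # (process row, process column)

# e.g. with nb=2 on a 2x2 grid, entry (5, 2) lands on process (0, 1):
print(owner(5, 2, nb=2, p=2, q=2))
```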

5 Automatic Tuning of the High Performance Linpack Benchmark
So what to tune on? HPL's algorithm is changeable via 17 parameters:

N -- Problem size
NB -- Blocking factor
P -- Rows in process grid
Q -- Columns in process grid
Depth -- Lookahead depth
Bcasts -- Panel broadcasting method
Pfacts -- Panel factorization method
Rfacts -- Recursive factorization method
Pmap -- Process mapping
threshold -- for matrix validity test
Ndiv -- Panels in recursion
Nbmin -- Recursion stopping criteria
Swap -- Swap algorithm
L1, U -- how to store the triangle of the panel
Align -- Memory alignment
Equilibration
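
These parameters are set in HPL's input file. A representative HPL.dat is sketched below for orientation; the values are illustrative, not the tuned ones from this work:

```
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
5000         Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
```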

6 Automatic Tuning of the High Performance Linpack Benchmark
So what to tune on? HPL's algorithm is changeable via 17 parameters (the list on slide 5).

Considering all parameters (except N, P, Q), an exhaustive search means:
  number of combinations = 1,045,094,400
  running time on Jaguar (2nd fastest): 10.3 hours at N = 5,000, or 10,300 hours at N = 50,000
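
A back-of-the-envelope check of those figures, assuming each trial runs at roughly Jaguar's ~2.33 Pflop/s peak (the rate is my assumption; the slide does not state it):

```python
# Rough sanity check: combinations x (flops per N=5000 run / machine rate).
n, combos = 5000, 1_045_094_400
rate = 2.33e15                                   # assumed ~Jaguar peak, flop/s
flops_per_run = (2.0 / 3.0) * n**3 + 1.5 * n**2
print(combos * flops_per_run / rate / 3600)      # ~10.4 hours, vs 10.3 on the slide
```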

7 Automatic Tuning of the High Performance Linpack Benchmark
So what to tune on? The 17 parameters divide into important and not-so-important ones.

Considering only the important parameters (except N, P, Q), an exhaustive search means:
  number of combinations = 50,400
  running time on Jaguar (2nd fastest): 1.79 seconds at N = 5,000, but 497,222 hours at N = 5,000,000

8 Automatic Tuning of the High Performance Linpack Benchmark
Out: exhaustive search
In:
- Random
- Linear (Linear-Random Hybrid)
- Nelder-Mead

9 Nelder-Mead
One iteration works on a simplex with best (B), good (G), and worst (W) vertices: reflect W through the centroid M to get R; if the reflection is accepted, try expansion (E); otherwise contract (C1 or C2) or shrink (S) toward B, then start the next iteration.
Fig 1. One iteration of Nelder-Mead
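
A minimal sketch of that iteration, assuming standard Nelder-Mead over a continuous parameter vector (for HPL the tuner would have to round to valid discrete parameter values, which the sketch omits):

```python
import numpy as np

def nelder_mead_step(simplex, f, alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5):
    # Sort vertices from best (B) to worst (W); for HPL tuning, f would be
    # the negated benchmark performance, so smaller is better.
    simplex.sort(key=f)
    best, worst = simplex[0], simplex[-1]
    centroid = np.mean(simplex[:-1], axis=0)           # M: centroid of all but W

    reflected = centroid + alpha * (centroid - worst)  # R
    if f(reflected) < f(best):
        expanded = centroid + gamma * (centroid - worst)   # E
        simplex[-1] = min(expanded, reflected, key=f)      # keep the better of E, R
    elif f(reflected) < f(simplex[-2]):                # better than G: accept R
        simplex[-1] = reflected
    else:
        contracted = centroid + rho * (worst - centroid)   # C1/C2
        if f(contracted) < f(worst):
            simplex[-1] = contracted                   # contraction
        else:
            # S: shrink every vertex toward the best one.
            simplex[:] = [best + sigma * (v - best) for v in simplex]
    return simplex
```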

10 Automatic Tuning of the High Performance Linpack Benchmark
Automatic, but how?
Fig 2. Structure of the Autotune Program
* This program was written by Sotirios Diamand, Christopher Frazer, and Samuel Rathmanner, who started the project; I have made a few changes to it.
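
At its core, such a tuner repeatedly (1) writes an HPL.dat for a candidate configuration, (2) runs the benchmark, and (3) parses the achieved Gflops. A minimal sketch of that measure step, assuming a hypothetical HPL.dat.in template with {N}/{NB}/{P}/{Q} placeholders and an xhpl binary launched via mpirun (all names are assumptions, not the autotune program's actual interface):

```python
import subprocess

def run_hpl(n, nb, p, q, template="HPL.dat.in"):
    # 1) Instantiate the input file for this candidate configuration.
    with open(template) as f:
        hpl_dat = f.read().format(N=n, NB=nb, P=p, Q=q)
    with open("HPL.dat", "w") as f:
        f.write(hpl_dat)
    # 2) Run the benchmark on a P x Q process grid.
    out = subprocess.run(["mpirun", "-np", str(p * q), "./xhpl"],
                         capture_output=True, text=True).stdout
    # 3) Parse the result rows, e.g.
    #    "WR11C2R4   5000  128  2  2  4.30  1.939e+01" (Gflops is the last field).
    rows = [line.split() for line in out.splitlines() if line.startswith("WR")]
    return max(float(r[-1]) for r in rows)
```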

11 Goals
- Achieve the highest performance
- Cut down tuning time
  - saves CPU time for other, more important jobs
  - makes it possible to achieve higher performance

12 Tuning by Random
Fig 3. Performance of the random tuner (Gflops vs. number of configurations)
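
A minimal sketch of a random tuner, reusing the hypothetical run_hpl helper from slide 10; the candidate value sets below are illustrative assumptions, not the ranges the project actually searched:

```python
import random

# Illustrative candidate values for a few of the tunables.
SPACE = {
    "nb":    [32, 64, 96, 128, 160, 192, 224, 256],
    "bcast": list(range(6)),     # panel broadcasting method
    "pfact": list(range(3)),     # panel factorization method
    "rfact": list(range(3)),     # recursive factorization method
}

def random_tune(n, p, q, trials=20):
    best_gflops, best_cfg = 0.0, None
    for _ in range(trials):
        cfg = {k: random.choice(v) for k, v in SPACE.items()}
        # run_hpl: the measure step sketched on slide 10; the remaining
        # knobs would be passed into the HPL.dat template the same way.
        gflops = run_hpl(n, cfg["nb"], p, q)
        if gflops > best_gflops:
            best_gflops, best_cfg = gflops, cfg
    return best_gflops, best_cfg
```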

13 Cut Down Tuning Time: Early Termination
Terminate a run after eliminating only i of the n columns, leaving the trailing (n - i) x (n - i) block unfactorized, and estimate performance from the work actually done:

  GFlops = ( 2/3 (n^3 - (n-i)^3) + 3/2 (n^2 - (n-i)^2) ) / T_e

Fig 4. The shape of the problem matrix when terminated early
Fig 5. Accuracy of the early termination (non-ET run vs. ET at 5%, 10%, and 20%)
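
The same estimate as a tiny sketch; note it reduces to the slide-3 flop count when i = n:

```python
# Effective Gflop/s from an early-terminated run: flops performed after
# eliminating i of n columns, divided by the elapsed time T_e.
def early_termination_gflops(n, i, t_e):
    flops = (2.0 / 3.0) * (n**3 - (n - i)**3) + 1.5 * (n**2 - (n - i)**2)
    return flops / t_e / 1e9

# e.g. stopping a hypothetical N=50,000 run after 10% of its columns:
print(early_termination_gflops(50_000, 5_000, t_e=120.0))
```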

14 Cut Down Tuning Time: Use a Reasonable Problem Size
Fig 6. Effect of the problem size N on performance (GFLOPS) and tuning time (sec)

15 Profiling Block Size NB
Fig 7. Effect of the problem size N on the profiling block size NB (GFLOPS vs. NB, with curves for N = 1000, 2000, 3000, ...)

16 Tuning Result
Table 2. Tuning results of the different tuners (Gflops, tuning time in sec, and efficiency in %, for Linear Hybrid, Nelder-Mead (best), and Random with 5, 10, 15, 20, and 25 configurations)

17 Conclusion
- Early termination helps to cut down tuning time.
- We don't need to use the maximal problem size in tuning.
- Linear search performs poorly, which suggests that there are interdependencies between the parameters.
- Nelder-Mead performs well on HPL tuning.
- The random tuner also works well, which suggests the number of truly important parameters may be even fewer.

Future Work
