International Conference on Computational Science (ICCS 2017)


1 International Conference on Computational Science (ICCS 2017)
Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations
G. Bernabé, J. C. Cano, J. Cuenca, A. Flores, D. Giménez, M. Saura-Sánchez Ω and P. Segado-Cabezos Ω
Computer Engineering Department, University of Murcia
Computer Science and Systems Department, University of Murcia
Ω Mechanical Engineering, Technical University of Cartagena
June, 2017

2 Outline
- Introduction and Motivation
- Parallelism in the Structural Groups method
- Results
- Conclusions and Future work
ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations

4 Introduction
Multibody systems (MBS): mechanical systems formed by rigid and flexible bodies which are connected by means of mechanical joints in such a way that there is relative movement between their bodies
[Figure: The Stewart Platform; labels: terminal, handles, platform]

7 Introduction
The study of the relationships between the bodies is known as kinematic modeling:
- A vector q of coordinates is selected to define the position and orientation of each body of the MBS in space
- The coordinates are related by means of a nonlinear system of constraint equations Φ(q) = 0
- Global formulations
- Topological formulations: exploit the topology of the MBS to reduce the dimension of the problem by relating the position of each body to that of its preceding one
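The position problem Φ(q) = 0 can be illustrated with a small example. The slides do not state which nonlinear solver is used, so the sketch below assumes a plain Newton-Raphson iteration, and the two-constraint mechanism (a point held at distance `l1` from the origin and `l2` from the point `(d, 0)`) is a hypothetical toy problem, not the Stewart Platform. All function names are illustrative.

```python
def phi(q, l1=5.0, l2=5.0, d=6.0):
    """Constraint vector: point (x, y) at distance l1 from (0, 0) and l2 from (d, 0)."""
    x, y = q
    return [x * x + y * y - l1 * l1, (x - d) ** 2 + y * y - l2 * l2]


def jacobian(q, d=6.0):
    """Jacobian of phi with respect to q = (x, y)."""
    x, y = q
    return [[2.0 * x, 2.0 * y], [2.0 * (x - d), 2.0 * y]]


def solve_2x2(a, b):
    """Solve a 2x2 linear system with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]


def newton_position(q0, tol=1e-12, max_iter=50):
    """Newton-Raphson (assumed solver): repeat J(q) dq = -phi(q), q <- q + dq."""
    q = list(q0)
    for _ in range(max_iter):
        residual = phi(q)
        if max(abs(v) for v in residual) < tol:
            break
        dq = solve_2x2(jacobian(q), [-v for v in residual])
        q = [qi + di for qi, di in zip(q, dq)]
    return q


q = newton_position([2.0, 3.0])  # initial guess close to the upper solution
```

Each iteration solves one linear system with the Jacobian of the constraints. In the deck, this is the kind of matrix problem that is handed to a dense or sparse library.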

8 Introduction
Structural Analysis: splits the MBS into Structural Groups (SGs)
Kinematic Structure: how many SGs, their kind and order
[Figure: The Stewart Platform]
- terminal (SG-T) (8): 12 dependent coordinates
- handle-stick (SG-H) (2-7): 15 dependent coordinates

10 Motivation
1. A simulator for the computational kinematic analysis of MBS, to allow us to analyze the efficiency of the group equations
2. A better exploitation of the computer resources by applying parallelism, to reduce execution times in real-time applications

13 Parallelism in the Structural Groups method
The Stewart Platform (MBS) is used as a case study to analyze the application of parallelism for speeding up the kinematic analysis based on Group equations

16 Parallelism in the Structural Groups method
A scheme of the Group Equations method:

    for number of external iterations (tend/dt) do
        Solve kinematic of terminal (size nsg-t)          // MKL parallelism
        for all structural components (nsg) do            // OpenMP parallelism
            for number of internal iterations (tend2) do
                Solve kinematic of SC (size nsg-hs)       // MKL parallelism
            end for
        end for
    end for

Parallelism can be exploited by simultaneously solving the problems for the SGs in the system, inside a multicore system (MKL) or with calls to GPU (MAGMA).

- tend: maximum execution time
- dt: time step
- tend2: number of iterations for the position problem
- nsg: number of structural groups
- nsg-t: dimension of the SG-T matrix
- nsg-hs: dimension of the SG-HS matrix

17 Parallelism in the Structural Groups method
We have exploited the parallelism in different ways:
1. GEMKL: the multithreaded version of MKL.
2. GEOMP+MKL: OpenMP is used to start the threads, which work simultaneously on the solution of different SGs. The matrix problems for each group are solved by calling MKL, which can be sequential or multithreaded.
3. GEOMP+MA27: OpenMP parallelism is exploited, with calls to the routine MA27 for the solution of the matrix problems.
4. GEMAGMA: GPU parallelism is exploited by solving the matrix problems with MAGMA.
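The loop structure of the Group Equations scheme, with one worker per structural group, can be sketched as follows. This is only a structural illustration under stated assumptions: a thread pool stands in for the OpenMP threads, a hand-written Gaussian elimination stands in for the MKL or MA27 call, and `make_system` builds hypothetical diagonally dominant matrices instead of the real constraint Jacobians. Pure-Python threads will not show the speed-ups reported in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_linear(a, b):
    """Gaussian elimination with partial pivoting (stand-in for a library solver)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def make_system(size, shift):
    """Hypothetical test system whose exact solution is 1, 2, ..., size."""
    a = [[float(size + shift) if i == j else 1.0 for j in range(size)]
         for i in range(size)]
    x_true = [float(i + 1) for i in range(size)]
    b = [sum(a[i][j] * x_true[j] for j in range(size)) for i in range(size)]
    return a, b


def solve_group(group_id, nsg_hs, tend2):
    """Inner loop: tend2 iterations of the matrix problem of one group."""
    a, b = make_system(nsg_hs, group_id)
    x = None
    for _ in range(tend2):
        x = solve_linear(a, b)
    return x


def group_equations(steps, nsg, nsg_t, nsg_hs, tend2, threads):
    """Outer time loop: terminal first, then all groups concurrently."""
    result = None
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for _ in range(steps):
            terminal = solve_linear(*make_system(nsg_t, 0))
            groups = list(pool.map(lambda g: solve_group(g, nsg_hs, tend2),
                                   range(1, nsg + 1)))
            result = (terminal, groups)
    return result


terminal, groups = group_equations(steps=2, nsg=6, nsg_t=12, nsg_hs=15,
                                   tend2=1, threads=3)
```

The call uses the Stewart Platform sizes from the slides (nsg=6, nsg-t=12, nsg-hs=15) and three workers. The number of steps is kept small for illustration.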

20 Results
CPU Intel Core i GHz: 4 cores, no Hyper-Threading, 16 GB RAM
- MKL (GMKL) and MA27 (GMA27), dense and sparse solvers, are used for the Global formulation
- GEOMP+MKL and GEOMP+MA27 are used for the Group equations (Topological formulation)

21 Results
Global formulations vs Group Equations
[Table: Total size | nsg-hs | GMKL (time, th.) | GMA27 (time) | GEOMP+MKL (time, th. x th.) | GEOMP+MA27 (time, th.)]
Experiments: number of groups of the SP (nsg=6); nsg-t=12 and nsg-hs=15 for the smallest problem, and nsg-t is fixed to 24 and nsg-hs= for the other problems. Total size represents the size of the matrices for the global formulation. tend=200, dt=0.01, iterations, tend2=1 for the parallel algorithm for the G. E.

22 Results
Global formulations vs Group Equations
Multithreaded MKL is preferable to MA27 for small sizes (no sparsity)

23 Results
Global formulations vs Group Equations
- The exploitation of sparsity through MA27 is advisable for large matrices: low complexity cost of MA27 (O(n)) vs MKL (O(n^3))
- The best results are obtained with 3 OpenMP threads

24 Speed-up Results
Global formulations vs Group Equations
[Figure: speed-up vs total size; series GMA27/GMKL, GMA27/GEOMP+MKL, GMA27/GEOMP+MA]
- Speed-ups of the parallel versions of GE in relation to the GF with MA27
- The GE method clearly outperforms the GF
- Speed-ups of up to 8 for small sizes and close to 4 for the largest problems

25 Results
Application of parallelism for MBS larger than our case study
CPU Intel Xeon E GHz: 2 hexa-cores = 12 cores, 32 GB RAM
- MKL and MA27 sequential (dense and sparse solvers) for the Group equations
- GEOMP+MKL and GEOMP+MA27 are used for the Group equations
- nsg = 6, 16, 22
- nsg-hs = 15, 30, 60, 120
- nsg-t = 12 (nsg-hs=15) and 24 in other cases

27 Results
Application of parallelism for MBS larger than our case study
[Table: nsg | nsg-hs | MKL | MA27 | GEOMP+MKL | GEOMP+MA]
- The improvement increases with the number of groups and the number of coordinates.
- Exploiting sparsity becomes advantageous somewhere between 60 and 120 coordinates.
- Two levels of parallelism can be used for a better exploitation of the parallelism in larger multicore systems.
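With two levels of parallelism, the cores have to be divided between threads that work on different groups and threads inside each linear-algebra call. The helper below is a hypothetical sketch (the name `thread_splits` and the rule of capping the outer level at the number of groups are assumptions, not taken from the slides) that lists the candidate combinations for the 12-core system.

```python
def thread_splits(cores, nsg=None):
    """Return (outer, inner) thread counts whose product uses every core.

    outer: threads working on different structural groups (OpenMP level).
    inner: threads inside each linear-algebra call (library level).
    If nsg is given, outer is capped at the number of groups (assumed rule).
    """
    splits = []
    for outer in range(1, cores + 1):
        if cores % outer != 0:
            continue  # keep only exact factorizations of the core count
        if nsg is not None and outer > nsg:
            continue  # more outer threads than groups would leave threads idle
        splits.append((outer, cores // outer))
    return splits


splits = thread_splits(12, nsg=6)
print(splits)  # [(1, 12), (2, 6), (3, 4), (4, 3), (6, 2)]
```

The list includes the 6x2 and 2x6 OpenMP x MKL configurations that the deck compares on the Xeon. Which one is fastest has to be measured for each problem size.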

28 Results
Raspberry Pi versus Multicore Systems
[Figure: The Stewart Platform]
Raspberry Pi: small and cheap systems with low energy consumption
- Raspberry Pi 2 Model B (RP2): 4 cores, ARMv7, 32 bits
- Raspberry Pi 3 Model B (RP3): 4 cores, ARMv8, 64 bits
- MKL is not available; LAPACK without multithreading is used for LAR
Multicore systems:
- CPU Intel Core i GHz (MKL): 4 cores, no Hyper-Threading, 16 GB RAM
- CPU Intel Xeon E GHz (MKL): 2 hexa-cores = 12 cores, 32 GB RAM

29 Results
Raspberry Pi versus Multicore Systems
[Table: nsg-t | nsg-hs | Execution Time (RP2, RP3, i5, E5) | Energy consumption (RP2, RP3, i5, E)]
Number of groups: nsg=6, tend=300, tend2=200

30 Results
Raspberry Pi versus Multicore Systems
TDP: RP2 = RP3 = 4 W, i5 = 15 W, E5 = 95 W
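The slides list TDP values next to an energy-consumption column but do not say how the energy figures were obtained. A common rough estimate, assumed here, is energy = TDP x execution time. The execution times in the sketch are hypothetical placeholders, not the measured values from the deck. Only the TDP values come from the slide.

```python
# TDP values in watts, as listed on the slide.
TDP_WATTS = {"RP2": 4.0, "RP3": 4.0, "i5": 15.0, "E5": 95.0}


def energy_joules(platform, seconds):
    """Rough energy estimate (assumed model): TDP in watts times time in seconds."""
    return TDP_WATTS[platform] * seconds


# Hypothetical execution times in seconds, for illustration only.
hypothetical_times = {"RP2": 10.0, "RP3": 6.0, "i5": 2.0, "E5": 1.0}

energy = {p: energy_joules(p, t) for p, t in hypothetical_times.items()}
best = min(energy, key=energy.get)  # platform with the lowest estimated energy
```

Under this model a slower platform can still come out ahead on energy when its TDP is low enough. That is the trade-off the Raspberry Pi comparison explores.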

31 Results
Raspberry Pi versus Multicore Systems
Lowest TDP:
- i5: much lower power consumption, slower
- RP: lowest TDP for the smallest size (SP)
- Competitive with general-purpose (GP) multicores for control problems
- Low power consumption, price and size

32 Results
Experiments on GPU
- CPU Intel Xeon E GHz (MKL): 2 hexa-cores = 12 cores, 32 GB RAM
- GPU GTX950 (MAGMA)

33 Results
Experiments on GPU
[Table: nsg-t | nsg-hs | Xeon OpenMPxMKL (6x2, 2x6) | GPU (MAGMA) | Speed-up 6x2/2x6 | Speed-up 6x2/MAGMA]
- GPU would be advantageous only for large problems
- Not advisable for low execution time or low power consumption

36 Conclusions
- The computational kinematic formulation based on group equations is a topological approach that exploits the kinematic structure of an MBS to divide it into several SGs of smaller sizes
- Parallel programming techniques can be applied to solve the equations independently
- The SP has been used to analyze the Group Equations formulation
- Lower execution times are obtained with the GE method in comparison with the global formulation
- The speed-ups achieved are between 4 and 8
- Raspberry Pi: a good alternative to general purpose multicores for small control problems, with similar times and lower price, power consumption and space
- The massive parallelism of GPUs is appropriate for large problems

37 Future work
- The use of other computational libraries (dense and sparse)
- Auto-tuning techniques should be included in the routines to select the best parallel strategy and library, together with the values of some parameters (number of threads, number of steps)
- For large MBS, the use of message-passing parallelism needs to be analyzed


Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Datenbanksysteme II: Modern Hardware. Stefan Sprenger November 23, 2016

Datenbanksysteme II: Modern Hardware. Stefan Sprenger November 23, 2016 Datenbanksysteme II: Modern Hardware Stefan Sprenger November 23, 2016 Content of this Lecture Introduction to Modern Hardware CPUs, Cache Hierarchy Branch Prediction SIMD NUMA Cache-Sensitive Skip List

More information

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report ESPRESO ExaScale PaRallel FETI Solver Hybrid FETI Solver Report Lubomir Riha, Tomas Brzobohaty IT4Innovations Outline HFETI theory from FETI to HFETI communication hiding and avoiding techniques our new

More information

Intel Performance Libraries

Intel Performance Libraries Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

School of Computer and Information Science

School of Computer and Information Science School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

MUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA

MUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA The MUMPS library Abdou Guermouche and MUMPS team, Univ. Bordeaux 1 and INRIA June 22-24, 2010 MUMPS Outline MUMPS status Recently added features MUMPS and multicores? Memory issues GPU computing Future

More information

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen pauldj@aices.rwth-aachen.de ComplexHPC Spring School 2013 Heterogeneous computing - Impact

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

Large-scale Deep Unsupervised Learning using Graphics Processors

Large-scale Deep Unsupervised Learning using Graphics Processors Large-scale Deep Unsupervised Learning using Graphics Processors Rajat Raina Anand Madhavan Andrew Y. Ng Stanford University Learning from unlabeled data Classify vs. car motorcycle Input space Unlabeled

More information

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

More information

Computational Aspects and Recent Improvements in the Open-Source Multibody Analysis Software MBDyn

Computational Aspects and Recent Improvements in the Open-Source Multibody Analysis Software MBDyn Computational Aspects and Recent Improvements in the Open-Source Multibody Analysis Software MBDyn Pierangelo Masarati, Marco Morandini, Giuseppe Quaranta and Paolo Mantegazza Dipartimento di Ingegneria

More information

Alex A. Granovsky Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University, Moscow, Russia May 10, 2003

Alex A. Granovsky Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University, Moscow, Russia May 10, 2003 New efficient large-scale fully asynchronous parallel algorithm for calculation of canonical MP2 energies. Alex A. Granovsky Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University,

More information

Large scale Imaging on Current Many- Core Platforms

Large scale Imaging on Current Many- Core Platforms Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

Behavioral Data Mining. Lecture 12 Machine Biology

Behavioral Data Mining. Lecture 12 Machine Biology Behavioral Data Mining Lecture 12 Machine Biology Outline CPU geography Mass storage Buses and Networks Main memory Design Principles Intel i7 close-up From Computer Architecture a Quantitative Approach

More information

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from www.netlib.org Additionally Intel MKL provides

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming

More information

Optimizing the operations with sparse matrices on Intel architecture

Optimizing the operations with sparse matrices on Intel architecture Optimizing the operations with sparse matrices on Intel architecture Gladkikh V. S. victor.s.gladkikh@intel.com Intel Xeon, Intel Itanium are trademarks of Intel Corporation in the U.S. and other countries.

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

Dealing with Asymmetry for Performance and Energy Efficiency

Dealing with Asymmetry for Performance and Energy Efficiency Dealing with Asymmetryfor Performance and Energy Efficiency Enrique S. QUINTANA-ORTÍ Motivation Moore s law is alive, but Dennard s scaling is over Motivation Welcome dark silicon and asymmetric architectures

More information

Optimisation Myths and Facts as Seen in Statistical Physics

Optimisation Myths and Facts as Seen in Statistical Physics Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY

More information

MAGMA Library. version 0.1. S. Tomov J. Dongarra V. Volkov J. Demmel

MAGMA Library. version 0.1. S. Tomov J. Dongarra V. Volkov J. Demmel MAGMA Library version 0.1 S. Tomov J. Dongarra V. Volkov J. Demmel 2 -- MAGMA (version 0.1) -- Univ. of Tennessee, Knoxville Univ. of California, Berkeley Univ. of Colorado, Denver June 2009 MAGMA project

More information

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation

More information

Contour Detection on Mobile Platforms

Contour Detection on Mobile Platforms Contour Detection on Mobile Platforms Bor-Yiing Su, subrian@eecs.berkeley.edu Prof. Kurt Keutzer, keutzer@eecs.berkeley.edu Parallel Computing Lab, University of California, Berkeley 1/26 Diagnosing Power/Performance

More information

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines Available online at www.sciencedirect.com Procedia Computer Science 9 (2012 ) 17 26 International Conference on Computational Science, ICCS 2012 A class of communication-avoiding algorithms for solving

More information

All routines were built with VS2010 compiler, OpenMP 2.0 and TBB 3.0 libraries were used to implement parallel versions of programs.

All routines were built with VS2010 compiler, OpenMP 2.0 and TBB 3.0 libraries were used to implement parallel versions of programs. technologies for multi-core numeric computation In order to compare ConcRT, OpenMP and TBB technologies, we implemented a few algorithms from different areas of numeric computation and compared their performance

More information

Intel Math Kernel Library

Intel Math Kernel Library Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra

More information

Algorithms, System and Data Centre Optimisation for Energy Efficient HPC

Algorithms, System and Data Centre Optimisation for Energy Efficient HPC 2015-09-14 Algorithms, System and Data Centre Optimisation for Energy Efficient HPC Vincent Heuveline URZ Computing Centre of Heidelberg University EMCL Engineering Mathematics and Computing Lab 1 Energy

More information

HPC Architectures. Types of resource currently in use

HPC Architectures. Types of resource currently in use HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Strong Bridges and Strong Articulation Points of Directed Graphs

Strong Bridges and Strong Articulation Points of Directed Graphs Strong Bridges and Strong Articulation Points of Directed Graphs Giuseppe F. Italiano Univ. of Rome Tor Vergata Based on joint work with Donatella Firmani, Luigi Laura, Alessio Orlandi and Federico Santaroni

More information

Software within building physics and ground heat storage. HEAT3 version 7. A PC-program for heat transfer in three dimensions Update manual

Software within building physics and ground heat storage. HEAT3 version 7. A PC-program for heat transfer in three dimensions Update manual Software within building physics and ground heat storage HEAT3 version 7 A PC-program for heat transfer in three dimensions Update manual June 15, 2015 BLOCON www.buildingphysics.com Contents 1. WHAT S

More information

How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture

How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture Dirk Schmidl, Christian Terboven, Andreas Wolf, Dieter an Mey, Christian Bischof IEEE Cluster 2010 / Heraklion September 21, 2010

More information

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware

More information

Accelerating Financial Applications on the GPU

Accelerating Financial Applications on the GPU Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

NEW ADVANCES IN GPU LINEAR ALGEBRA

NEW ADVANCES IN GPU LINEAR ALGEBRA GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear

More information

Optimising the Mantevo benchmark suite for multi- and many-core architectures

Optimising the Mantevo benchmark suite for multi- and many-core architectures Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of

More information

*Yuta SAWA and Reiji SUDA The University of Tokyo

*Yuta SAWA and Reiji SUDA The University of Tokyo Auto Tuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS *Yuta SAWA and Reiji SUDA The University of Tokyo iwapt 29 October 1-2 *Now in Central Research Laboratory, Hitachi,

More information

HPC and IT Issues Session Agenda. Deployment of Simulation (Trends and Issues Impacting IT) Mapping HPC to Performance (Scaling, Technology Advances)

HPC and IT Issues Session Agenda. Deployment of Simulation (Trends and Issues Impacting IT) Mapping HPC to Performance (Scaling, Technology Advances) HPC and IT Issues Session Agenda Deployment of Simulation (Trends and Issues Impacting IT) Discussion Mapping HPC to Performance (Scaling, Technology Advances) Discussion Optimizing IT for Remote Access

More information

Improving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine

Improving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine Improving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine Samuel Cremer 1,2, Michel Bagein 1, Saïd Mahmoudi 1, Pierre Manneback 1 1 UMONS, University of Mons Computer Science

More information

Application Performance on Dual Processor Cluster Nodes

Application Performance on Dual Processor Cluster Nodes Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys

More information

Quantum ESPRESSO on GPU accelerated systems

Quantum ESPRESSO on GPU accelerated systems Quantum ESPRESSO on GPU accelerated systems Massimiliano Fatica, Everett Phillips, Josh Romero - NVIDIA Filippo Spiga - University of Cambridge/ARM (UK) MaX International Conference, Trieste, Italy, January

More information

Advances of parallel computing. Kirill Bogachev May 2016

Advances of parallel computing. Kirill Bogachev May 2016 Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being

More information

MAGMA QR, 1 GPU, All Available Cores. Mitch Horton, Stanimire Tomov, Jack Dongarra Innovative Computing Laboratory

MAGMA QR, 1 GPU, All Available Cores. Mitch Horton, Stanimire Tomov, Jack Dongarra Innovative Computing Laboratory A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures MAGMA QR, 1 GPU, All Available Cores Mitch Horton, Stanimire Tomov, Jack Dongarra Innovative Computing Laboratory University of Tennessee

More information