International Conference on Computational Science (ICCS 2017)
Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations
G. Bernabé, J. C. Cano, J. Cuenca, A. Flores, D. Giménez, M. Saura-Sánchez Ω and P. Segado-Cabezos Ω
Computer Engineering Department, University of Murcia; Computer Science and Systems Department, University of Murcia; Ω Mechanical Engineering, Technical University of Cartagena
June 2017
Outline: Introduction and Motivation. Parallelism in the Structural Groups method. Results. Conclusions and Future work.
Introduction. Multibody systems (MBS): mechanical systems formed by rigid and flexible bodies connected by mechanical joints in such a way that there is relative movement between their bodies. [Figure: the Stewart Platform, with the terminal, handles and platform labeled.]
Introduction. The study of the relationships between the bodies is known as kinematic modeling. It selects a vector q of coordinates to define the position and orientation of each body of the MBS in space. The coordinates are related by means of a nonlinear system of constraint equations, Φ(q) = 0. Two families of methods exist: global formulations and topological formulations; the latter exploit the topology of the MBS to reduce the dimension of the problem by relating the position of each body to that of its preceding one.
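The position problem Φ(q) = 0 is the nonlinear core of each kinematic step and is typically solved with Newton-Raphson iterations. A minimal, self-contained sketch (not the authors' code; the planar two-link chain, link lengths and target point are made-up illustrations):

```python
import math

# Hypothetical planar two-link chain: find joint angles q = (t1, t2) so that
# the end point of two links of lengths L1, L2 reaches the target (px, py),
# i.e. solve the constraint system Phi(q) = 0 by Newton-Raphson.
L1, L2 = 1.0, 1.0
px, py = 1.3, 1.2

def phi(t1, t2):
    return (L1 * math.cos(t1) + L2 * math.cos(t1 + t2) - px,
            L1 * math.sin(t1) + L2 * math.sin(t1 + t2) - py)

def jacobian(t1, t2):
    s1, c1 = math.sin(t1), math.cos(t1)
    s12, c12 = math.sin(t1 + t2), math.cos(t1 + t2)
    return ((-L1 * s1 - L2 * s12, -L2 * s12),
            ( L1 * c1 + L2 * c12,  L2 * c12))

def newton(t1, t2, tol=1e-10, itmax=50):
    for _ in range(itmax):
        f1, f2 = phi(t1, t2)
        if abs(f1) < tol and abs(f2) < tol:
            break
        (a, b), (c, d) = jacobian(t1, t2)
        det = a * d - b * c
        # Solve the 2x2 linear system J * dq = -Phi by Cramer's rule;
        # for a real SG this is where a dense/sparse solver is called.
        d1 = (-f1 * d + f2 * b) / det
        d2 = (-f2 * a + f1 * c) / det
        t1, t2 = t1 + d1, t2 + d2
    return t1, t2

t1, t2 = newton(0.5, 0.5)
r1, r2 = phi(t1, t2)
```

In the group-equations setting the same iteration runs per structural group, on a much smaller Jacobian than the one of a global formulation.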
Introduction. Structural Analysis splits the MBS into Structural Groups (SGs); the Kinematic Structure records how many SGs there are, and their kind and order. For the Stewart Platform: terminal (SG-T) (8), with 12 dependent coordinates, and handle-stick (SG-H) (2-7), with 15 dependent coordinates.
Motivation. 1. A simulator for the computational kinematic analysis of MBS, to let us analyze the efficiency of the group equations. 2. Better exploitation of the computer resources, by applying parallelism to reduce execution times in real-time applications.
Outline: Introduction and Motivation. Parallelism in the Structural Groups method. Results. Conclusions and Future work.
Parallelism in the Structural Groups method. The Stewart Platform (MBS) is used as a case study to analyze the application of parallelism for speeding up the kinematic analysis based on Group Equations.
Parallelism in the Structural Groups method. A scheme of the Group Equations method:
1 for number of external iterations (tend*dt) do
2   Solve kinematics of terminal (size nsg-t)      // MKL parallelism
3   for all structural components (nsg) do         // OpenMP parallelism
4     for number of internal iterations (tend2) do
5       Solve kinematics of SC (size nsg-hs)       // MKL parallelism
6     end for
7   end for
8 end for
tend: a maximum execution time is established; dt: time step; tend2: number of iterations for the position problem; nsg: number of structural groups; nsg-t: dimension of the SG-T matrix; nsg-hs: dimension of the SG-HS matrix.
Parallelism can be exploited by simultaneously solving the problems for the SGs in the system, inside a multicore system (MKL) or with calls to the GPU (MAGMA).
We have exploited the parallelism in different ways:
1. GEMKL: the multithreaded version of MKL.
2. GEOMP+MKL: OpenMP starts the threads that work simultaneously on the solution of different SGs; the matrix problems for each group are solved by calling MKL, which can be sequential or multithreaded.
3. GEOMP+MA27: OpenMP parallelism is exploited, with calls to the routine MA27 for the solution of the matrix problem.
4. GEMAGMA: GPU parallelism is exploited by solving the matrix problems with MAGMA.
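The loop structure above can be sketched in plain Python (illustrative only: a thread pool stands in for OpenMP over the groups, a hand-written Gaussian elimination stands in for the MKL/MA27/MAGMA solvers, and all sizes and matrices are made up):

```python
import concurrent.futures
import random

def solve_dense(a, b):
    # Gaussian elimination with partial pivoting, standing in for a
    # LAPACK-style dense solver call such as dgesv.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[p] = m[p], m[k]
        for r in range(k + 1, n):
            f = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= f * m[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (m[k][n] - sum(m[k][c] * x[c] for c in range(k + 1, n))) / m[k][k]
    return x

def solve_group(size, rng):
    # Random diagonally dominant system standing in for one SG's equations.
    a = [[rng.random() + (size if i == j else 0.0) for j in range(size)]
         for i in range(size)]
    b = [rng.random() for _ in range(size)]
    return solve_dense(a, b)

nsg, nsg_t, nsg_hs, steps, tend2 = 6, 12, 15, 3, 2

def run_group(i):
    rng = random.Random(i)
    for _ in range(tend2):                 # internal (position) iterations
        solve_group(nsg_hs, rng)

with concurrent.futures.ThreadPoolExecutor(max_workers=nsg) as pool:
    for _ in range(steps):                 # external (time) iterations
        solve_group(nsg_t, random.Random(99))       # terminal group first
        for f in [pool.submit(run_group, i) for i in range(nsg)]:
            f.result()                     # synchronize all SGs per time step
```

The essential point the slide makes is visible in the structure: the terminal group is solved once per time step, then the nsg structural groups are independent and can proceed concurrently.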
Outline: Introduction. Parallelism in the Structural Groups method. Results. Conclusions and Future work.
Results. Platform: Intel Core i5 CPU, 4 cores, no Hyper-Threading, 16 GB RAM. MKL (GMKL) and MA27 (GMA27), dense and sparse solvers, are used for the Global formulation; GEOMP+MKL and GEOMP+MA27 are used for the Group Equations (topological formulation).
Results: Global formulations vs Group Equations. [Table: execution times and numbers of threads for GMKL, GMA27, GEOMP+MKL and GEOMP+MA27 at several total sizes.] Experiments: number of groups of the SP (nsg=6); nsg-t=12 and nsg-hs=15 for the smallest problem, and nsg-t fixed to 24, with larger nsg-hs, for the other problems. Total size represents the size of the matrices for the global formulation. tend=200, dt=0.01, tend2=1 for the parallel algorithm for the Group Equations.
Multithreaded MKL is preferable to MA27 for small sizes, where there is no sparsity to exploit.
The exploitation of sparsity through MA27 is advisable for large matrices, given the low complexity cost of MA27 (O(n)) versus MKL (O(n³)). The best results are obtained with 3 OpenMP threads.
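The O(n) versus O(n³) gap can be made concrete with a rough flop count. A back-of-the-envelope sketch (the block sizes and counts are illustrative, not the paper's data):

```python
# Dense LU costs roughly (2/3) n^3 flops, while a solver that exploits a
# block-diagonal structure of nb independent groups of size bs pays only
# nb * (2/3) bs^3, which grows linearly in n = nb * bs.
def dense_flops(n):
    return 2 * n ** 3 // 3

def block_diag_flops(nb, bs):
    return nb * dense_flops(bs)

bs = 15                          # e.g. one SG-HS of the Stewart Platform
for nb in (6, 12, 24):           # number of independent groups
    n = nb * bs
    print(n, dense_flops(n), block_diag_flops(nb, bs))
```

This is the same asymmetry that lets MA27 (which skips the structural zeros) overtake dense MKL once the matrices are large and sparse enough.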
Speed-up results: Global formulations vs Group Equations. [Chart: speed-ups GMA27/GMKL, GMA27/GEOMP+MKL and GMA27/GEOMP+MA27 versus total size.] Speed-ups of the parallel versions of GE relative to the GF with MA27: the GE method clearly outperforms the GF, with speed-ups of up to 8 for small sizes and close to 4 for the largest problems.
Results: application of parallelism for MBS larger than our case study. Platform: Intel Xeon E5 CPU, 2 hexa-cores = 12 cores, 32 GB RAM. MKL and MA27 sequential (dense and sparse solvers) and GEOMP+MKL and GEOMP+MA27 are used for the Group Equations. nsg = 6, 16, 22; nsg-hs = 15, 30, 60, 120; nsg-t = 12 (for nsg-hs=15) and 24 in the other cases.
[Table: execution times for MKL, MA27, GEOMP+MKL and GEOMP+MA27 for the different combinations of nsg and nsg-hs.]
The improvement increases with the number of groups and the number of coordinates. The exploitation of sparsity becomes advantageous between 60 and 120 coordinates. Two levels of parallelism can be used for a better exploitation of the parallelism in larger multicore systems.
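With two levels of parallelism, the number of OpenMP threads (across groups) times the number of MKL threads (inside each solve) should not oversubscribe the node. A small sketch enumerating the candidate configurations on the 12-core Xeon of the slide (the enumeration itself is an assumed auto-tuning step, not the authors' code):

```python
cores = 12   # the 2 x hexa-core Xeon node from the slide

# Candidate (outer OpenMP threads, inner MKL threads) pairs whose product
# does not exceed the core count; an auto-tuner would time each pair and
# keep the fastest. Both 6x2 (one thread per SG of a 6-group system) and
# 2x6 appear among the candidates.
configs = [(outer, inner)
           for outer in range(1, cores + 1)
           for inner in range(1, cores + 1)
           if outer * inner <= cores]

print(configs)
```

The 6×2 and 2×6 configurations compared later in the GPU experiments are two points in exactly this search space.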
Results: Raspberry Pi versus multicore systems. Raspberry Pis are small and cheap systems with low energy consumption: Raspberry Pi 2 Model B (RP2), 4 cores, ARMv7, 32 bits; Raspberry Pi 3 Model B (RP3), 4 cores, ARMv8, 64 bits. MKL is not available on them, so LAPACK without multithreading is used for the linear algebra routines. Multicore systems: Intel Core i5 CPU (MKL), 4 cores, no Hyper-Threading, 16 GB RAM; Intel Xeon E5 CPU (MKL), 2 hexa-cores = 12 cores, 32 GB RAM.
[Table: execution time and energy consumption on RP2, RP3, i5 and E5 for several nsg-t/nsg-hs combinations.] Number of groups: nsg=6; tend=300; tend2=200.
TDP: RP2 = RP3 = 4 W; i5 = 15 W; E5 = 95 W.
The i5 has a much lower power consumption than the E5, though it is slower. The Raspberry Pis give the lowest energy consumption for the smallest size (the SP) and are competitive for control problems, thanks to their low power consumption, price and size.
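The energy comparison behind these conclusions follows from the simple estimate E ≈ TDP × execution time. A minimal sketch (the TDP values come from the slide; the execution times are hypothetical placeholders, since the measured times did not survive extraction):

```python
def energy_joules(tdp_watts, seconds):
    # Crude estimate: energy = rated TDP times execution time. It shows why
    # a slow, low-TDP board can still win on energy for small problems.
    return tdp_watts * seconds

# TDPs from the slide; the times are made-up placeholders, not measurements.
platforms = {"RP2": (4, 60.0), "RP3": (4, 45.0), "i5": (15, 12.0), "E5": (95, 6.0)}
for name, (tdp, t) in platforms.items():
    print(name, energy_joules(tdp, t), "J")
```

With these placeholder numbers the 95 W Xeon spends more energy than either Pi despite finishing an order of magnitude faster, which is the trade-off the slide is pointing at.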
Results: experiments on GPU. Platform: Intel Xeon E5 CPU (MKL), 2 hexa-cores = 12 cores, 32 GB RAM, plus a GTX 950 GPU (MAGMA).
[Table: execution times for the Xeon with OpenMP×MKL configurations 6×2 and 2×6 and for the GPU with MAGMA, with speed-ups 6×2/2×6 and 6×2/MAGMA, for several nsg-t/nsg-hs combinations.] The GPU would be advantageous only for large problems; it is not advisable when low execution times or low power consumption are required.
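The observation that the GPU pays off only for large problems suggests a simple size-based dispatch between backends. A hypothetical sketch (the threshold value and backend names are assumptions for illustration, not values measured in the paper):

```python
GPU_THRESHOLD = 120   # assumed crossover size; a real value would be tuned

def choose_backend(n):
    # Route a group's matrix problem to MAGMA on the GPU only when it is
    # large enough to amortize the host-device transfer overhead; keep
    # small problems on the CPU with MKL.
    return "magma_gpu" if n >= GPU_THRESHOLD else "mkl_cpu"

print(choose_backend(15), choose_backend(240))
```

A tuned threshold of this kind is one concrete form the auto-tuning mentioned in the future work could take.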
Outline: Introduction. Parallelism in the Structural Groups method. Results. Conclusions and Future work.
Conclusions. The computational kinematic formulation based on group equations is a topological approach that exploits the kinematic structure of an MBS to divide it into several SGs of smaller sizes, so that parallel programming techniques can be applied to solve their equations independently. The SP has been used to analyze the Group Equations formulation: lower execution times are obtained with the GE method than with the global formulation, with speed-ups between 4 and 8. The Raspberry Pi is a good alternative to general-purpose multicores for small control problems, with similar times and lower price, power consumption and space. The massive parallelism of GPUs is appropriate for large problems.
Future work. The use of other computational libraries (dense and sparse). Auto-tuning techniques should be included in the routines to select the best parallel strategy and library, together with the values of some parameters (number of threads, number of steps). For large MBS, the use of message-passing parallelism needs to be analyzed.
More informationHigh performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli
High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering
More informationLarge-scale Deep Unsupervised Learning using Graphics Processors
Large-scale Deep Unsupervised Learning using Graphics Processors Rajat Raina Anand Madhavan Andrew Y. Ng Stanford University Learning from unlabeled data Classify vs. car motorcycle Input space Unlabeled
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More informationComputational Aspects and Recent Improvements in the Open-Source Multibody Analysis Software MBDyn
Computational Aspects and Recent Improvements in the Open-Source Multibody Analysis Software MBDyn Pierangelo Masarati, Marco Morandini, Giuseppe Quaranta and Paolo Mantegazza Dipartimento di Ingegneria
More informationAlex A. Granovsky Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University, Moscow, Russia May 10, 2003
New efficient large-scale fully asynchronous parallel algorithm for calculation of canonical MP2 energies. Alex A. Granovsky Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University,
More informationLarge scale Imaging on Current Many- Core Platforms
Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationBehavioral Data Mining. Lecture 12 Machine Biology
Behavioral Data Mining Lecture 12 Machine Biology Outline CPU geography Mass storage Buses and Networks Main memory Design Principles Intel i7 close-up From Computer Architecture a Quantitative Approach
More informationIntel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager
Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from www.netlib.org Additionally Intel MKL provides
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming
More informationOptimizing the operations with sparse matrices on Intel architecture
Optimizing the operations with sparse matrices on Intel architecture Gladkikh V. S. victor.s.gladkikh@intel.com Intel Xeon, Intel Itanium are trademarks of Intel Corporation in the U.S. and other countries.
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationDealing with Asymmetry for Performance and Energy Efficiency
Dealing with Asymmetryfor Performance and Energy Efficiency Enrique S. QUINTANA-ORTÍ Motivation Moore s law is alive, but Dennard s scaling is over Motivation Welcome dark silicon and asymmetric architectures
More informationOptimisation Myths and Facts as Seen in Statistical Physics
Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY
More informationMAGMA Library. version 0.1. S. Tomov J. Dongarra V. Volkov J. Demmel
MAGMA Library version 0.1 S. Tomov J. Dongarra V. Volkov J. Demmel 2 -- MAGMA (version 0.1) -- Univ. of Tennessee, Knoxville Univ. of California, Berkeley Univ. of Colorado, Denver June 2009 MAGMA project
More informationScheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok
Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation
More informationContour Detection on Mobile Platforms
Contour Detection on Mobile Platforms Bor-Yiing Su, subrian@eecs.berkeley.edu Prof. Kurt Keutzer, keutzer@eecs.berkeley.edu Parallel Computing Lab, University of California, Berkeley 1/26 Diagnosing Power/Performance
More informationA class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines
Available online at www.sciencedirect.com Procedia Computer Science 9 (2012 ) 17 26 International Conference on Computational Science, ICCS 2012 A class of communication-avoiding algorithms for solving
More informationAll routines were built with VS2010 compiler, OpenMP 2.0 and TBB 3.0 libraries were used to implement parallel versions of programs.
technologies for multi-core numeric computation In order to compare ConcRT, OpenMP and TBB technologies, we implemented a few algorithms from different areas of numeric computation and compared their performance
More informationIntel Math Kernel Library
Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra
More informationAlgorithms, System and Data Centre Optimisation for Energy Efficient HPC
2015-09-14 Algorithms, System and Data Centre Optimisation for Energy Efficient HPC Vincent Heuveline URZ Computing Centre of Heidelberg University EMCL Engineering Mathematics and Computing Lab 1 Energy
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationStrong Bridges and Strong Articulation Points of Directed Graphs
Strong Bridges and Strong Articulation Points of Directed Graphs Giuseppe F. Italiano Univ. of Rome Tor Vergata Based on joint work with Donatella Firmani, Luigi Laura, Alessio Orlandi and Federico Santaroni
More informationSoftware within building physics and ground heat storage. HEAT3 version 7. A PC-program for heat transfer in three dimensions Update manual
Software within building physics and ground heat storage HEAT3 version 7 A PC-program for heat transfer in three dimensions Update manual June 15, 2015 BLOCON www.buildingphysics.com Contents 1. WHAT S
More informationHow to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture
How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture Dirk Schmidl, Christian Terboven, Andreas Wolf, Dieter an Mey, Christian Bischof IEEE Cluster 2010 / Heraklion September 21, 2010
More informationParallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.
Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware
More informationAccelerating Financial Applications on the GPU
Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationNEW ADVANCES IN GPU LINEAR ALGEBRA
GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear
More informationOptimising the Mantevo benchmark suite for multi- and many-core architectures
Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of
More information*Yuta SAWA and Reiji SUDA The University of Tokyo
Auto Tuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS *Yuta SAWA and Reiji SUDA The University of Tokyo iwapt 29 October 1-2 *Now in Central Research Laboratory, Hitachi,
More informationHPC and IT Issues Session Agenda. Deployment of Simulation (Trends and Issues Impacting IT) Mapping HPC to Performance (Scaling, Technology Advances)
HPC and IT Issues Session Agenda Deployment of Simulation (Trends and Issues Impacting IT) Discussion Mapping HPC to Performance (Scaling, Technology Advances) Discussion Optimizing IT for Remote Access
More informationImproving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine
Improving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine Samuel Cremer 1,2, Michel Bagein 1, Saïd Mahmoudi 1, Pierre Manneback 1 1 UMONS, University of Mons Computer Science
More informationApplication Performance on Dual Processor Cluster Nodes
Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys
More informationQuantum ESPRESSO on GPU accelerated systems
Quantum ESPRESSO on GPU accelerated systems Massimiliano Fatica, Everett Phillips, Josh Romero - NVIDIA Filippo Spiga - University of Cambridge/ARM (UK) MaX International Conference, Trieste, Italy, January
More informationAdvances of parallel computing. Kirill Bogachev May 2016
Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being
More informationMAGMA QR, 1 GPU, All Available Cores. Mitch Horton, Stanimire Tomov, Jack Dongarra Innovative Computing Laboratory
A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures MAGMA QR, 1 GPU, All Available Cores Mitch Horton, Stanimire Tomov, Jack Dongarra Innovative Computing Laboratory University of Tennessee
More information