International Conference on Computational Science (ICCS 2017)

Similar documents
Improving Linear Algebra Computation on NUMA platforms through auto-tuned tuned nested parallelism

Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU plus Multi-GPU Platforms

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE

PARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development

MAGMA: a New Generation

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

Fast and reliable linear system solutions on new parallel architectures

Parallel FFT Program Optimizations on Heterogeneous Computers

Some notes on efficient computing and high performance computing environments

Brief notes on setting up semi-high performance computing environments. July 25, 2014

Modelling and optimisation of scientific software for multicore platforms

Mixed Precision Methods

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Document downloaded from: This paper must be cited as:

CafeGPI. Single-Sided Communication for Scalable Deep Learning

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

A comparison of Algorithms for Sparse Matrix. Real-time Multibody Dynamic Simulation

MAGMA. Matrix Algebra on GPU and Multicore Architectures

Very fast simulation of nonlinear water waves in very large numerical wave tanks on affordable graphics cards

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation

Krishnan Suresh Associate Professor Mechanical Engineering

Parallelism paradigms

Introduction II. Overview

Modern GPUs (Graphics Processing Units)

Sparse LU Factorization for Parallel Circuit Simulation on GPUs

Hierarchical DAG Scheduling for Hybrid Distributed Systems

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

HPC with Multicore and GPUs

Stan Posey, CAE Industry Development NVIDIA, Santa Clara, CA, USA

Applications of Berkeley s Dwarfs on Nvidia GPUs

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection

A proposal for autotuning linear algebra routines on multicore platforms

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

Accelerating the Iterative Linear Solver for Reservoir Simulation

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM

Parallelism on Hybrid Metaheuristics for Vector Autoregression Models

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

ME964 High Performance Computing for Engineering Applications

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

Accelerating image registration on GPUs

Parallel Programming Multicore systems

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Solving Dense Linear Systems on Graphics Processors

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

Datenbanksysteme II: Modern Hardware. Stefan Sprenger November 23, 2016

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report

Intel Performance Libraries

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

School of Computer and Information Science

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

MUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

Large-scale Deep Unsupervised Learning using Graphics Processors

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

Computational Aspects and Recent Improvements in the Open-Source Multibody Analysis Software MBDyn

Alex A. Granovsky Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University, Moscow, Russia May 10, 2003

Large scale Imaging on Current Many- Core Platforms

Trends in HPC (hardware complexity and software challenges)

CUDA. Matthew Joyner, Jeremy Williams

Behavioral Data Mining. Lecture 12 Machine Biology

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager

Introduction to Multicore Programming

Optimizing the operations with sparse matrices on Intel architecture

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

Dealing with Asymmetry for Performance and Energy Efficiency

Optimisation Myths and Facts as Seen in Statistical Physics

MAGMA Library. version 0.1. S. Tomov J. Dongarra V. Volkov J. Demmel

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok

Contour Detection on Mobile Platforms

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines

All routines were built with VS2010 compiler, OpenMP 2.0 and TBB 3.0 libraries were used to implement parallel versions of programs.

Intel Math Kernel Library

Algorithms, System and Data Centre Optimisation for Energy Efficient HPC

HPC Architectures. Types of resource currently in use

Strong Bridges and Strong Articulation Points of Directed Graphs

Software within building physics and ground heat storage. HEAT3 version 7. A PC-program for heat transfer in three dimensions Update manual

How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.

Accelerating Financial Applications on the GPU

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

NEW ADVANCES IN GPU LINEAR ALGEBRA

Optimising the Mantevo benchmark suite for multi- and many-core architectures

*Yuta SAWA and Reiji SUDA The University of Tokyo

HPC and IT Issues Session Agenda. Deployment of Simulation (Trends and Issues Impacting IT) Mapping HPC to Performance (Scaling, Technology Advances)

Improving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine

Application Performance on Dual Processor Cluster Nodes

Quantum ESPRESSO on GPU accelerated systems

Advances of parallel computing. Kirill Bogachev May 2016

MAGMA QR, 1 GPU, All Available Cores. Mitch Horton, Stanimire Tomov, Jack Dongarra Innovative Computing Laboratory

Transcription:

International Conference on Computational Science (ICCS 2017) Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations G. Bernabé, J. C. Cano, J. Cuenca, A. Flores, D. Giménez, M. Saura-Sánchez Ω and P. Segado-Cabezos Ω Computer Engineering Department, University of Murcia Computer Science and Systems Department, University of Murcia Ω Mechanical Engineering, Technical University of Cartagena 12-14 June, 2017 Conference title 1

Outline Introduction and Motivation Parallelism in the Structural Groups method Results Conclusions and Future work ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 2

Outline Introduction and Motivation Parallelism in the Structural Groups method Results Conclusions and Future work ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 3

Introduction Multibody systems (MBS): mechanical systems formed by rigid and flexible bodies which are connected by means of mechanical joins in such a way that there is relative movement between their bodies terminal handles The Stewart Platform platform ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 4

Introduction The study of the relationships between the bodies is known as kinematic modeling Selects a vector q of coordinates to define the position and orientation of each body of the MBS in the space Coordinates are related by means a nonlinear systems of constrainst equations Φ(q) = 0 ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 5

Introduction The study of the relationships between the bodies is known as kinematic modeling Selects a vector q of coordinates to define the position and orientation of each body of the MBS in the space Coordinates are related by means a nonlinear systems of constrainst equations Φ(q) = 0 Global formulations Topogical formulations ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 6

Introduction The study of the relationships between the bodies is known as kinematic modeling Selects a vector q of coordinates to define the position and orientation of each body of the MBS in the space Coordinates are related by means a nonlinear systems of constrainst equations Φ(q) = 0 Global formulations Topogical formulations exploits the topology of the MBS to reduce the dimension of the problem by relating the position of each body with respect to its preceding one ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 7

Introduction Structural Analysis: splits the MBS into Structural Groups (SGs) Kinematic Structure: How many SG, kind & order terminal (SG-T0) (8) 12 dependent coordinates The Stewart Platform handle-stick (SG-H) (2-7) 15 dependent coordinates ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 9

Introduction Structural Analysis: splits the MBS into Structural Groups (SGs) Kinematic Structure: How many SG, kind & order terminal (SG-T) (8) handle-stick (SG-H) (2-7) ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 10

Motivation 1. A simulator for the computational kinematic analysis of MBS to allow us to analyze the efficiency of the group equations 2. A better exploitation of the computer resources by applying parallelism to reduce the executions in real-time applications terminal (SG-T) (8) handle-stick (SG-H) (2-7) ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 11

Outline Introduction and Motivation Parallelism in the Structural Groups method Results Conclusions and Future work ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 12

Outline Introduction and Motivation Parallelism in the Structural Groups method Results Conclusions and Future work ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 13

Parallelism in the Structural Groups method The Stewart Platform (MBS) is a case study to analyze the application of parallelism for speeding up the kinematic analysis based on Group equations terminal (SG-T) (8) handle-stick (SG-H) (2-7) ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 14

Parallelism in the Structural Groups method A scheme of the Group Equations method 1 for number of external iterations (tend*dt) do 2 Solve kinematic of terminal (size nsg-t) //MKL p. 3 for all structural components (nsg) do //OpenMP p. 4 for number of internal iterations (tend2) do 5 Solve kinematic of SC (size nsg-hs) //MKL p. 6 end for 7 end for 8 end for ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 15

Parallelism in the Structural Groups method A scheme of the Group Equations method 1 for number of external iterations (tend*dt) do 2 Solve kinematic of terminal (size nsg-t) //MKL p. 3 for all structural components (nsg) do //OpenMP p. 4 for number of internal iterations (tend2) do 5 Solve kinematic of SC (size nsg-hs) //MKL p. 6 end for 7 end for 8 end for tend: a maximum execution time is established dt: time step tend2: number of iterations for the position problem nsg: number of structural groups nsg-t: dimension of the SG-T matrix nsg-hs: dimension of the SG-HS matrix ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 16

Parallelism in the Structural Groups method A scheme of the Group Equations method 1 for number of external iterations (tend*dt) do 2 Solve kinematic of terminal (size nsg-t) //MKL p. 3 for all structural components (nsg) do //OpenMP p. 4 for number of internal iterations (tend2) do 5 Solve kinematic of SC (size nsg-hs) //MKL p. 6 end for 7 end for 8 end for Parallelism can be exploited by simultaneously solving the problems for the SGs in the system, inside a multicore system (MKL) or with calls to GPU (MAGMA) tend: a maximum execution time is established dt: time step tend2: number of iterations for the position problem nsg: number of structural groups nsg-t: dimension of the SG-T matrix nsg-hs: dimension of the SG-HS matrix ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 17

Parallelism in the Structural Groups method A scheme of the Group Equations method 1 for number of external iterations (tend*dt) do 2 Solve kinematic of terminal (size nsg-t) //MKL p. 3 for all structural components (nsg) do //OpenMP p. 4 for number of internal iterations (tend2) do 5 Solve kinematic of SC (size nsg-hs) //MKL p. 6 end for 7 end for 8 end for We have exploited the parallelism in different ways: 1. GEMKL: The multithreading version of MKL. 2. GEOMP+MKL: OpenMP is used to start the threads which works simultaneously in the solution of different SGs. The matrix problems for each group are solved by calling MKL, which can be sequential or multithreading 3. GEOMP+MA27: OpenMP parallelism is exploited, with calls to the routine MA27 for solution of the matrix problem 4. GEMAGMA: GPU parallelism is exploited by solving the matrix problems with MAGMA. ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 18

Outline Introduction Parallelism in the Structural Groups method Results Conclusions and Future work ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 19

Outline Introduction Parallelism in the Structural Groups method Results Conclusions and Future work ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 20

Results CPU Intel Core i5-2400 3.10 GHz 4 cores No Hyper-Threading 16 GB RAM MKL GMKL and MA27 GMA27 (dense and sparse solvers) is used for the Global formulation GEOMP+MKL and GEOMP+MA27 is used for the Group equations (Topological formulation) ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 21

Results Global formulations vs Group Equations Total size nsg-hs GMKL GMA27 GEOMP+MKL GEOMP+MA27 time th. time time th. x th. time th 102 15 2.52 1 4.19 0.53 3x1 0.74 3 204 30 9.11 3 9.02 1.13 3x1 1.40 3 384 60 32.43 3 13.45 2.39 3x1 2.84 3 744 120 151.94 4 25.29 8.40 3x1 5.71 3 1464 240 1127.25 4 48.26 42.33 3x1 12.23 3 2184 360 3579.63 4 71.27 147.41 3x1 18.23 3 2904 480 7149.04 4 94.19 323.77 3x1 24.35 3 4344 720 22840.51 4 140.13 828.97 1x4 36.59 3 Experiments: number of groups of the SP (nsg=6), nsg-t=12 and nsg-hs=15 for the smallest problem, and nsg-t is fixed to 24 and nsg-hs=30..720 for the other problems. Total size represents the size of the matrices for the global formulation. tend=200, dt=0.01, 20000 iterations, tend2=1 for the parallel algorithm for the G. E. ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 22

Results Global formulations vs Group Equations Total size nsg-hs GMKL GMA27 GEOMP+MKL GEOMP+MA27 time th. time time th. x th. time th 102 15 2.52 1 4.19 0.53 3x1 0.74 3 204 30 9.11 3 9.02 1.13 3x1 1.40 3 384 60 32.43 3 13.45 2.39 3x1 2.84 3 744 120 151.94 4 25.29 8.40 3x1 5.71 3 1464 240 1127.25 4 48.26 42.33 3x1 12.23 3 2184 360 3579.63 4 71.27 147.41 3x1 18.23 3 2904 480 7149.04 4 94.19 323.77 3x1 24.35 3 4344 720 22840.51 4 140.13 828.97 1x4 36.59 3 Multithreading MKL is preferable to MA27 for small sizes not sparsity ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 23

Results Global formulations vs Group Equations Total size nsg-hs GMKL GMA27 GEOMP+MKL GEOMP+MA27 time th. time time th. x th. time th 102 15 2.52 1 4.19 0.53 3x1 0.74 3 204 30 9.11 3 9.02 1.13 3x1 1.40 3 384 60 32.43 3 13.45 2.39 3x1 2.84 3 744 120 151.94 4 25.29 8.40 3x1 5.71 3 1464 240 1127.25 4 48.26 42.33 3x1 12.23 3 2184 360 3579.63 4 71.27 147.41 3x1 18.23 3 2904 480 7149.04 4 94.19 323.77 3x1 24.35 3 4344 720 22840.51 4 140.13 828.97 1x4 36.59 3 The exploitation of sparsity through MA27 is advisable for large matrices low complexity cost of MA27 (O(n)) vs MKL (O(n 3 )) The best results are obtained with 3 OpenMP threads ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 24

Spped-up Results Global formulations vs Group Equations 9 GMA27/GMKL GMA27/GEOMP+MKL GMA27/GEOMP+MA27 8 7 6 5 4 3 2 1 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 total size Speed-ups of parallel versions of GE in relation with the GF MA27 The GE method clearly outperforms the GF Speed-ups up to 8 for small sizes and closed to 4 for the largest problems ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 25

Results Application of parallelism for MBS larger than our case study CPU Intel Xeon E5-2620 2 GHz 2 hexa-cores = 12 cores 32 GB RAM MKL and MA27 sequential (dense and sparse solvers) for the Group equations GEOMP+MKL and GEOMP+MA27 is used for the Group equations nsg = 6, 16, 22 nsg-hs = 15, 30, 60, 120 nsg-t = 12 (nsg-hs=15) and 24 in other cases ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 26

Results Application of parallelism for MBS larger than our case study nsg nsg-hs MKL MA27 GEOMP+MKL GEOMP+MA27 6 15 0.85 1.90 0.65 0.76 30 2.13 3.33 1.42 1.79 60 5.54 7.24 2.72 3.13 120 20.18 15.96 9.61 6.49 16 15 2.07 4.86 0.98 1.66 30 4.91 8.61 2.35 3.07 60 14.02 18.69 4.65 5.68 120 52.71 41.43 16.16 13.23 22 15 3.00 7.18 1.35 2.19 30 7.15 12.96 2.62 4.27 60 20.77 27.95 6.53 8.02 120 78.89 61.01 23.47 19.38 ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 27

Results Application of parallelism for MBS larger than our case study nsg nsg-hs MKL MA27 GEOMP+MKL GEOMP+MA27 6 15 0.85 1.90 0.65 0.76 30 2.13 3.33 1.42 1.79 60 5.54 7.24 2.72 3.13 120 20.18 15.96 9.61 6.49 16 15 2.07 4.86 0.98 1.66 30 4.91 8.61 2.35 3.07 60 14.02 18.69 4.65 5.68 120 52.71 41.43 16.16 13.23 22 15 3.00 7.18 1.35 2.19 30 7.15 12.96 2.62 4.27 60 20.77 27.95 6.53 8.02 120 78.89 61.01 23.47 19.38 The improvement increases with the number of groups and the number of coordinates. The exploitation of the sparsity is advantageous from between 60 and 120 coordinates. Two levels of parallelism can be used for a better exploitation of the parallelism in larger multicore systems. ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 28

Results Raspberry Pi versus Multicore Systems Raspberry Pi: Small and cheap systems with low energy consumption The Stewart Platform Raspberry Pi 2 Model B (RP2) 4 cores ARMv7 32 bits Raspberry Pi 3 Model B (RP3) 4 cores ARMv8 64 bits MKL is not available LAPACK without multithreading is used for LAR CPU Intel Core i5-2400 3.10 GHz (MKL) 4 cores No Hyper-Threading 16 GB RAM CPU Intel Xeon E5-2620 2 GHz (MKL) 2 hexa-cores = 12 cores 32 GB RAM ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 29

Results Raspberry Pi versus Multicore Systems Execution Time Energy consumption nsg-t nsg-hs RP2 RP3 i5 E5 RP2 RP3 i5 E5 12 18 1.15 0.68 0.40 0.11 4.6 3.4 6.0 10.5 24 36 5.90 4.08 1.46 0.26 23.6 20.4 21.9 24.7 36 54 15.67 11.67 1.94 0.48 62.7 58.4 29.1 45.6 48 72 31.11 26.72 2.95 0.69 124.4 133.6 44.3 65.6 60 90 58.84 53.36 5.14 1.09 235.4 266.8 77.1 103.4 72 108 93.38 89.17 7.17 1.48 373.5 445.9 107.6 140.6 84 126 148.42 137.69 9.64 2.10 593.7 688.5 144.6 199.5 96 144 214.06 204.77 12.85 2.67 856.2 1023.9 192.8 253.7 Number of groups: nsg=6 tend=300 tend2=200 ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 30

Results Raspberry Pi versus Multicore Systems Execution Time Energy consumption nsg-t nsg-hs RP2 RP3 i5 E5 RP2 RP3 i5 E5 12 18 1.15 0.68 0.40 0.11 4.6 3.4 6.0 10.5 24 36 5.90 4.08 1.46 0.26 23.6 20.4 21.9 24.7 36 54 15.67 11.67 1.94 0.48 62.7 58.4 29.1 45.6 48 72 31.11 26.72 2.95 0.69 124.4 133.6 44.3 65.6 60 90 58.84 53.36 5.14 1.09 235.4 266.8 77.1 103.4 72 108 93.38 89.17 7.17 1.48 373.5 445.9 107.6 140.6 84 126 148.42 137.69 9.64 2.10 593.7 688.5 144.6 199.5 96 144 214.06 204.77 12.85 2.67 856.2 1023.9 192.8 253.7 TDP RP2=RP3=4W TDP i5=15w TDP E5=95W ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 33

Results Raspberry Pi versus Multicore Systems Execution Time Energy consumption nsg-t nsg-hs RP2 RP3 i5 E5 RP2 RP3 i5 E5 12 18 1.15 0.68 0.40 0.11 4.6 3.4 6.0 10.5 24 36 5.90 4.08 1.46 0.26 23.6 20.4 21.9 24.7 36 54 15.67 11.67 1.94 0.48 62.7 58.4 29.1 45.6 48 72 31.11 26.72 2.95 0.69 124.4 133.6 44.3 65.6 60 90 58.84 53.36 5.14 1.09 235.4 266.8 77.1 103.4 72 108 93.38 89.17 7.17 1.48 373.5 445.9 107.6 140.6 84 126 148.42 137.69 9.64 2.10 593.7 688.5 144.6 199.5 96 144 214.06 204.77 12.85 2.67 856.2 1023.9 192.8 253.7 Lowest TDP i5: Much low power consumption, slower RP: lowest TDP for the smallest size (SP) Competitive GP for control problems Low power consumption, price and size ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 34

Results Experiments on GPU CPU Intel Xeon E5-2620 2 GHz (MKL) 2 hexa-cores = 12 cores 32 GB RAM GPU GTX950 (MAGMA) ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 35

Results Experiments on GPU Xeon OpenMPxMKL GPU Speed-up Speed-up nsg-t nsg-hs 6x2 2x6 MAGMA 6x2/2x6 6x2/MAGMA 12 18 0.006 0.01 0.331 0.6 0.02 24 36 0.008 0.013 0.336 0.62 0.02 36 54 0.01 0.021 0.373 0.48 0.03 48 72 0.025 0.057 0.417 0.44 0.06 60 90 0.034 0.067 0.492 0.51 0.07 72 108 0.047 0.077 0.563 0.61 0.08 400 400 0.721 0.697 3.638 1.03 0.2 600 600 1.819 2.023 5.995 0.9 0.3 800 800 3.211 4.399 8.808 0.73 0.36 1000 1000 6.928 7.376 13.47 0.94 0.51 2000 2000 43.728 40.588 53.038 1.08 0.82 3000 3000 143.072 115.185 138.463 1.24 1.03 4000 4000 328.249 253.709 293.713 1.29 1.12 GPU would be advantageous only for large problems Not advisable: low execution or power consumption. ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 36

Outline Introduction Parallelism in the Structural Groups method Results Conclusions and Future work ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 37

Outline Introduction Parallelism in the Structural Groups method Results Conclusions and Future work ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 38

Conclusions The computational kimematic formulation based on group equations is a topological approach that exploits the kinematic structure of a MBS to divide it in several SGs of smaller sizes Parallel programming techniques can be applied to solve the equations independently The SP has been used to analyze the Group Equations formulation Lower execution times are obtained with the GE method in comparison with the global formulation Speed-ups achieved is between 4 and 8 Raspberry Pi: a good alternative to general purpose multicores for small control problems, with similar times and lower price, power consumption and space The massive parallelism of GPUs is appropriate for large problems ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 39

Future work The use of other computational libraries (dense and sparse) Auto-tuning techniques should be included in the routines the best parallel strategy and library with the values of some parameters (number of threads, number of steps) For large MBS the use of message-passing parallelism needs to be analyzed. ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 40

International Conference on Computational Science (ICCS 2017) Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations G. Bernabé, J. C. Cano, J. Cuenca, A. Flores, D. Giménez, M. Saura-Sánchez Ω and P. Segado-Cabezos Ω Computer Engineering Department, University of Murcia Computer Science and Systems Department, University of Murcia Ω Mechanical Engineering, Technical University of Cartagena 12-14 June, 2017 Conference title 41