International Conference on Computational Science (ICCS 2017)

International Conference on Computational Science (ICCS 2017) Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations G. Bernabé, J. C. Cano, J. Cuenca, A. Flores, D. Giménez, M. Saura-Sánchez Ω and P. Segado-Cabezos Ω Computer Engineering Department, University of Murcia Computer Science and Systems Department, University of Murcia Ω Mechanical Engineering, Technical University of Cartagena 12-14 June, 2017 Conference title 1

Outline Introduction and Motivation Parallelism in the Structural Groups method Results Conclusions and Future work ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 2

Introduction Multibody systems (MBS): mechanical systems formed by rigid and flexible bodies which are connected by means of mechanical joins in such a way that there is relative movement between their bodies terminal handles The Stewart Platform platform ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 4

Introduction The study of the relationships between the bodies is known as kinematic modeling Selects a vector q of coordinates to define the position and orientation of each body of the MBS in the space Coordinates are related by means a nonlinear systems of constrainst equations Φ(q) = 0 Global formulations Topogical formulations ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 6

Introduction The study of the relationships between the bodies is known as kinematic modeling Selects a vector q of coordinates to define the position and orientation of each body of the MBS in the space Coordinates are related by means a nonlinear systems of constrainst equations Φ(q) = 0 Global formulations Topogical formulations exploits the topology of the MBS to reduce the dimension of the problem by relating the position of each body with respect to its preceding one ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 7

Introduction Structural Analysis: splits the MBS into Structural Groups (SGs) Kinematic Structure: How many SG, kind & order terminal (SG-T0) (8) 12 dependent coordinates The Stewart Platform handle-stick (SG-H) (2-7) 15 dependent coordinates ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 9

Introduction Structural Analysis: splits the MBS into Structural Groups (SGs) Kinematic Structure: How many SG, kind & order terminal (SG-T) (8) handle-stick (SG-H) (2-7) ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 10

Motivation 1. A simulator for the computational kinematic analysis of MBS to allow us to analyze the efficiency of the group equations 2. A better exploitation of the computer resources by applying parallelism to reduce the executions in real-time applications terminal (SG-T) (8) handle-stick (SG-H) (2-7) ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 11

Parallelism in the Structural Groups method The Stewart Platform (MBS) is a case study to analyze the application of parallelism for speeding up the kinematic analysis based on Group equations terminal (SG-T) (8) handle-stick (SG-H) (2-7) ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 14

Parallelism in the Structural Groups method A scheme of the Group Equations method 1 for number of external iterations (tend*dt) do 2 Solve kinematic of terminal (size nsg-t) //MKL p. 3 for all structural components (nsg) do //OpenMP p. 4 for number of internal iterations (tend2) do 5 Solve kinematic of SC (size nsg-hs) //MKL p. 6 end for 7 end for 8 end for tend: a maximum execution time is established dt: time step tend2: number of iterations for the position problem nsg: number of structural groups nsg-t: dimension of the SG-T matrix nsg-hs: dimension of the SG-HS matrix ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 16

Parallelism in the Structural Groups method A scheme of the Group Equations method 1 for number of external iterations (tend*dt) do 2 Solve kinematic of terminal (size nsg-t) //MKL p. 3 for all structural components (nsg) do //OpenMP p. 4 for number of internal iterations (tend2) do 5 Solve kinematic of SC (size nsg-hs) //MKL p. 6 end for 7 end for 8 end for Parallelism can be exploited by simultaneously solving the problems for the SGs in the system, inside a multicore system (MKL) or with calls to GPU (MAGMA) tend: a maximum execution time is established dt: time step tend2: number of iterations for the position problem nsg: number of structural groups nsg-t: dimension of the SG-T matrix nsg-hs: dimension of the SG-HS matrix ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 17

Parallelism in the Structural Groups method A scheme of the Group Equations method 1 for number of external iterations (tend*dt) do 2 Solve kinematic of terminal (size nsg-t) //MKL p. 3 for all structural components (nsg) do //OpenMP p. 4 for number of internal iterations (tend2) do 5 Solve kinematic of SC (size nsg-hs) //MKL p. 6 end for 7 end for 8 end for We have exploited the parallelism in different ways: 1. GEMKL: The multithreading version of MKL. 2. GEOMP+MKL: OpenMP is used to start the threads which works simultaneously in the solution of different SGs. The matrix problems for each group are solved by calling MKL, which can be sequential or multithreading 3. GEOMP+MA27: OpenMP parallelism is exploited, with calls to the routine MA27 for solution of the matrix problem 4. GEMAGMA: GPU parallelism is exploited by solving the matrix problems with MAGMA. ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 18

Outline Introduction Parallelism in the Structural Groups method Results Conclusions and Future work ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 19

Results CPU Intel Core i5-2400 3.10 GHz 4 cores No Hyper-Threading 16 GB RAM MKL GMKL and MA27 GMA27 (dense and sparse solvers) is used for the Global formulation GEOMP+MKL and GEOMP+MA27 is used for the Group equations (Topological formulation) ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 21

Results Global formulations vs Group Equations Total size nsg-hs GMKL GMA27 GEOMP+MKL GEOMP+MA27 time th. time time th. x th. time th 102 15 2.52 1 4.19 0.53 3x1 0.74 3 204 30 9.11 3 9.02 1.13 3x1 1.40 3 384 60 32.43 3 13.45 2.39 3x1 2.84 3 744 120 151.94 4 25.29 8.40 3x1 5.71 3 1464 240 1127.25 4 48.26 42.33 3x1 12.23 3 2184 360 3579.63 4 71.27 147.41 3x1 18.23 3 2904 480 7149.04 4 94.19 323.77 3x1 24.35 3 4344 720 22840.51 4 140.13 828.97 1x4 36.59 3 Experiments: number of groups of the SP (nsg=6), nsg-t=12 and nsg-hs=15 for the smallest problem, and nsg-t is fixed to 24 and nsg-hs=30..720 for the other problems. Total size represents the size of the matrices for the global formulation. tend=200, dt=0.01, 20000 iterations, tend2=1 for the parallel algorithm for the G. E. ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 22

Results Global formulations vs Group Equations Total size nsg-hs GMKL GMA27 GEOMP+MKL GEOMP+MA27 time th. time time th. x th. time th 102 15 2.52 1 4.19 0.53 3x1 0.74 3 204 30 9.11 3 9.02 1.13 3x1 1.40 3 384 60 32.43 3 13.45 2.39 3x1 2.84 3 744 120 151.94 4 25.29 8.40 3x1 5.71 3 1464 240 1127.25 4 48.26 42.33 3x1 12.23 3 2184 360 3579.63 4 71.27 147.41 3x1 18.23 3 2904 480 7149.04 4 94.19 323.77 3x1 24.35 3 4344 720 22840.51 4 140.13 828.97 1x4 36.59 3 Multithreading MKL is preferable to MA27 for small sizes not sparsity ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 23

Results Global formulations vs Group Equations Total size nsg-hs GMKL GMA27 GEOMP+MKL GEOMP+MA27 time th. time time th. x th. time th 102 15 2.52 1 4.19 0.53 3x1 0.74 3 204 30 9.11 3 9.02 1.13 3x1 1.40 3 384 60 32.43 3 13.45 2.39 3x1 2.84 3 744 120 151.94 4 25.29 8.40 3x1 5.71 3 1464 240 1127.25 4 48.26 42.33 3x1 12.23 3 2184 360 3579.63 4 71.27 147.41 3x1 18.23 3 2904 480 7149.04 4 94.19 323.77 3x1 24.35 3 4344 720 22840.51 4 140.13 828.97 1x4 36.59 3 The exploitation of sparsity through MA27 is advisable for large matrices low complexity cost of MA27 (O(n)) vs MKL (O(n 3 )) The best results are obtained with 3 OpenMP threads ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 24

Spped-up Results Global formulations vs Group Equations 9 GMA27/GMKL GMA27/GEOMP+MKL GMA27/GEOMP+MA27 8 7 6 5 4 3 2 1 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 total size Speed-ups of parallel versions of GE in relation with the GF MA27 The GE method clearly outperforms the GF Speed-ups up to 8 for small sizes and closed to 4 for the largest problems ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 25

Results Application of parallelism for MBS larger than our case study CPU Intel Xeon E5-2620 2 GHz 2 hexa-cores = 12 cores 32 GB RAM MKL and MA27 sequential (dense and sparse solvers) for the Group equations GEOMP+MKL and GEOMP+MA27 is used for the Group equations nsg = 6, 16, 22 nsg-hs = 15, 30, 60, 120 nsg-t = 12 (nsg-hs=15) and 24 in other cases ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 26

Results Application of parallelism for MBS larger than our case study nsg nsg-hs MKL MA27 GEOMP+MKL GEOMP+MA27 6 15 0.85 1.90 0.65 0.76 30 2.13 3.33 1.42 1.79 60 5.54 7.24 2.72 3.13 120 20.18 15.96 9.61 6.49 16 15 2.07 4.86 0.98 1.66 30 4.91 8.61 2.35 3.07 60 14.02 18.69 4.65 5.68 120 52.71 41.43 16.16 13.23 22 15 3.00 7.18 1.35 2.19 30 7.15 12.96 2.62 4.27 60 20.77 27.95 6.53 8.02 120 78.89 61.01 23.47 19.38 The improvement increases with the number of groups and the number of coordinates. The exploitation of the sparsity is advantageous from between 60 and 120 coordinates. Two levels of parallelism can be used for a better exploitation of the parallelism in larger multicore systems. ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 28

Results Raspberry Pi versus Multicore Systems Raspberry Pi: Small and cheap systems with low energy consumption The Stewart Platform Raspberry Pi 2 Model B (RP2) 4 cores ARMv7 32 bits Raspberry Pi 3 Model B (RP3) 4 cores ARMv8 64 bits MKL is not available LAPACK without multithreading is used for LAR CPU Intel Core i5-2400 3.10 GHz (MKL) 4 cores No Hyper-Threading 16 GB RAM CPU Intel Xeon E5-2620 2 GHz (MKL) 2 hexa-cores = 12 cores 32 GB RAM ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 29

Results Raspberry Pi versus Multicore Systems Execution Time Energy consumption nsg-t nsg-hs RP2 RP3 i5 E5 RP2 RP3 i5 E5 12 18 1.15 0.68 0.40 0.11 4.6 3.4 6.0 10.5 24 36 5.90 4.08 1.46 0.26 23.6 20.4 21.9 24.7 36 54 15.67 11.67 1.94 0.48 62.7 58.4 29.1 45.6 48 72 31.11 26.72 2.95 0.69 124.4 133.6 44.3 65.6 60 90 58.84 53.36 5.14 1.09 235.4 266.8 77.1 103.4 72 108 93.38 89.17 7.17 1.48 373.5 445.9 107.6 140.6 84 126 148.42 137.69 9.64 2.10 593.7 688.5 144.6 199.5 96 144 214.06 204.77 12.85 2.67 856.2 1023.9 192.8 253.7 Number of groups: nsg=6 tend=300 tend2=200 ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 30

Results Raspberry Pi versus Multicore Systems Execution Time Energy consumption nsg-t nsg-hs RP2 RP3 i5 E5 RP2 RP3 i5 E5 12 18 1.15 0.68 0.40 0.11 4.6 3.4 6.0 10.5 24 36 5.90 4.08 1.46 0.26 23.6 20.4 21.9 24.7 36 54 15.67 11.67 1.94 0.48 62.7 58.4 29.1 45.6 48 72 31.11 26.72 2.95 0.69 124.4 133.6 44.3 65.6 60 90 58.84 53.36 5.14 1.09 235.4 266.8 77.1 103.4 72 108 93.38 89.17 7.17 1.48 373.5 445.9 107.6 140.6 84 126 148.42 137.69 9.64 2.10 593.7 688.5 144.6 199.5 96 144 214.06 204.77 12.85 2.67 856.2 1023.9 192.8 253.7 TDP RP2=RP3=4W TDP i5=15w TDP E5=95W ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 33

Results Raspberry Pi versus Multicore Systems Execution Time Energy consumption nsg-t nsg-hs RP2 RP3 i5 E5 RP2 RP3 i5 E5 12 18 1.15 0.68 0.40 0.11 4.6 3.4 6.0 10.5 24 36 5.90 4.08 1.46 0.26 23.6 20.4 21.9 24.7 36 54 15.67 11.67 1.94 0.48 62.7 58.4 29.1 45.6 48 72 31.11 26.72 2.95 0.69 124.4 133.6 44.3 65.6 60 90 58.84 53.36 5.14 1.09 235.4 266.8 77.1 103.4 72 108 93.38 89.17 7.17 1.48 373.5 445.9 107.6 140.6 84 126 148.42 137.69 9.64 2.10 593.7 688.5 144.6 199.5 96 144 214.06 204.77 12.85 2.67 856.2 1023.9 192.8 253.7 Lowest TDP i5: Much low power consumption, slower RP: lowest TDP for the smallest size (SP) Competitive GP for control problems Low power consumption, price and size ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 34

Results Experiments on GPU CPU Intel Xeon E5-2620 2 GHz (MKL) 2 hexa-cores = 12 cores 32 GB RAM GPU GTX950 (MAGMA) ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 35

Results Experiments on GPU Xeon OpenMPxMKL GPU Speed-up Speed-up nsg-t nsg-hs 6x2 2x6 MAGMA 6x2/2x6 6x2/MAGMA 12 18 0.006 0.01 0.331 0.6 0.02 24 36 0.008 0.013 0.336 0.62 0.02 36 54 0.01 0.021 0.373 0.48 0.03 48 72 0.025 0.057 0.417 0.44 0.06 60 90 0.034 0.067 0.492 0.51 0.07 72 108 0.047 0.077 0.563 0.61 0.08 400 400 0.721 0.697 3.638 1.03 0.2 600 600 1.819 2.023 5.995 0.9 0.3 800 800 3.211 4.399 8.808 0.73 0.36 1000 1000 6.928 7.376 13.47 0.94 0.51 2000 2000 43.728 40.588 53.038 1.08 0.82 3000 3000 143.072 115.185 138.463 1.24 1.03 4000 4000 328.249 253.709 293.713 1.29 1.12 GPU would be advantageous only for large problems Not advisable: low execution or power consumption. ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 36

Conclusions The computational kimematic formulation based on group equations is a topological approach that exploits the kinematic structure of a MBS to divide it in several SGs of smaller sizes Parallel programming techniques can be applied to solve the equations independently The SP has been used to analyze the Group Equations formulation Lower execution times are obtained with the GE method in comparison with the global formulation Speed-ups achieved is between 4 and 8 Raspberry Pi: a good alternative to general purpose multicores for small control problems, with similar times and lower price, power consumption and space The massive parallelism of GPUs is appropriate for large problems ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 39

Future work The use of other computational libraries (dense and sparse) Auto-tuning techniques should be included in the routines the best parallel strategy and library with the values of some parameters (number of threads, number of steps) For large MBS the use of message-passing parallelism needs to be analyzed. ICCS 17 Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations 40