Parallel implementation of the projected Gauss-Seidel method on the Intel Xeon Phi processor: Application to granular matter simulation.


Parallel implementation of the projected Gauss-Seidel method on the Intel Xeon Phi processor: Application to granular matter simulation.

Emil Rönnbäck
18th August 2014
Master's Thesis in Computing Science, 30 credits
Supervisor at Algoryx and UMIT Research Lab: Claude Lacoursière
Supervisor at CS-UmU: Stefan Johansson
Examiner: Fredrik Georgsson

Umeå University, Department of Computing Science, SE UMEÅ, SWEDEN


Abstract

Being able to simulate granular matter is important because such materials are ubiquitous both in nature and in industry. Some examples of granular materials are ore, sand, coffee, rice, corn, and snow. Research and development of new, more accurate, and faster methods are needed to simulate even more complex materials with millions of particles. In the work of this thesis, a typical scene containing thousands of particles has been used to analyse simulation performance using the iterative Gauss-Seidel method, adapted to the specifications and capabilities of the Intel Xeon Phi coprocessor. The work began with analysing the performance (wall-clock time and speedup) of a method developed by Algoryx Simulation. The work continued with finding the parts of the code causing bottlenecks and implementing improvements such as a distributed task scheduler and vectorization of operations. In the end, this resulted in shorter execution time and linear speedup using more than 40 threads, compared to 20 in the initial state. We also investigated the benefit of other techniques, such as cache prefetching and the use of huge page sizes, but found no performance gain from these. It is well known that the Xeon Phi coprocessor performs well when executing highly parallel applications, but it may be overloaded if an excessive amount of data is requested by many threads simultaneously. To tackle this issue, the convergence rate of the Gauss-Seidel method during simulation has been measured, and modifications of the method that decrease the data flow have been suggested, implemented, and analysed.


Contents

Abstract
1 Introduction
  1.1 Previous work
  1.2 Related work
2 Algoryx Simulation
3 Problem Description
  3.1 Problem statement
  3.2 Goals
  3.3 Purposes
  3.4 Methods
4 Theory
  4.1 Multibody dynamics simulation
  4.2 Kinematic constraints
  4.3 Iterative Methods
  4.4 The Gauss-Seidel Method
    4.4.1 Projected Gauss-Seidel
    4.4.2 Parallel Gauss-Seidel
  4.5 Spatial partitioning
  4.6 Algorithm description
  4.7 Hardware: The Intel Xeon Phi Coprocessor
  4.8 Current implementation
    4.8.1 Cache prefetching
    4.8.2 Vectorization
5 Results
  5.1 Initial implementation
  5.2 Improvements
    5.2.1 Task Scheduling
    5.2.2 Vectorization
6 Discussion
  6.1 Results in relation to previous work
  6.2 Iterative methods in relation to hardware
  6.3 Convergence of the projected Gauss-Seidel method
  6.4 Examined methods not giving any improvements
7 Conclusions
  7.1 Building for the Intel Xeon Phi Architecture
  7.2 Future work
Acknowledgements
References

List of Tables

4.1 Explanation of variables used in the multibody dynamics equation presented in (4.2)
4.2 Specification of the Intel Xeon Phi Coprocessor architecture compared to the host architecture


List of Figures

3.1 A rotating drum containing 500 thousand particles. This is the example used throughout for tests and benchmarks.
4.1 Two objects, B1 and B2, at distance x1 + r1 − x2 − r2 from each other.
4.2 Spatial partitioning of a scene to be able to solve the motions of particles in parallel, where each region corresponds to a job for one thread.
4.3 A bipartite graph where objects in U are connected through constraints in V.
4.4 A tree of dependence for calculations to be done within each box generated by spatial partitioning.
4.5 Overview of the Intel Xeon Phi architecture.
5.1 Wall clock time graph of the initial implementation of the GS solver run on the Intel Xeon Phi coprocessor.
5.2 Speedup graph of the initial implementation of the GS solver run on the Intel Xeon Phi coprocessor.
5.3 Wall clock time graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler.
5.4 Speedup graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler.
5.5 Wall clock time graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler and vectorization instructions.
5.6 Speedup graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler and vectorization instructions.
6.1 Convergence rate for the Gauss-Seidel method.


Chapter 1

Introduction

Granular materials are ubiquitous in nature. A few examples of this type of matter are gravel in a bucket, coffee beans in a container, and even icebergs. Because of its frequent occurrence, it is also the most common material in industry after water [20]. This is one reason why it is of great interest, both to industry and to academia, to be able to simulate granular matter: for research on how to improve industrial tools and methods, and to gain further knowledge of its behaviour. Technically, this kind of matter is a collection of distinct macroscopic particles that behaves as a solid body when compact, and more like a liquid or gas otherwise. The particles are always in frictional contact, and friction is one of the most important contributions to the overall properties. To simulate a granular material it is necessary to solve a large system of linear equations. To increase the speed of processing and make it possible to simulate a greater number of particles, a parallelizable method is a necessity. In this thesis an initial implementation of a method called Parallel Projected Gauss-Seidel (PPGS), developed at Umeå University [12] and Algoryx Simulation AB [17], is analysed and improved. More information, and the motivation for why PPGS is the best choice for solving very large systems of linear equations, is given in Section 4.1. Improvements are sought both as a general speed increase of the algorithm and as performance improvements on a specific piece of hardware: the Intel Xeon Phi coprocessor. Compared to a common x86 processor, the coprocessor consists of many cores and relies on high memory bandwidth and vector instructions for its high parallelization capability. This combination of properties has not been seen before, which is why it is relevant to find out whether this type of processing unit is of interest for physics simulation software. In the simulation software as a whole, there are several tasks to be performed, some sequentially and others in parallel. This thesis focuses on the most time consuming step in the simulation algorithm, where the motions are calculated. To obtain as efficient a solution as possible, the hardware, the theory of the PPGS method, and how to combine those factors into an optimal product have been studied. The improvements primarily result from using Intel Threading Building Blocks [3] for distributed job scheduling and from assembly-coded functions for vectorized operations specific to the coprocessor. For the performance measurements, simulations are run on a rotating drum containing a large amount of particles¹, see Figure 3.1. The final results for the solver show linear speedup using more than 40 threads, compared to linear speedup up to only 20 threads with the initial implementation.

¹ Scenes with up to 500 thousand particles.

1.1 Previous work

In this section, previous work on simulation of granular matter is presented, with focus on the scalability of the solver used. There exist several methods that can be used to simulate granular matter, and it is therefore of interest to investigate what has been done before and which results were accomplished. In a simulation, there are many parameters that influence the computational performance, such as the physical properties of the simulated matter, which directly affect the mathematical problem that needs to be solved, and how the data structures are stored in memory. Considering the parallelization aspect of the algorithms in the simulation, even though some algorithms are easier to parallelize than others, they might not perform better overall. Also, the properties of the platform, such as ordinary clusters, GPUs, or the Intel Xeon Phi, make some types of calculations easier than others. All of the mentioned topics are currently active fields of research. Below, a selection of three papers addressing the same problems as this thesis is discussed briefly. It is beyond the scope of this thesis to present a comprehensive review of the literature; the focus was on porting an existing method to the Intel Xeon Phi coprocessor.

In [19], M. Renouf, F. Dubois and P. Alart investigated a parallelization approach for the Non Smooth Contact Dynamics method for granular matter simulation, where OpenMP was used for thread communication. The simulations were run on up to 6 processors and, according to the speedup graph presented, relative speedup stayed above 90% of linearity. In another investigation [18], M. Renouf and P. Alart used Conjugate Gradient (CG) type algorithms to solve frictional multi-contact problems applied to granular matter. There it was concluded that the number of iterations a CG algorithm needed to reach reasonable accuracy was only a third of that of a comparable Gauss-Seidel solver. In a multi-threaded implementation, though, there were problems with ill-conditioning which degraded the performance. The CG methods are easier to parallelize than others but, as discussed in the paper, their convergence is to date not very good. A recent paper [23] by V. Visseq, P. Alart and D. Dureisseix presented good results for granular simulations where a domain decomposition approach was used. To solve the contact problems, Gauss-Seidel iterations were applied, and OpenMP was the means of communication between threads. The results included simulations of 200k particles where linear speedup was observed for up to around 20 threads, and satisfactory performance after that too.

Performance can always be better, and investigating new platforms is of interest to explore new possibilities for better performance and results in general. In this thesis, the performance of the relatively new Xeon Phi platform is investigated. It is a many-core architecture, which is why it might be suitable for highly parallelizable, large-scale granular matter simulations. Compared to usual clusters it is cheaper, which is attractive, and even though it is currently slower, it has better performance per Watt.

1.2 Related work

A prototype solution for granular matter simulation using PPGS has been developed at Umeå University [12] and further improved at Algoryx Simulation AB [17].

Chapter 2

Algoryx Simulation

Algoryx Simulation is the company at which I, the author, have been situated during the work on this thesis. The company is a leading provider of software and services for visual and interactive physics-based simulation. The products it currently provides are AgX Dynamics and Dynamics for SpaceClaim for the professional market, and Algodoo for the education market. Algoryx Simulation collaborates closely with UMIT Research Lab at Umeå University, which in turn provided the opportunity and the task of writing this thesis.


Chapter 3

Problem Description

Given a large number of particles interacting almost exclusively through frictional contact forces, by which means can the Parallel Projected Gauss-Seidel method (PPGS) be implemented to achieve good scalability while still keeping the physical correctness? This is the main question of this thesis, and it is answered with the Intel Xeon Phi coprocessor in mind. Other questions that this thesis will try to clarify are: how should the Gauss-Seidel algorithm be parallelized for best performance, where are the bottlenecks, and how can these be taken care of? Taken into account in this work is that the PPGS method is not naturally parallelizable, and that it is communication intensive.

3.1 Problem statement

In fast and efficient simulations of large systems of granular matter with millions of particles, the biggest challenge is to cope with all the involved computations within a given time constraint. Thus, the aim is to decrease the time needed for the calculations. Algorithm optimization for increased speed is mainly what is desired, taking the useful properties but also the limitations of the Intel Xeon Phi coprocessor into consideration.

3.2 Goals

The goals of this thesis are to evaluate and find bottlenecks in an existing implementation of the PPGS method for solving systems of particle interactions, especially considering the hardware properties of the Intel Xeon Phi coprocessor. Further, the goal is to implement the proposed improvements for increased performance when using the Intel Xeon Phi coprocessor.

3.3 Purposes

The purpose of this thesis is to evaluate and improve the PPGS method used in simulations of massive numbers of particles. Improvements are sought in the sense of increased parallelism and computational speed. As in many areas, including this one, solving a task faster is advantageous, since it speeds up the iteration loop of new research and development. On top of this, the architecture of the Intel Xeon Phi is of particular interest, being many-core, x86-compatible, and having a much higher memory bandwidth than common workstation processors [10]. A performance

evaluation of a non-embarrassingly-parallel¹ problem such as granular simulation is of interest, to get an indication of how well the architecture handles such problems. Whatever the result of the evaluation is, it will clearly state the difference from the standard fewer-core architecture and point out bottlenecks in performance, which may be useful for future implementations, both software- and hardware-related.

Figure 3.1: A rotating drum containing 500 thousand particles. This is the example used throughout for tests and benchmarks.

3.4 Methods

To be able to improve the PPGS method, test simulations must be performed and evaluated. As an initial test, scenes consisting of a rotating drum containing a varying number of particles will be used. This is a scene where the involved particles interact as in common everyday scenarios, for example gravel in a bucket. The picture in Figure 3.1 illustrates the drum and the contained granular material. The number of particles within the scenes will range between 50,000 and 500,000. This is more challenging than cases with only resting contacts, because the neighbour lists change dynamically, and both stiction and sliding friction are present. When executing the simulations, a fixed number of time steps will be taken, using 1 up to 228 threads for every scene. These tests will result in time measurements which will be analysed to get an indication of where the bottlenecks are and what is causing them. In the parts of the implementation where modifications may provide improvements, such as the task management and the Gauss-Seidel algorithm, changes will be proposed and implemented, and another iteration of tests will be made.

¹ A problem that requires no or little effort to separate into a number of parallel tasks.

Chapter 4

Theory

When performing multibody dynamics simulations, one of the most time consuming steps is to calculate the forces of interacting objects. The non-smooth method we use requires the solution of linear systems of equations, which is done every time step. Basically, what is sought is the solution to the linear equation

Ax = b,    (4.1)

where A is an n × n matrix, b is a vector of length n, and x = (x_1, x_2, ..., x_n) is the vector of unknowns. One way of doing this is to use a direct method (e.g., Gaussian elimination), which computes the exact solution in finite time. The complexity of such a method is O(n^3), which for large systems becomes very time consuming, and since it also has a high memory footprint it is not well suited for solving large systems of equations [22]. When solving for thousands of interactions every time step under a time constraint, using a direct method is accordingly not possible. Hence, another type of method must be used: an iterative method. Using an iterative method it is possible to get a good enough approximation within a set time constraint. Methods of this type iteratively improve the approximation, either until a target precision is met or a time limit is reached [13].

4.1 Multibody dynamics simulation

When simulating interacting objects, the interaction between the objects can be described by the linear system

$$\begin{bmatrix} M & -G_k^T \\ G_k & \Sigma \end{bmatrix} \begin{bmatrix} v_{k+1} \\ \lambda \end{bmatrix} = \begin{bmatrix} M v_k + h f_k \\ -\frac{4}{h}\Upsilon g_k + \Upsilon G_k v_k \end{bmatrix},    (4.2)$$

which is solved with regard to v and λ. Explanations of the variables appearing in Equation 4.2 are summarized in Table 4.1. In Equation 4.2, each column of M and G represents an interacting object and each row of G is a connection between interacting objects.
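To make the block structure concrete, consider a single contact constraint coupling two bodies a and b; this two-body layout is an illustrative assumption and not an example taken from the thesis. With M_a and M_b the mass blocks of the two bodies and one constraint row with blocks G_a and G_b, the system (4.2) reduces to

$$\begin{bmatrix} M_a & 0 & -G_a^T \\ 0 & M_b & -G_b^T \\ G_a & G_b & \Sigma \end{bmatrix} \begin{bmatrix} v_{a,k+1} \\ v_{b,k+1} \\ \lambda \end{bmatrix} = \begin{bmatrix} M_a v_{a,k} + h f_{a,k} \\ M_b v_{b,k} + h f_{b,k} \\ -\frac{4}{h}\Upsilon g_k + \Upsilon (G_a v_{a,k} + G_b v_{b,k}) \end{bmatrix},$$

so the only non-zero column blocks of the constraint row are those of the two bodies the contact touches.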

  M   Mass matrix: a block diagonal n × n matrix, where each block is a 3 × 3 (in 3D) mass matrix of an interacting object.
  G   Jacobian matrix: a sparse matrix containing the gradients of the functions describing kinematic constraints between objects in the scene (such as collision constraints). This is described in detail in Section 4.2.
  Σ   Diagonal matrix containing the regularization parameters, Σ = (4/h²) diag(ε_1/(1 + 4τ_1/h), ε_2/(1 + 4τ_2/h), ..., ε_m/(1 + 4τ_m/h)) [14, p. 100].
  Υ   Regularization parameter, Υ = diag(1/(1 + 4τ_1/h), 1/(1 + 4τ_2/h), ..., 1/(1 + 4τ_m/h)) [14, p. 100].
  v   Velocity.
  λ   Lagrange multiplier.
  f   Force.
  k   Current time step.
  h   Time step size.
  g   Constraint violation.

Table 4.1: Explanation of variables used in the multibody dynamics equation presented in (4.2).

4.2 Kinematic constraints

When simulating mechanical phenomena, there is a need to mathematically describe the different constraints acting upon objects in a scene. A constraint could, for example, prevent objects from moving through each other during a collision, act as friction between objects, or lock one object to another in a specified way, such as with a hinge or joint. What follows is an example of how a contact constraint is derived and transformed into a useful expression for the rest of the simulation. This is the type of constraint used in granular matter simulations. Consider two objects at positions x_1 and x_2 in world space, as seen in Figure 4.1. The points of collision for B_1 and B_2 are

p_1 = x_1 + r_1  and  p_2 = x_2 + r_2,    (4.3)

where the surface normals at the location of contact are n_1 and n_2 for objects B_1 and B_2, respectively. From this, the penetration depth can be computed as

C = (p_1 − p_2) · n_1 = (x_1 + r_1 − x_2 − r_2) · n_1.    (4.4)

C is the position-based penetration constraint for the contact between B_1 and B_2, which is satisfied as long as the two bodies are separated, C ≥ 0. In the computations done in the simulations of this thesis, the constraints are velocity based. To form such a constraint, Ċ, the expression for C

Figure 4.1: Two objects, B_1 and B_2, at distance x_1 + r_1 − x_2 − r_2 from each other.

must be differentiated with respect to time:

$$\begin{aligned}
\dot{C} &= \frac{d}{dt}\big((x_1 + r_1 - x_2 - r_2)\cdot n_1\big) \\
&= \frac{d}{dt}(x_1 + r_1 - x_2 - r_2)\cdot n_1 + (x_1 + r_1 - x_2 - r_2)\cdot\frac{d}{dt}(n_1) \\
&= (v_1 + \omega_1\times r_1 - v_2 - \omega_2\times r_2)\cdot n_1 + (x_1 + r_1 - x_2 - r_2)\cdot\frac{d}{dt}(n_1) \\
&\approx (v_1 + \omega_1\times r_1 - v_2 - \omega_2\times r_2)\cdot n_1 \\
&= v_1\cdot n_1 + \omega_1\cdot(r_1\times n_1) - v_2\cdot n_1 - \omega_2\cdot(r_2\times n_1) \\
&= \underbrace{\begin{pmatrix} n_1^T & (r_1\times n_1)^T & -n_1^T & -(r_2\times n_1)^T \end{pmatrix}}_{J}
   \underbrace{\begin{pmatrix} v_1 \\ \omega_1 \\ v_2 \\ \omega_2 \end{pmatrix}}_{v}
\end{aligned}    (4.5)$$

The approximation on line four of Equation 4.5 can be made since the term added on line three is zero when the constraint is satisfied. The J part of Ċ is called a Jacobian and is what the matrix G in Equation 4.2 consists of: one Jacobian for every pair of objects in contact [1].

4.3 Iterative Methods

There exists a large number of iterative methods for solving systems of linear equations. Many of these methods (e.g. Jacobi, Gauss-Seidel, Successive over-relaxation) have the form

$$M x^{(k+1)} = N x^{(k)} + b,    (4.6)$$

where A = M − N is a splitting of the matrix A. For a method to be practical, it must be computationally cheap to solve a linear system with M as the matrix. Whether (4.6) converges to x = A^{-1}b depends on the eigenvalues of M^{-1}N, specifically on the size of the spectral radius of an n × n matrix G, defined by

$$\rho(G) = \max\{\, |\lambda| : \lambda \in \sigma(G) \,\},    (4.7)$$

where σ(G) is the set of all eigenvalues of G. In this case G = M^{-1}N, and its spectral radius is of critical importance for the success of these iterative methods.
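Since ρ(M^{-1}N) decides whether a splitting method converges at all, it can be instructive to estimate it numerically. The sketch below does so by power iteration on a small dense iteration matrix; the 3 × 3 test matrix, the function name, and the fixed iteration count are hypothetical choices for illustration (assuming a real dominant eigenvalue), and the sketch is not part of the thesis code.

#include <array>
#include <cmath>
#include <cstdio>

// Estimate the spectral radius rho(T) of a small dense matrix T by power
// iteration: repeatedly apply T to a vector and track the growth factor.
// If rho(T) < 1, the stationary iteration x_(k+1) = M^-1 (N x_k + b) converges.
template <std::size_t N>
double spectral_radius(const std::array<std::array<double, N>, N>& T,
                       int iterations = 200) {
    std::array<double, N> x{};
    x[0] = 1.0;                        // arbitrary non-zero start vector
    double norm = 1.0;
    for (int k = 0; k < iterations; ++k) {
        std::array<double, N> y{};     // y = T * x
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                y[i] += T[i][j] * x[j];
        norm = 0.0;
        for (double value : y) norm += value * value;
        norm = std::sqrt(norm);
        if (norm == 0.0) return 0.0;   // T annihilated x; radius estimate is zero
        for (std::size_t i = 0; i < N; ++i) x[i] = y[i] / norm;
    }
    return norm;                       // growth factor, close to the dominant |eigenvalue|
}

int main() {
    // Hypothetical iteration matrix T = M^-1 N for a 3 x 3 splitting.
    std::array<std::array<double, 3>, 3> T{{{0.0, 0.2, 0.1},
                                            {0.1, 0.0, 0.3},
                                            {0.2, 0.1, 0.0}}};
    std::printf("estimated rho(T) = %.3f\n", spectral_radius(T));
    return 0;
}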

Theorem 4.3.1 [7]. Suppose b ∈ R^n and that A = M − N ∈ R^{n×n} is non-singular. If M is non-singular and the spectral radius of M^{-1}N satisfies the inequality ρ(M^{-1}N) < 1, then the iterates x^{(k)} defined by Mx^{(k+1)} = Nx^{(k)} + b converge to x = A^{-1}b for any starting vector x^{(0)}.

The consequence of Theorem 4.3.1 is that, for the system of linear equations in Equation 4.6, ρ(M^{-1}N) < 1 must hold for convergence to be guaranteed. It should be noted that even if this property is fulfilled for M^{-1}N, the convergence rate may be very slow if the spectral radius is close to 1. Also, every iterative method of the type mentioned above uses its own particular way of splitting the matrix A. This results in differing properties of the matrix operated upon, which in turn influences the convergence rate. One condition that guarantees convergence for these methods is strict diagonal dominance of the matrix A, meaning

$$|a_{ii}| > \sum_{j \neq i} |a_{ij}|, \quad \text{for all } i,    (4.8)$$

where a_{ij} denotes the entry on row i and column j of the matrix [9, p. 511]. Unfortunately, this property is not fulfilled for the matrices involved in computing contact forces.

4.4 The Gauss-Seidel Method

The iterative method used in this thesis to solve the linear system of equations in Equation 4.1 is the Gauss-Seidel method [7]. In this section, the method is introduced and the motivation for why this method is chosen is presented. For the Gauss-Seidel method, the splitting (4.6) of the matrix A is

M = D + L,   N = −U,    (4.9)

where L, D, and U represent the strictly lower triangular, diagonal, and strictly upper triangular parts of A, respectively [9]. In matrix form, the Gauss-Seidel method is expressed as

$$x^{(k)} = (D + L)^{-1}\big(b - U x^{(k-1)}\big),    (4.10)$$

where x^{(k)} denotes x at the kth iteration. The equations are solved one at a time, using the newly computed values x_j^{(k)} as soon as they are available. At iteration k, the next value of x_i^{(k)}, element i of x in the sequence of updates, is obtained from the equation

$$x_i^{(k)} = \frac{b_i - \sum_{j<i} a_{ij} x_j^{(k)} - \sum_{j>i} a_{ij} x_j^{(k-1)}}{a_{ii}}.    (4.11)$$

For Gauss-Seidel, there is an additional property other than strict diagonal dominance which guarantees convergence, which the following theorem states.

Theorem 4.4.1 [7]. If A ∈ R^{n×n} is symmetric and positive definite (SPD), then the Gauss-Seidel iteration converges for any x^{(0)}.

The matrix to solve for in multibody dynamics simulation, given in (4.2), is positive definite but not symmetric, and thus cannot be handled directly by Gauss-Seidel.
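To make the update rule (4.11) concrete, the sketch below performs plain Gauss-Seidel sweeps on a small dense system. The matrix, right-hand side, and sweep count are made-up values for illustration; the thesis solver never forms its system matrix explicitly, so this is not the implementation used in the simulations.

#include <cstdio>
#include <vector>

// One Gauss-Seidel sweep over a dense n x n system A x = b, updating x in place.
// Newly computed components x_j (j < i) are used immediately, exactly as in (4.11).
void gauss_seidel_sweep(const std::vector<std::vector<double>>& A,
                        const std::vector<double>& b, std::vector<double>& x) {
    const std::size_t n = b.size();
    for (std::size_t i = 0; i < n; ++i) {
        double sum = b[i];
        for (std::size_t j = 0; j < n; ++j)
            if (j != i) sum -= A[i][j] * x[j];
        x[i] = sum / A[i][i];
    }
}

int main() {
    // Hypothetical symmetric positive definite test system; by Theorem 4.4.1,
    // Gauss-Seidel is guaranteed to converge for it.
    std::vector<std::vector<double>> A{{4, 1, 0}, {1, 4, 1}, {0, 1, 4}};
    std::vector<double> b{5, 6, 5}, x(3, 0.0);
    for (int sweep = 0; sweep < 25; ++sweep) gauss_seidel_sweep(A, b, x);
    std::printf("x = (%.4f, %.4f, %.4f)\n", x[0], x[1], x[2]);  // approaches (1, 1, 1)
    return 0;
}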

To solve this problem, the Schur complement matrix is used instead, defined as

$$S_\epsilon = G M^{-1} G^T + \Sigma,    (4.12)$$

which is symmetric and positive definite and corresponds to the linear system

$$S_\epsilon \lambda = q - G M^{-1} p.    (4.13)$$

Solving (4.2) is equivalent to solving (4.13) and then substituting the result. This means that, after λ has been computed from (4.13), v can be obtained as

$$v = M^{-1}(G^T \lambda + p).    (4.14)$$

Using Gauss-Seidel iteration, it is possible to obtain λ and v approximately without going through the time consuming task of computing S_ε explicitly. For this, the needed computations are of the form

$$y \leftarrow G x, \qquad z \leftarrow M^{-1} G^T w.    (4.15)$$

Here, G and M^{-1}G^T do not have to be computed explicitly either. Instead, these operations are carried out by traversing the constraints and using a packed format for the entries in the matrices. Using the Schur complement, the main stepping equation can be written as v = v^{(0)} + hM^{-1}f^{(0)} + M^{-1}G^T λ, which is interpreted as

$$\begin{aligned}
v^{(1)} &= v^{(0)} + h M^{-1} f^{(0)} \\
f^{(c)} &= G^T \lambda \\
v &= v^{(1)} + M^{-1} f^{(c)}
\end{aligned}    (4.16)$$

where v^{(1)} is the predicted final velocity if no constraint forces are active, f^{(c)} is the impulse caused by the constraints, and v is the final velocity. The main linear problem to solve is thus

$$S_\epsilon \lambda = q - G v^{(1)}.    (4.17)$$

Gauss-Seidel is one out of several methods which may be used to solve such systems of equations; Jacobi, Conjugate Gradient, and Successive Over-Relaxation are some others, to name a few. These methods improve the solution with each additional iteration, not delivering the exact answer but rather a sufficiently good approximation within given time constraints. The arguments for using Gauss-Seidel as the iterative method are as follows. Because the Gauss-Seidel solution improves smoothly with every iteration, it is possible to terminate the solver when time is up, and the method is also easy to implement in software. Even though other methods can converge faster in theory, PGS is the only known method which gives satisfactory results given the current techniques. For a thorough discussion and analysis, see [14, p. 318].

4.4.1 Projected Gauss-Seidel

Projected Gauss-Seidel is a modified version of the original that solves linear complementarity problems (LCP), or rather mixed linear complementarity problems (MLCP), which are a generalization of LCP that includes free variables. In the context of this thesis, the MLCP is connected to physical properties in the scene, which will be explained below. An LCP is used to model a discontinuous relationship between two variables, such as the one arising during contacts between granules in the simulations done in this thesis. In the following, an inequality relationship (≤) between vectors will be used, where the inequality holds pairwise for each element of the vectors.

Precisely, u = (u_1, u_2, ..., u_N)^T ≤ (v_1, v_2, ..., v_N)^T = v holds if and only if u_i ≤ v_i for 1 ≤ i ≤ N. Given a symmetric matrix A ∈ R^{n×n} and a vector b ∈ R^n, the problem is to find x ∈ R^n and w ∈ R^n such that

w = Ax − b ≥ 0,    (4.18)
x ≥ 0,    (4.19)
x^T w = 0.    (4.20)

The idea used to reach a solution is to use the splitting mentioned in (4.9) together with a projection operation. Given a vector x ∈ R^n, a vector of lower limits x_lo ≤ 0, and a vector of upper limits x_hi ≥ 0, where x_lo, x_hi ∈ R^n, the projection operation on x is written (x)^+ and means that, for each j = 0 to n − 1,

$$x_j^+ = \max(\min(x_j, x_{hi_j}), x_{lo_j}),    (4.21)$$

as shown by [8, p. 629].

4.4.2 Parallel Gauss-Seidel

The Parallel Gauss-Seidel method is an extended version of the standard Gauss-Seidel method with the capability to be run in parallel, thus solving problems faster. For many mechanical systems, and especially for systems of granules, the Jacobians in G (4.2) have a simple block structure

$$G^T = \begin{bmatrix} G_1^T & G_2^T & \dots & G_{n_c}^T \end{bmatrix},    (4.22)$$

where each block row G_i^T is of size n_i × n, with Σ_i n_i = m, and n_c is the total number of constraints. In every block row G_i^T there is only a small number of non-zero column blocks, usually one or two, as is common for contact constraints. Most commonly, each block column corresponds to a single physical object, which makes the system parallelizable over unrelated objects and constraints. Figure 4.3 gives a descriptive view of this: a bipartite graph containing nodes which are either objects or constraints, where groups of nodes not connected through edges are independent of each other. This is also what makes it possible to bypass the computation of the Schur complement in (4.17) mentioned above [13].

4.5 Spatial partitioning

When simulating many thousands of granules, as in our simulations, there is a need for further improvements beyond identifying independent groups of constrained objects as stated above. The following describes how the simulation software divides the simulation volume spatially into boxes, which ensures that constraints in one box only reach neighbouring boxes, establishing a method for parallel execution. Consider the right picture in Figure 4.2, which depicts a spatial partitioning of a volume containing particles, where the main partitioning generates equally sized cubes. The constraints between particles within each box are solved locally by different threads, which enables parallel execution. In places where particles share a constraint between boxes, the volume is split up in a more fine-grained way. Doing this creates new boxes that also contain constraints independent of information outside their enclosures. When calculations are carried out, the results for all boxes that are not directly adjacent to one another can be computed in parallel.
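As an illustration of this box-parallel execution, and of the staged ordering it leads to (cf. the dependency tree in Figure 4.4), the sketch below processes the boxes stage by stage with a parallel loop inside each stage. The Box type, the stage layout, and the use of tbb::parallel_for are illustrative assumptions and not the thesis implementation.

#include <cstddef>
#include <vector>
#include <tbb/parallel_for.h>

// A partition box holding the indices of the constraints it owns.
struct Box {
    std::vector<int> constraintIndices;
};

// Placeholder for the per-box work (a projected Gauss-Seidel sweep over the
// constraints in the box); left empty to keep the sketch self-contained.
void solveBox(Box& /*box*/) {}

// Boxes within one stage share no bodies, so they can be solved concurrently;
// the blocking parallel_for acts as a barrier between consecutive stages.
void solveStages(std::vector<std::vector<Box*>>& stages) {
    for (std::vector<Box*>& stage : stages) {   // e.g. internal, edge, then corner boxes
        tbb::parallel_for(std::size_t(0), stage.size(),
                          [&](std::size_t i) { solveBox(*stage[i]); });
    }
}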

Figure 4.2: Spatial partitioning of a scene in order to solve the motions of particles in parallel, where each region corresponds to a job for one thread.

Each of the boxes in the picture is eight particle diameters wide, which allows the partitioning algorithm to be optimized, knowing that there can only be one contact constraint along a straight path between two blue sub-boxes. In other words, each box is eight units wide in a simulation where the particles have a diameter of one unit. Having this box size restricts the number of contact constraints that fit in some dimension(s) of the inner sub-boxes, depending on their type, which grants an opportunity for optimization. The left picture in Figure 4.2 shows the partitioning in 2D, giving a clearer view than the 3D view to the right. It should be noted that the splitting in 3D differs from the 2D case in that it contains an additional type of sub-box. The result of this splitting is a tree of dependences, where the calculations for some boxes need to be done before others can be handled. This dependency graph is displayed in Figure 4.4, where the calculations for each type of box are done in stages. Relating the partitioning to the software, each box of calculations represents an independent task which is enqueued in the task scheduler, which in turn has the responsibility of handing out work to unoccupied threads.

4.6 Algorithm description

When performing physics simulations, there are a number of tasks which must be performed for every object that is in contact with another, in order to resolve the motion. These tasks are given in Algorithm 1. The variables used in the algorithm are the ones described in Table 4.1, where the values of v_0 and λ_0 are those of v and λ from the previous time step, respectively.

4.7 Hardware: The Intel Xeon Phi Coprocessor

The targeted hardware for the computations is the Intel Xeon Phi coprocessor [4, 5], which is a so-called Many Integrated Core architecture product for large-scale parallel processing. The Intel Xeon Phi coprocessor is connected to a hosting Intel Xeon processor as a PCI Express (PCIe) add-on card.

Figure 4.3: A bipartite graph where objects in U are connected through constraints in V.

Figure 4.4: A tree of dependence for the calculations to be done within each box generated by the spatial partitioning (stages: Start, internal boxes, edge boxes, corner boxes, Stop).

Algorithm: Projected Gauss-Seidel
  Given b, M, G, ε, h, f_0
  Initialize v ← v^(0) + hM^(-1) f_0, λ ← λ_0
  Compute the blocks d_ii ← Σ_b G_ib M_bb^(-1) G_ib^T, for i = 1, 2, ..., n_c
  repeat
    for i = 1, 2, ..., n_c do
      r ← b_i + n_i · v_b1 + ε λ_i^(ν)
      if b_2 ≠ 0 then
        r ← r − n_i · v_b2
      end
      z ← max(0, λ_i^(ν) − r / d_ii)
      Δλ_i^(ν+1) ← z − λ_i^(ν)
      λ_i^(ν+1) ← z
      v_b1 ← v_b1 + M_b1b1^(-1) G_ib1^T Δλ_i
      if b_2 ≠ 0 then
        v_b2 ← v_b2 + M_b2b2^(-1) G_ib2^T Δλ_i
      end
    end
  until time is up

Algorithm 1: Algorithm used when simulating particle interactions. Running this algorithm is equivalent to performing GS iterations on the Schur complement stated in (4.17).

In Table 4.2, a summary of the main properties of the Intel Xeon Phi coprocessor compared to the properties of the Intel Xeon host processor is presented. This hardware is aimed at achieving high throughput performance where the available space and power are constrained. An advantage of this hardware is that it is programmed very similarly to the standard x86 architecture, while still offering a high degree of parallelization. Standard C, C++, and Fortran source code can, without any modifications, be compiled and run on the coprocessor. This is in contrast to other architectures, such as various GPUs, where existing implementations need to be completely rewritten because of significant differences in how the programming is done. The coprocessor is also supported by a rich development environment which includes compilers, numerous libraries for tasks such as threading and math computation, and tools for performance tuning and debugging. However, there is no real IDE available on the coprocessor itself, which puts constraints on the code development. The coprocessor is suitable for highly parallel applications where the ratio of computation to data access is high. Applications can either be run natively on the coprocessor or, by using instructions in the source code, run some parts on the host processor and send the highly parallelizable parts to the coprocessor. The gain of doing this is faster total execution, since the clock frequency of the host processor is more than two times higher than that of the coprocessor. The Xeon Phi coprocessor combines 57 cores on a single chip, where the cores are connected to each other by the bidirectional Interprocessor Network (IPN) ring.

Each core has 512-bit registers for Single Instruction, Multiple Data (SIMD) instructions. The L2 cache memory is shared among the group of four hardware threads residing in each core, but the L2 memory as a whole is kept coherent among all cores. This means that the total L2 cache memory size sums up to almost 30 MB. If threads in different cores perform work on the same data, each core will have its own local copy of the data. The cores have dual in-order execution pipelines, which are simpler than out-of-order pipelines; as a result, threads stall when data is fetched from memory, in contrast to the host cores, where it is possible to execute other instructions while waiting [15].

                               Coprocessor (model: 3120)    Host (model: E5-2670)
  Core frequency               1.1 GHz                      2.6 GHz
  Number of cores              57                           8
  Hardware threads per core    4                            2
  L1 cache size                32 kB per core               32 kB per core
  L2 cache size                31 MB (512 kB × 62)          2 MB
  Max. memory bandwidth        240 GB/s                     51 GB/s

Table 4.2: Specification of the Intel Xeon Phi coprocessor architecture compared to the host architecture [11].

Figure 4.5: Overview of the Intel Xeon Phi architecture.

Figure 4.5 shows a graphical description of the coprocessor architecture and how it relates to the rest of the computer system. To the left is the Intel Xeon host processor, to which the coprocessor is connected through a PCIe bus. A lightweight Linux distribution is installed on the card, which is accessed through Secure Shell (SSH) from the host processor. There is also a TCP/IP stack implemented over the PCIe bus, which allows users to access the coprocessor as a network node. To the right, the main parts of the coprocessor can be seen: the processing cores, caches, GDDR memory controllers, and a very high-bandwidth bidirectional ring which interconnects the system. The PCIe client logic and the memory controllers provide direct access to the PCIe bus and to the memory, respectively, without any intervention from the host processor. Each core possesses its own L2 cache, which is kept coherent by a globally distributed tag directory (TD). The bidirectional ring used for communication actually consists of five independent rings in each direction. The widest and most expensive is the data block ring, which is 64 bytes wide to support the high bandwidth requirement of the many connected cores. The other four rings are smaller; two are used for sending read/write commands and memory addresses, and the other two for acknowledgement messages. For the cores to process work as fast as possible, the delays that occur when new data is fetched into the cache must be worked around. In the coprocessor, this is done by allowing four hardware threads to run simultaneously, which reduces the wait time [6].
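As an illustration of the offload execution model mentioned above, the Intel compiler's Language Extensions for Offload can mark a region of code to run on the coprocessor. The snippet below is a generic sketch with invented function and buffer names; it is not taken from the thesis code and requires the Intel C/C++ compiler with offload support.

#include <cstddef>

// Scale a buffer on the coprocessor using the Intel offload pragma. The inout
// clause copies the buffer over PCIe to the card before the region runs and
// back afterwards; without a coprocessor the region falls back to the host.
void scale_on_mic(double* data, std::size_t n, double factor) {
#pragma offload target(mic : 0) inout(data : length(n))
    {
#pragma omp parallel for
        for (long i = 0; i < static_cast<long>(n); ++i)
            data[i] *= factor;
    }
}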

4.8 Current implementation

Using the Xeon Phi architecture, parts of an implementation can be run on the host chip, with its higher single-thread performance, while other parts that are more suitable for parallelization can be executed on the coprocessor. The main factors influencing performance are the scalability of the algorithms, vectorization of instructions, and memory utilization. These properties are of particular interest in the following analysis and discussion of the initial implementation and the future improvements of the PPGS method used for particle simulations. What follows are explanations of the different areas within the PPGS algorithm where improvements have been proposed.

4.8.1 Cache prefetching

There are possibilities to improve performance by applying cache prefetching for certain data in the code. This hides the latencies arising when data is read from memory. Prefetching means that explicit instructions are given to the processor to fetch data from main memory into the cache just before it is needed. Higher performance may result from this because threads are not forced to stall due to memory delays. By default, hardware prefetching is used, meaning that mechanisms in the hardware are responsible for prefetching. This functionality is triggered when the software tries to access data not found in the cache, in which case multiple cache lines are fetched from main memory. Instead of letting the hardware do the prefetching automatically, instructions in the software code can be used to base prefetching on knowledge of future memory accesses instead of on cache misses.

4.8.2 Vectorization

To get good performance out of the Intel Xeon Phi coprocessor it is necessary to use vectorization instructions. This means taking advantage of the 512-bit wide SIMD registers residing in the cores of the coprocessor. Ways of doing this range from using optimized library functions and writing assembly code to calling intrinsic functions that mimic assembly; the last method is the one used in this work. In the code, this corresponds to calling functions that are executed on specialized hardware, instead of using the usual mathematical operators such as * and +. The input arguments to these functions are 512 bits of aligned data, which corresponds to 8 doubles or 16 floats. Operations are then carried out pairwise on the values of the input, i.e., for the arguments A = {a_1, ..., a_n} and B = {b_1, ..., b_n} and the return value C = {c_1, ..., c_n}, the n operations a_i ∘ b_i = c_i are done simultaneously, where n is 8 for doubles or 16 for floats. Since these operations are executed on specialized hardware, fewer clock cycles are required, which in this case equals faster execution. To give the reader an understanding of and insight into what intrinsics are, a few examples out of many are given in Listings 4.1 to 4.3.

Listing 4.1: Multiply instruction
__m512d _mm512_mul_pd(__m512d a, __m512d b);

Listing 4.2: Multiply-add instruction
__m512d _mm512_fmadd_pd(__m512d a, __m512d b, __m512d c);

Listing 4.3: Reduce-add instruction
double _mm512_reduce_add_pd(__m512d a);

The functions in Listings 4.1 to 4.3 operate on packed double-precision vectors taken as arguments. In Listing 4.1, the function multiplies the elements of a and b and returns the result. The function in Listing 4.2 multiplies the elements of a and b, adds the elements of c to the products, and returns the result. In Listing 4.3, the function sums all the elements of a and returns the result.
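To show how these intrinsics compose, the sketch below computes a dot product of two 64-byte aligned arrays eight doubles at a time. The function name and the alignment and length assumptions are illustrative; this is a minimal example, not the vectorized kernel used in the thesis.

#include <immintrin.h>
#include <cstddef>

// Dot product of two 64-byte aligned arrays whose length is a multiple of 8,
// using the intrinsics from Listings 4.1-4.3: fused multiply-add accumulates
// eight partial products per step, and reduce-add collapses the lanes at the end.
double dot_product_512(const double* a, const double* b, std::size_t n) {
    __m512d acc = _mm512_setzero_pd();
    for (std::size_t i = 0; i < n; i += 8) {
        __m512d va = _mm512_load_pd(a + i);   // aligned 8-wide load
        __m512d vb = _mm512_load_pd(b + i);
        acc = _mm512_fmadd_pd(va, vb, acc);   // acc += va * vb, element-wise
    }
    return _mm512_reduce_add_pd(acc);         // horizontal sum of the 8 lanes
}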

Chapter 5

Results

What follows in this chapter are the results of what has been accomplished during the work on this thesis. A discussion of the results, the Gauss-Seidel method, and iterative methods follows in the next chapter. For analysis and testing of the modifications to the Gauss-Seidel iteration code, a reference scene consisting of a drum containing 50,000 to 500,000 particles has been used. Performance graphs are presented with both time and speedup measurements. For the speedup, Equation 5.1 is used, where S is the speedup and T_s and T_p are the times it takes to run the simulation on the coprocessor on a single thread and in parallel, respectively:

S = T_s / T_p.    (5.1)

5.1 Initial implementation

The initial implementation of the granular simulation had already been done by Algoryx Simulation. For later comparison with the improvements made in this thesis, the initial implementation was first run on the coprocessor without any modifications. In Figure 5.1 the initial time measurements can be seen for simulations containing various numbers of particles. As seen, the coprocessor gets saturated at 30 threads for the scene containing 500k particles and at 25 threads for the other scenes. Figure 5.2 shows the speedup graph for the initial simulation. For all three test scenes, linear speedup is present up to 25 threads, but then it levels off. A speedup graph like this was expected, and similar results are seen in previous work by R. Meyer [16]. Most likely, the reason why no improvement is seen with an increased number of threads is the increased number of cache misses and the excessive amount of simultaneous data transfers.

5.2 Improvements

This section presents the examination of the improvements suggested in Section 4.8, some of which have led to great results, while others did not give any significant improvement.

Figure 5.1: Wall clock time graph of the initial implementation of the GS solver run on the Intel Xeon Phi coprocessor.

Figure 5.2: Speedup graph of the initial implementation of the GS solver run on the Intel Xeon Phi coprocessor.

Figure 5.3: Wall clock time graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler.

5.2.1 Task Scheduling

The initial implementation of the granular matter simulation used a centralized task scheduler which, on request from any of the many worker threads, sends additional work to be computed by that thread. During the analysis of the simulation process it was discovered that this approach created a bottleneck, because many threads requested work at a centralized point, causing thread locks. By utilizing the Threading Building Blocks library developed by Intel, the problem could be solved. This was done by promoting all threads to task schedulers. In this manner, as soon as a thread has finished all its work, a global request for more work is made and any of the other threads having excess work can respond with new tasks. As seen in Figures 5.3 and 5.4, this change gave great improvements. The speedup graph in Figure 5.4 shows almost linear speedup up to 40 threads, but then a clear degradation. It is likely that the reason for this is that the maximum throughput between the processing cores is reached.

5.2.2 Vectorization

In Figures 5.5 and 5.6, measurements of the performance using vectorization instructions are presented. Comparing these results with the previous ones in Figures 5.3 and 5.4 shows almost no improvement. The explanation is that too few instructions were vectorized and that there was too much logic disrupting the vectorization pipeline.
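As an illustration of the decentralised scheduling described in Section 5.2.1, the sketch below submits every job to a TBB work-stealing task group, so that idle threads steal work from the local queues of busy threads instead of contending for a central scheduler. The Job type and the function names are illustrative assumptions; this is not the Algoryx implementation.

#include <vector>
#include <tbb/task_group.h>

// A job corresponds to one partition box of constraints (see Section 4.5).
struct Job {
    std::vector<int> constraintIndices;
};

// Placeholder for the per-job projected Gauss-Seidel sweep (Algorithm 1).
void solve(Job& /*job*/) {}

// Submit all jobs to a task group. TBB keeps one work deque per thread; a
// thread that runs out of local work steals tasks from the other threads'
// deques, so no central scheduler has to hand out work.
void runJobs(std::vector<Job>& jobs) {
    tbb::task_group group;
    for (Job& job : jobs)
        group.run([&job] { solve(job); });   // enqueue without blocking
    group.wait();                            // returns when all jobs have finished
}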

Figure 5.4: Speedup graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler.

Figure 5.5: Wall clock time graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler and vectorization instructions.

Figure 5.6: Speedup graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler and vectorization instructions.


Chapter 6

Discussion

6.1 Results in relation to previous work

The results show that the simulation follows linear speedup up to around 40 threads when simulating 500k particles. Compared to the work done in [19], mentioned in Section 1.1, where the speedup in tests with friction coefficients deviated from linearity before 5 processors were used, the results here are good. Admittedly, it is not possible to blindly compare the speedup graphs of the two results like this, because of the big differences in the initial conditions. The simulations in [19] only used 1000 physical objects, compared to 500,000 here, meaning that the amount of computation in relation to communication is much smaller with fewer objects.

6.2 Iterative methods in relation to hardware

In previous work [21], it has been shown that the bottleneck occurring when solving non-trivially parallel tasks on the Xeon Phi architecture is caused by the constraints on memory bandwidth. The problem occurs because a large amount of data must be transferred between the cores and storage, while the operations performed on the data take little time in comparison. This is especially true when running PGS, where the solutions of tasks must continuously be passed from one thread to other threads performing related tasks. When designing solutions to problems, great care must be taken to balance the computation-to-data-access ratio, to have the data set up in a Structure of Arrays (SoA) pattern, and preferably to have data accesses aligned; fulfilling all of these concepts is needed to get as high performance as possible. Furthermore, the data access bottleneck indicates that the system could afford a more computationally intensive solver given the same data access rate.

6.3 Convergence of the projected Gauss-Seidel method

A modification of the GS method that increases the computational work and lessens the communication between threads would fit the Intel Xeon Phi coprocessor hardware better. More research has to be done but, in the best case, this may also improve the convergence rate; currently, however, that is just speculation. Figure 6.1 shows the convergence rate of the currently used GS method, measured for the scene containing 500k particles. The x- and y-axes indicate the number of iterations and the residual of the contact forces, respectively, and it

is clear that a stagnation of the convergence rate has occurred.

Figure 6.1: Convergence rate for the Gauss-Seidel method (500k particles), showing the residuals in the normal and the two tangential directions.

If ways are found to increase the rate of convergence, more accurate solutions for the simulations can be calculated within the given time constraints.

6.4 Examined methods not giving any improvements

Presented here are investigated ways to improve the simulation software that either did not give any significant improvement, or only gave sporadic performance increases without any clear explanation.

Cache prefetching. Some simulation tests were made where cache prefetching was applied to the particle properties (velocity, mass) and the constraints (Jacobians). Tests were made where all of the mentioned values were prefetched, and also where only some of them were. Speed increases of up to 5% were seen in some tests, but no consistency could be found, since a given combination of prefetches caused a speedup in some cases but not in others; for that reason, no results of this are presented.

Huge page size. According to [2], performance may be increased by using huge page sizes for memory allocations on the coprocessor. When an operating system allocates memory, a page is the smallest unit of data that may be allocated. The page size is usually determined by the architecture. For the x86-64 architectures, the standard page size is 4 KB and a huge page size is 2 MB. By using huge page sizes, performance may improve, since variables and buffers are sometimes handled more efficiently. Some simulation tests were run to determine whether using huge page sizes gives any improvement, but no difference in performance was observed.
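For completeness, the kind of software prefetch hint that was tested in the cache prefetching experiments can be sketched as follows. The loop body, the function name, and the look-ahead distance are arbitrary placeholders, and, as reported above, such hints gave no consistent gain in the thesis experiments.

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

// Sum an array while asking the hardware to pull the data for iteration
// i + LOOKAHEAD into the cache ahead of time. The hint is only advisory;
// it may hide memory latency but never changes the result.
double sum_values(const double* values, std::size_t n) {
    constexpr std::size_t LOOKAHEAD = 8;   // placeholder prefetch distance
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + LOOKAHEAD < n)
            _mm_prefetch(reinterpret_cast<const char*>(values + i + LOOKAHEAD),
                         _MM_HINT_T0);
        sum += values[i];
    }
    return sum;
}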


More information

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid

More information

Performance Analysis of the Lattice Boltzmann Method on x86-64 Architectures

Performance Analysis of the Lattice Boltzmann Method on x86-64 Architectures Performance Analysis of the Lattice Boltzmann Method on x86-64 Architectures Jan Treibig, Simon Hausmann, Ulrich Ruede Zusammenfassung The Lattice Boltzmann method (LBM) is a well established algorithm

More information

Accelerating Implicit LS-DYNA with GPU

Accelerating Implicit LS-DYNA with GPU Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,

More information

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency 1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming

More information

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational

More information

Thesis proposal - förslag på examensarbeten

Thesis proposal - förslag på examensarbeten - förslag på examensarbeten 1. Algorithms and software for co simulation 2. Simulation of mining vehicles and granular crash berms 3. Nonsmooth, analytical models for electric machinery for multidomain

More information

CSCE 5160 Parallel Processing. CSCE 5160 Parallel Processing

CSCE 5160 Parallel Processing. CSCE 5160 Parallel Processing HW #9 10., 10.3, 10.7 Due April 17 { } Review Completing Graph Algorithms Maximal Independent Set Johnson s shortest path algorithm using adjacency lists Q= V; for all v in Q l[v] = infinity; l[s] = 0;

More information

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization ECE669: Parallel Computer Architecture Fall 2 Handout #2 Homework # 2 Due: October 6 Programming Multiprocessors: Parallelism, Communication, and Synchronization 1 Introduction When developing multiprocessor

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

Chapter 17 - Parallel Processing

Chapter 17 - Parallel Processing Chapter 17 - Parallel Processing Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 17 - Parallel Processing 1 / 71 Table of Contents I 1 Motivation 2 Parallel Processing Categories

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC

More information

Parallel Performance Studies for an Elliptic Test Problem on the Cluster maya

Parallel Performance Studies for an Elliptic Test Problem on the Cluster maya Parallel Performance Studies for an Elliptic Test Problem on the Cluster maya Samuel Khuvis and Matthias K. Gobbert (gobbert@umbc.edu) Department of Mathematics and Statistics, University of Maryland,

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay!

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay! Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Lecture 2: Memory Systems

Lecture 2: Memory Systems Lecture 2: Memory Systems Basic components Memory hierarchy Cache memory Virtual Memory Zebo Peng, IDA, LiTH Many Different Technologies Zebo Peng, IDA, LiTH 2 Internal and External Memories CPU Date transfer

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

High-Performance and Parallel Computing

High-Performance and Parallel Computing 9 High-Performance and Parallel Computing 9.1 Code optimization To use resources efficiently, the time saved through optimizing code has to be weighed against the human resources required to implement

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Iterative Methods for Linear Systems

Iterative Methods for Linear Systems Iterative Methods for Linear Systems 1 the method of Jacobi derivation of the formulas cost and convergence of the algorithm a Julia function 2 Gauss-Seidel Relaxation an iterative method for solving linear

More information

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0;

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0; How will execution time grow with SIZE? int array[size]; int A = ; for (int i = ; i < ; i++) { for (int j = ; j < SIZE ; j++) { A += array[j]; } TIME } Plot SIZE Actual Data 45 4 5 5 Series 5 5 4 6 8 Memory

More information

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Parallel and Distributed Sparse Optimization Algorithms

Parallel and Distributed Sparse Optimization Algorithms Parallel and Distributed Sparse Optimization Algorithms Part I Ruoyu Li 1 1 Department of Computer Science and Engineering University of Texas at Arlington March 19, 2015 Ruoyu Li (UTA) Parallel and Distributed

More information

Lecture 3: Camera Calibration, DLT, SVD

Lecture 3: Camera Calibration, DLT, SVD Computer Vision Lecture 3 23--28 Lecture 3: Camera Calibration, DL, SVD he Inner Parameters In this section we will introduce the inner parameters of the cameras Recall from the camera equations λx = P

More information

The determination of the correct

The determination of the correct SPECIAL High-performance SECTION: H i gh-performance computing computing MARK NOBLE, Mines ParisTech PHILIPPE THIERRY, Intel CEDRIC TAILLANDIER, CGGVeritas (formerly Mines ParisTech) HENRI CALANDRA, Total

More information

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013 GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»

More information

10th August Part One: Introduction to Parallel Computing

10th August Part One: Introduction to Parallel Computing Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer

More information

Cache Memories. Topics. Next time. Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance

Cache Memories. Topics. Next time. Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Cache Memories Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Next time Dynamic memory allocation and memory bugs Fabián E. Bustamante,

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed

More information

A MATLAB Interface to the GPU

A MATLAB Interface to the GPU Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further

More information

OpenCL implementation of PSO: a comparison between multi-core CPU and GPU performances

OpenCL implementation of PSO: a comparison between multi-core CPU and GPU performances OpenCL implementation of PSO: a comparison between multi-core CPU and GPU performances Stefano Cagnoni 1, Alessandro Bacchini 1,2, Luca Mussi 1 1 Dept. of Information Engineering, University of Parma,

More information

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem

More information

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet Contents 2 F10: Parallel Sparse Matrix Computations Figures mainly from Kumar et. al. Introduction to Parallel Computing, 1st ed Chap. 11 Bo Kågström et al (RG, EE, MR) 2011-05-10 Sparse matrices and storage

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

AMath 483/583 Lecture 11. Notes: Notes: Comments on Homework. Notes: AMath 483/583 Lecture 11

AMath 483/583 Lecture 11. Notes: Notes: Comments on Homework. Notes: AMath 483/583 Lecture 11 AMath 483/583 Lecture 11 Outline: Computer architecture Cache considerations Fortran optimization Reading: S. Goedecker and A. Hoisie, Performance Optimization of Numerically Intensive Codes, SIAM, 2001.

More information

CS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010

CS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010 CS4961 Parallel Programming Lecture 14: Reasoning about Performance Administrative: What s Coming Programming assignment 2 due Friday, 11:59PM Homework assignment out on Tuesday, Oct. 19 and due Monday,

More information

AMath 483/583 Lecture 11

AMath 483/583 Lecture 11 AMath 483/583 Lecture 11 Outline: Computer architecture Cache considerations Fortran optimization Reading: S. Goedecker and A. Hoisie, Performance Optimization of Numerically Intensive Codes, SIAM, 2001.

More information

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform

More information

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

T6: Position-Based Simulation Methods in Computer Graphics. Jan Bender Miles Macklin Matthias Müller

T6: Position-Based Simulation Methods in Computer Graphics. Jan Bender Miles Macklin Matthias Müller T6: Position-Based Simulation Methods in Computer Graphics Jan Bender Miles Macklin Matthias Müller Jan Bender Organizer Professor at the Visual Computing Institute at Aachen University Research topics

More information

smooth coefficients H. Köstler, U. Rüde

smooth coefficients H. Köstler, U. Rüde A robust multigrid solver for the optical flow problem with non- smooth coefficients H. Köstler, U. Rüde Overview Optical Flow Problem Data term and various regularizers A Robust Multigrid Solver Galerkin

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,

More information

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory

More information

Accelerating Double Precision FEM Simulations with GPUs

Accelerating Double Precision FEM Simulations with GPUs Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University

More information

Parallel Computer Architecture and Programming Written Assignment 3

Parallel Computer Architecture and Programming Written Assignment 3 Parallel Computer Architecture and Programming Written Assignment 3 50 points total. Due Monday, July 17 at the start of class. Problem 1: Message Passing (6 pts) A. (3 pts) You and your friend liked the

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

arxiv: v1 [physics.comp-ph] 4 Nov 2013

arxiv: v1 [physics.comp-ph] 4 Nov 2013 arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department

More information

Numerical Algorithms

Numerical Algorithms Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts Computer Architectures Advance CPU Design Tien-Fu Chen National Chung Cheng Univ. Adv CPU-0 MMX technology! Basic concepts " small native data types " compute-intensive operations " a lot of inherent parallelism

More information

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs 3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs H. Knibbe, C. W. Oosterlee, C. Vuik Abstract We are focusing on an iterative solver for the three-dimensional

More information

Data mining with sparse grids

Data mining with sparse grids Data mining with sparse grids Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Data mining with sparse grids p.1/40 Overview What is Data mining? Regularization networks

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

Algebraic Graph Theory- Adjacency Matrix and Spectrum

Algebraic Graph Theory- Adjacency Matrix and Spectrum Algebraic Graph Theory- Adjacency Matrix and Spectrum Michael Levet December 24, 2013 Introduction This tutorial will introduce the adjacency matrix, as well as spectral graph theory. For those familiar

More information

Simulation in Computer Graphics. Deformable Objects. Matthias Teschner. Computer Science Department University of Freiburg

Simulation in Computer Graphics. Deformable Objects. Matthias Teschner. Computer Science Department University of Freiburg Simulation in Computer Graphics Deformable Objects Matthias Teschner Computer Science Department University of Freiburg Outline introduction forces performance collision handling visualization University

More information

Effect of memory latency

Effect of memory latency CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable

More information

Lecture 15: More Iterative Ideas

Lecture 15: More Iterative Ideas Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!

More information

Figure 6.1: Truss topology optimization diagram.

Figure 6.1: Truss topology optimization diagram. 6 Implementation 6.1 Outline This chapter shows the implementation details to optimize the truss, obtained in the ground structure approach, according to the formulation presented in previous chapters.

More information

Advanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm

Advanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm Advanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm Andreas Sandberg 1 Introduction The purpose of this lab is to: apply what you have learned so

More information

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information