Parallel implementation of the projected Gauss-Seidel method on the Intel Xeon Phi processor: Application to granular matter simulation.


Parallel implementation of the projected Gauss-Seidel method on the Intel Xeon Phi processor: Application to granular matter simulation.

Emil Rönnbäck
18th August 2014
Master's Thesis in Computing Science, 30 credits
Supervisor at Algoryx and UMIT Research Lab: Claude Lacoursière
Supervisor at CS-UmU: Stefan Johansson
Examiner: Fredrik Georgsson

Umeå University, Department of Computing Science, SE UMEÅ, SWEDEN


Abstract

Being able to simulate granular matter is important because such materials are ubiquitous both in nature and in industry. Some examples of granular materials are ore, sand, coffee, rice, corn, and snow. Research and development of new, more accurate, and faster methods are needed to simulate even more complex materials with millions of particles. In the work of this thesis, a typical scene containing thousands of particles has been used to analyse simulation performance using the iterative Gauss-Seidel method, adapted to the specifications and capabilities of the Intel Xeon Phi coprocessor. The work began with analysing the performance (wall-clock time and speedup) of a method developed by Algoryx Simulation. The work continued with finding the parts of the code causing bottlenecks and implementing improvements such as a distributed task scheduler and vectorization of operations. In the end, this resulted in shorter execution time and linear speedup using more than 40 threads, compared to 20 in the initial state. We also investigated the benefit of other techniques, such as cache prefetching and the use of huge page sizes, but found no performance gain from these. It is well known that the Xeon Phi coprocessor performs well when executing highly parallel applications, but it may be overloaded if an excessive amount of data is requested by many threads simultaneously. To tackle this issue, the convergence rate of the Gauss-Seidel method during simulation has been measured, and modifications of the method that decrease the data flow have been suggested, implemented, and analysed.


Contents

Abstract
1 Introduction
  1.1 Previous work
  1.2 Related work
2 Algoryx Simulation
3 Problem Description
  3.1 Problem statement
  3.2 Goals
  3.3 Purposes
  3.4 Methods
4 Theory
  4.1 Multibody dynamics simulation
  4.2 Kinematic constraints
  4.3 Iterative Methods
  4.4 The Gauss-Seidel Method
    4.4.1 Projected Gauss-Seidel
    4.4.2 Parallel Gauss-Seidel
  4.5 Spatial partitioning
  4.6 Algorithm description
  4.7 Hardware: The Intel Xeon Phi Coprocessor
  4.8 Current implementation
    4.8.1 Cache prefetching
    4.8.2 Vectorization
5 Results
  5.1 Initial implementation
  5.2 Improvements
    5.2.1 Task Scheduling
    5.2.2 Vectorization
6 Discussion
  6.1 Results in relation to previous work
  6.2 Iterative methods in relation to hardware
  6.3 Convergence of the projected Gauss-Seidel method
  6.4 Examined methods not giving any improvements
7 Conclusions
  7.1 Building for the Intel Xeon Phi Architecture
  7.2 Future work
Acknowledgements
References

List of Tables

4.1 Explanation of variables used in the multibody dynamics equation presented in (4.2)
4.2 Specification of the Intel Xeon Phi Coprocessor architecture compared to the host architecture


List of Figures

3.1 A rotating drum containing 500 thousand particles. This is the example used throughout for tests and benchmarks.
4.1 Two objects, B1 and B2, at distance x1 + r1 − x2 − r2 from each other.
4.2 Spatial partitioning of a scene to be able to solve the motions of particles in parallel, where each region corresponds to a job for one thread.
4.3 A bipartite graph where objects in U are connected through constraints in V.
4.4 A tree of dependence for calculations to be done within each box generated by spatial partitioning.
4.5 Overview of the Intel Xeon Phi architecture.
5.1 Wall clock time graph of the initial implementation of the GS solver run on the Intel Xeon Phi coprocessor.
5.2 Speedup graph of the initial implementation of the GS solver run on the Intel Xeon Phi coprocessor.
5.3 Wall clock time graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler.
5.4 Speedup graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler.
5.5 Wall clock time graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler and vectorization instructions.
5.6 Speedup graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler and vectorization instructions.
6.1 Convergence rate for the Gauss-Seidel method.


Chapter 1

Introduction

Granular materials are ubiquitous in nature. A few examples of this type of matter are gravel in a bucket, coffee beans in a container, and even icebergs. Because of its frequent occurrence, it is also the most common material in industry after water [20]. This is one reason why it is of great interest, both to industry and to academia, to be able to simulate granular matter: for research on how to improve industrial tools and methods, and to gain further knowledge of its behaviour. Technically, this kind of matter is a collection of distinct macroscopic particles that behaves as a solid body when compact, and more like a liquid or gas otherwise. The particles are always in frictional contact, and friction is one of the most important contributions to the overall properties. To simulate a granular material it is necessary to solve a large system of linear equations. To increase the speed of processing and make it possible to simulate a greater number of particles, a parallelizable method is a necessity. In this thesis an initial implementation of a method called Parallel Projected Gauss-Seidel (PPGS), developed at Umeå University [12] and Algoryx Simulation AB [17], is analysed and improved. More information, and the motivation for why PPGS is the best choice for solving very large systems of linear equations, is given in Section 4.1. Improvements are sought both as a general speed increase of the algorithm and as performance improvements on a specific piece of hardware: the Intel Xeon Phi coprocessor. Compared to a common x86 processor, the coprocessor consists of many cores and relies on high memory bandwidth and vector instructions for its high parallelization capability. This combination of properties has not been seen before, which is why it is relevant to find out whether this type of processing unit is of interest for physics simulation software. In the simulation software as a whole, there are several tasks to be performed, some sequentially and others in parallel. This thesis focuses on the most time consuming step in the simulation algorithm, where the motions are calculated. To obtain as efficient a solution as possible, the hardware, the theory of the PPGS method, and how to combine those factors into an optimal product have been studied. The improvements primarily result from using Intel Threading Building Blocks [3] for distributed job scheduling and from assembly-coded functions for vectorized operations specific to the coprocessor. For the performance measurements, simulations are run on a rotating drum containing a large amount of particles¹, see Figure 3.1. The final results for the solver show linear speedup using more than 40 threads, compared to linear speedup up to only 20 threads with the initial implementation.

¹ Scenes with up to 500 thousand particles.

1.1 Previous work

In this section, previous work on simulation of granular matter is presented, with focus on the scalability of the solver used. There exist several methods that can be used to simulate granular matter, and it is therefore of interest to investigate what has been done before and which results were accomplished. In a simulation, there are many parameters that influence the computational performance, such as the physical properties of the simulated matter, which directly affect the mathematical problem that needs to be solved, and how the data structures are stored in memory. Considering the parallelization aspect of the algorithms in the simulation, even though some algorithms are easier to parallelize than others, they might not perform better overall. Also, the properties of the platform, such as ordinary clusters, GPUs, or the Intel Xeon Phi, make some types of calculations easier than others. All of the mentioned topics are currently active fields of research. Below, a selection of three papers addressing the same problems as this thesis is discussed briefly. It is beyond the scope of this thesis to present a comprehensive review of the literature; the focus was on porting an existing method to the Intel Xeon Phi coprocessor.

In [19], M. Renouf, F. Dubois and P. Alart investigated a parallelization approach for the Non Smooth Contact Dynamics method for granular matter simulation, where OpenMP was used for thread communication. The simulations were run on up to 6 processors and, according to the speedup graph presented, relative speedup stayed above 90% of linearity. In another investigation [18], M. Renouf and P. Alart used Conjugate Gradient (CG) type algorithms to solve frictional multi-contact problems applied to granular matter. There it was concluded that the number of iterations a CG algorithm needed to reach reasonable accuracy was only a third of that of a comparable Gauss-Seidel solver. In a multi-threaded implementation, though, there were problems with ill-conditioning which degraded the performance. The CG methods are easier to parallelize than others but, as discussed in the paper, their convergence is to date not very good. A recent paper [23] by V. Visseq, P. Alart and D. Dureisseix presented good results for granular simulations where a domain decomposition approach was used. To solve the contact problems, Gauss-Seidel iterations were applied, and OpenMP was the means of communication between threads. The results included simulations of 200k particles where linear speedup was observed for up to around 20 threads, and satisfactory performance after that too.

Performance can always be better, and investigating new platforms is of interest to explore new possibilities for better performance and results in general. In this thesis, the performance of the relatively new Xeon Phi platform is investigated. It is a many-core architecture, which is why it might be suitable for highly parallelizable, large-scale granular matter simulations. Compared to usual clusters it is cheaper, which is attractive, and even though it is currently slower, it has better performance per Watt.

1.2 Related work

A prototype solution for granular matter simulation using PPGS has been developed at Umeå University [12] and further improved at Algoryx Simulation AB [17].

Chapter 2

Algoryx Simulation

Algoryx Simulation is the company at which I, the author, have been situated during the work on this thesis. The company is a leading provider of software and services for visual and interactive physics-based simulation. The products it currently provides are AgX Dynamics and Dynamics for SpaceClaim for the professional market, and Algodoo for the education market. Algoryx Simulation collaborates closely with UMIT Research Lab at Umeå University, which in turn provided the opportunity and the task of writing this thesis.


Chapter 3

Problem Description

Given a large number of particles interacting almost exclusively through frictional contact forces, by which means can the Parallel Projected Gauss-Seidel method (PPGS) be implemented to achieve good scalability while still keeping the physical correctness? This is the main question of this thesis, and it is answered with the Intel Xeon Phi coprocessor in mind. Other questions that this thesis will try to clarify are: how should the Gauss-Seidel algorithm be parallelized for best performance, where are the bottlenecks, and how can these be taken care of? Taken into account in this work is that the PPGS method is not naturally parallelizable, and that it is communication intensive.

3.1 Problem statement

In fast and efficient simulations of large systems of granular matter with millions of particles, the biggest challenge is to cope with all the involved computations within a given time constraint. Thus, the aim is to decrease the time needed for the calculations. Algorithm optimization for increased speed is mainly what is desired, taking the useful properties but also the limitations of the Intel Xeon Phi coprocessor into consideration.

3.2 Goals

The goals of this thesis are to evaluate and find bottlenecks in an existing implementation of the PPGS method for solving systems of particle interactions, especially considering the hardware properties of the Intel Xeon Phi coprocessor. Further, the goal is to implement the proposed improvements for increased performance when using the Intel Xeon Phi coprocessor.

3.3 Purposes

The purpose of this thesis is to evaluate and improve the PPGS method used in simulations of massive numbers of particles. Improvements are sought in the sense of increased parallelism and computational speed. As in many areas, including this one, solving a task faster is advantageous, since it speeds up the iteration loop of new research and development. On top of this, the architecture of the Intel Xeon Phi is of particular interest, being many-core, x86-compatible, and having a much higher memory bandwidth than common workstation processors [10]. A performance

evaluation of a non-embarrassingly-parallel¹ problem such as granular simulation is of interest, to get an indication of how well the architecture handles such problems. Whatever the result of the evaluation is, it will clearly state the difference from the standard fewer-core architecture and point out bottlenecks in performance, which may be useful for future implementations, both software- and hardware-related.

Figure 3.1: A rotating drum containing 500 thousand particles. This is the example used throughout for tests and benchmarks.

3.4 Methods

To be able to improve the PPGS method, test simulations must be performed and evaluated. As an initial test, scenes consisting of a rotating drum containing a varying number of particles will be used. This is a scene where the involved particles interact as in common everyday scenarios, for example gravel in a bucket. The picture in Figure 3.1 illustrates the drum and the contained granular material. The number of particles within the scenes will range between 50,000 and 500,000. This is more challenging than cases with only resting contacts, because the neighbour lists change dynamically, and both stiction and sliding friction are present. When executing the simulations, a fixed number of time steps will be taken, using 1 up to 228 threads for every scene. These tests will result in time measurements which will be analysed to get an indication of where the bottlenecks are and what is causing them. In the parts of the implementation where modifications may provide improvements, such as the task management and the Gauss-Seidel algorithm, changes will be proposed and implemented, and another iteration of tests will be made.

¹ A problem that requires no or little effort to separate into a number of parallel tasks.

Chapter 4

Theory

When performing multibody dynamics simulations, one of the most time consuming steps is to calculate the forces of interacting objects. The non-smooth method we use requires the solution of linear systems of equations, which is done every time step. Basically, what is sought is the solution to the linear equation

Ax = b,    (4.1)

where A is an n × n matrix, b is a vector of length n, and x = (x_1, x_2, ..., x_n) is the vector of unknowns. One way of doing this is to use a direct method (e.g., Gaussian elimination), which computes the exact solution in finite time. The complexity of such a method is O(n^3), which for large systems becomes very time consuming, and since it also has a high memory footprint it is not well suited for solving large systems of equations [22]. When solving for thousands of interactions every time step under a time constraint, using a direct method is accordingly not possible. Hence, another type of method must be used: an iterative method. Using an iterative method it is possible to get a good enough approximation within a set time constraint. Methods of this type iteratively improve the approximation, either until a target precision is met or a time limit is reached [13].

4.1 Multibody dynamics simulation

When simulating interacting objects, the interaction between the objects can be described by the linear system

$$\begin{bmatrix} M & -G_k^T \\ G_k & \Sigma \end{bmatrix} \begin{bmatrix} v_{k+1} \\ \lambda \end{bmatrix} = \begin{bmatrix} M v_k + h f_k \\ -\frac{4}{h}\Upsilon g_k + \Upsilon G_k v_k \end{bmatrix},    (4.2)$$

which is solved with regard to v and λ. Explanations of the variables appearing in Equation 4.2 are summarized in Table 4.1. In Equation 4.2, each column of M and G represents an interacting object and each row of G is a connection between interacting objects.
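To make the block structure concrete, consider a single contact constraint coupling two bodies a and b; this two-body layout is an illustrative assumption and not an example taken from the thesis. With M_a and M_b the mass blocks of the two bodies and one constraint row with blocks G_a and G_b, the system (4.2) reduces to

$$\begin{bmatrix} M_a & 0 & -G_a^T \\ 0 & M_b & -G_b^T \\ G_a & G_b & \Sigma \end{bmatrix} \begin{bmatrix} v_{a,k+1} \\ v_{b,k+1} \\ \lambda \end{bmatrix} = \begin{bmatrix} M_a v_{a,k} + h f_{a,k} \\ M_b v_{b,k} + h f_{b,k} \\ -\frac{4}{h}\Upsilon g_k + \Upsilon (G_a v_{a,k} + G_b v_{b,k}) \end{bmatrix},$$

so the only non-zero column blocks of the constraint row are those of the two bodies the contact touches.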

  M   Mass matrix: a block diagonal n × n matrix, where each block is a 3 × 3 (in 3D) mass matrix of an interacting object.
  G   Jacobian matrix: a sparse matrix containing the gradients of the functions describing kinematic constraints between objects in the scene (such as collision constraints). This is described in detail in Section 4.2.
  Σ   Diagonal matrix containing the regularization parameters, Σ = (4/h²) diag(ε_1/(1 + 4τ_1/h), ε_2/(1 + 4τ_2/h), ..., ε_m/(1 + 4τ_m/h)) [14, p. 100].
  Υ   Regularization parameter, Υ = diag(1/(1 + 4τ_1/h), 1/(1 + 4τ_2/h), ..., 1/(1 + 4τ_m/h)) [14, p. 100].
  v   Velocity.
  λ   Lagrange multiplier.
  f   Force.
  k   Current time step.
  h   Time step size.
  g   Constraint violation.

Table 4.1: Explanation of variables used in the multibody dynamics equation presented in (4.2).

4.2 Kinematic constraints

When simulating mechanical phenomena, there is a need to mathematically describe the different constraints acting upon objects in a scene. A constraint could, for example, prevent objects from moving through each other during a collision, act as friction between objects, or lock one object to another in a specified way, such as with a hinge or joint. What follows is an example of how a contact constraint is derived and transformed into a useful expression for the rest of the simulation. This is the type of constraint used in granular matter simulations. Consider two objects at positions x_1 and x_2 in world space, as seen in Figure 4.1. The points of collision for B_1 and B_2 are

p_1 = x_1 + r_1  and  p_2 = x_2 + r_2,    (4.3)

where the surface normals at the location of contact are n_1 and n_2 for objects B_1 and B_2, respectively. From this, the penetration depth can be computed as

C = (p_1 − p_2) · n_1 = (x_1 + r_1 − x_2 − r_2) · n_1.    (4.4)

C is the position-based penetration constraint for the contact between B_1 and B_2, which is satisfied as long as the two bodies are separated, C ≥ 0. In the computations done in the simulations of this thesis, the constraints are velocity based. To form such a constraint, Ċ, the expression for C

Figure 4.1: Two objects, B_1 and B_2, at distance x_1 + r_1 − x_2 − r_2 from each other.

must be differentiated with respect to time:

$$\begin{aligned}
\dot{C} &= \frac{d}{dt}\big((x_1 + r_1 - x_2 - r_2)\cdot n_1\big) \\
&= \frac{d}{dt}(x_1 + r_1 - x_2 - r_2)\cdot n_1 + (x_1 + r_1 - x_2 - r_2)\cdot\frac{d}{dt}(n_1) \\
&= (v_1 + \omega_1\times r_1 - v_2 - \omega_2\times r_2)\cdot n_1 + (x_1 + r_1 - x_2 - r_2)\cdot\frac{d}{dt}(n_1) \\
&\approx (v_1 + \omega_1\times r_1 - v_2 - \omega_2\times r_2)\cdot n_1 \\
&= v_1\cdot n_1 + \omega_1\cdot(r_1\times n_1) - v_2\cdot n_1 - \omega_2\cdot(r_2\times n_1) \\
&= \underbrace{\begin{pmatrix} n_1^T & (r_1\times n_1)^T & -n_1^T & -(r_2\times n_1)^T \end{pmatrix}}_{J}
   \underbrace{\begin{pmatrix} v_1 \\ \omega_1 \\ v_2 \\ \omega_2 \end{pmatrix}}_{v}
\end{aligned}    (4.5)$$

The approximation on line four of Equation 4.5 can be made since the term added on line three is zero when the constraint is satisfied. The J part of Ċ is called a Jacobian and is what the matrix G in Equation 4.2 consists of: one Jacobian for every pair of objects in contact [1].

4.3 Iterative Methods

There exists a large number of iterative methods for solving systems of linear equations. Many of these methods (e.g. Jacobi, Gauss-Seidel, Successive over-relaxation) have the form

$$M x^{(k+1)} = N x^{(k)} + b,    (4.6)$$

where A = M − N is a splitting of the matrix A. For a method to be practical, it must be computationally cheap to solve a linear system with M as the matrix. Whether (4.6) converges to x = A^{-1}b depends on the eigenvalues of M^{-1}N, specifically on the size of the spectral radius of an n × n matrix G, defined by

$$\rho(G) = \max\{\, |\lambda| : \lambda \in \sigma(G) \,\},    (4.7)$$

where σ(G) is the set of all eigenvalues of G. In this case G = M^{-1}N, and its spectral radius is of critical importance for the success of these iterative methods.
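Since ρ(M^{-1}N) decides whether a splitting method converges at all, it can be instructive to estimate it numerically. The sketch below does so by power iteration on a small dense iteration matrix; the 3 × 3 test matrix, the function name, and the fixed iteration count are hypothetical choices for illustration (assuming a real dominant eigenvalue), and the sketch is not part of the thesis code.

#include <array>
#include <cmath>
#include <cstdio>

// Estimate the spectral radius rho(T) of a small dense matrix T by power
// iteration: repeatedly apply T to a vector and track the growth factor.
// If rho(T) < 1, the stationary iteration x_(k+1) = M^-1 (N x_k + b) converges.
template <std::size_t N>
double spectral_radius(const std::array<std::array<double, N>, N>& T,
                       int iterations = 200) {
    std::array<double, N> x{};
    x[0] = 1.0;                        // arbitrary non-zero start vector
    double norm = 1.0;
    for (int k = 0; k < iterations; ++k) {
        std::array<double, N> y{};     // y = T * x
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                y[i] += T[i][j] * x[j];
        norm = 0.0;
        for (double value : y) norm += value * value;
        norm = std::sqrt(norm);
        if (norm == 0.0) return 0.0;   // T annihilated x; radius estimate is zero
        for (std::size_t i = 0; i < N; ++i) x[i] = y[i] / norm;
    }
    return norm;                       // growth factor, close to the dominant |eigenvalue|
}

int main() {
    // Hypothetical iteration matrix T = M^-1 N for a 3 x 3 splitting.
    std::array<std::array<double, 3>, 3> T{{{0.0, 0.2, 0.1},
                                            {0.1, 0.0, 0.3},
                                            {0.2, 0.1, 0.0}}};
    std::printf("estimated rho(T) = %.3f\n", spectral_radius(T));
    return 0;
}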

Theorem 4.3.1 [7]. Suppose b ∈ R^n and that A = M − N ∈ R^{n×n} is non-singular. If M is non-singular and the spectral radius of M^{-1}N satisfies the inequality ρ(M^{-1}N) < 1, then the iterates x^{(k)} defined by Mx^{(k+1)} = Nx^{(k)} + b converge to x = A^{-1}b for any starting vector x^{(0)}.

The consequence of Theorem 4.3.1 is that, for the system of linear equations in Equation 4.6, ρ(M^{-1}N) < 1 must hold for convergence to be guaranteed. It should be noted that even if this property is fulfilled for M^{-1}N, the convergence rate may be very slow if the spectral radius is close to 1. Also, every iterative method of the type mentioned above uses its own particular way of splitting the matrix A. This results in differing properties of the matrix operated upon, which in turn influences the convergence rate. One condition that guarantees convergence for these methods is strict diagonal dominance of the matrix A, meaning

$$|a_{ii}| > \sum_{j \neq i} |a_{ij}|, \quad \text{for all } i,    (4.8)$$

where a_{ij} denotes the entry on row i and column j of the matrix [9, p. 511]. Unfortunately, this property is not fulfilled for the matrices involved in computing contact forces.

4.4 The Gauss-Seidel Method

The iterative method used in this thesis to solve the linear system of equations in Equation 4.1 is the Gauss-Seidel method [7]. In this section, the method is introduced and the motivation for why this method is chosen is presented. For the Gauss-Seidel method, the splitting (4.6) of the matrix A is

M = D + L,   N = −U,    (4.9)

where L, D, and U represent the strictly lower triangular, diagonal, and strictly upper triangular parts of A, respectively [9]. In matrix form, the Gauss-Seidel method is expressed as

$$x^{(k)} = (D + L)^{-1}\big(b - U x^{(k-1)}\big),    (4.10)$$

where x^{(k)} denotes x at the kth iteration. The equations are solved one at a time, using the newly computed values x_j^{(k)} as soon as they are available. At iteration k, the next value of x_i^{(k)}, element i of x in the sequence of updates, is obtained from the equation

$$x_i^{(k)} = \frac{b_i - \sum_{j<i} a_{ij} x_j^{(k)} - \sum_{j>i} a_{ij} x_j^{(k-1)}}{a_{ii}}.    (4.11)$$

For Gauss-Seidel, there is an additional property other than strict diagonal dominance which guarantees convergence, which the following theorem states.

Theorem 4.4.1 [7]. If A ∈ R^{n×n} is symmetric and positive definite (SPD), then the Gauss-Seidel iteration converges for any x^{(0)}.

The matrix to solve for in multibody dynamics simulation, given in (4.2), is positive definite but not symmetric, and thus cannot be handled directly by Gauss-Seidel.
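To make the update rule (4.11) concrete, the sketch below performs plain Gauss-Seidel sweeps on a small dense system. The matrix, right-hand side, and sweep count are made-up values for illustration; the thesis solver never forms its system matrix explicitly, so this is not the implementation used in the simulations.

#include <cstdio>
#include <vector>

// One Gauss-Seidel sweep over a dense n x n system A x = b, updating x in place.
// Newly computed components x_j (j < i) are used immediately, exactly as in (4.11).
void gauss_seidel_sweep(const std::vector<std::vector<double>>& A,
                        const std::vector<double>& b, std::vector<double>& x) {
    const std::size_t n = b.size();
    for (std::size_t i = 0; i < n; ++i) {
        double sum = b[i];
        for (std::size_t j = 0; j < n; ++j)
            if (j != i) sum -= A[i][j] * x[j];
        x[i] = sum / A[i][i];
    }
}

int main() {
    // Hypothetical symmetric positive definite test system; by Theorem 4.4.1,
    // Gauss-Seidel is guaranteed to converge for it.
    std::vector<std::vector<double>> A{{4, 1, 0}, {1, 4, 1}, {0, 1, 4}};
    std::vector<double> b{5, 6, 5}, x(3, 0.0);
    for (int sweep = 0; sweep < 25; ++sweep) gauss_seidel_sweep(A, b, x);
    std::printf("x = (%.4f, %.4f, %.4f)\n", x[0], x[1], x[2]);  // approaches (1, 1, 1)
    return 0;
}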

To solve this problem, the Schur complement matrix is used instead, defined as

$$S_\epsilon = G M^{-1} G^T + \Sigma,    (4.12)$$

which is symmetric and positive definite and corresponds to the linear system

$$S_\epsilon \lambda = q - G M^{-1} p.    (4.13)$$

Solving (4.2) is equivalent to solving (4.13) and then substituting the result. This means that, after λ has been computed from (4.13), v can be obtained as

$$v = M^{-1}(G^T \lambda + p).    (4.14)$$

Using Gauss-Seidel iteration, it is possible to obtain λ and v approximately without going through the time consuming task of computing S_ε explicitly. For this, the needed computations are of the form

$$y \leftarrow G x, \qquad z \leftarrow M^{-1} G^T w.    (4.15)$$

Here, G and M^{-1}G^T do not have to be computed explicitly either. Instead, these operations are carried out by traversing the constraints and using a packed format for the entries in the matrices. Using the Schur complement, the main stepping equation can be written as v = v^{(0)} + hM^{-1}f^{(0)} + M^{-1}G^T λ, which is interpreted as

$$\begin{aligned}
v^{(1)} &= v^{(0)} + h M^{-1} f^{(0)} \\
f^{(c)} &= G^T \lambda \\
v &= v^{(1)} + M^{-1} f^{(c)}
\end{aligned}    (4.16)$$

where v^{(1)} is the predicted final velocity if no constraint forces are active, f^{(c)} is the impulse caused by the constraints, and v is the final velocity. The main linear problem to solve is thus

$$S_\epsilon \lambda = q - G v^{(1)}.    (4.17)$$

Gauss-Seidel is one out of several methods which may be used to solve such systems of equations; Jacobi, Conjugate Gradient, and Successive Over-Relaxation are some others, to name a few. These methods improve the solution with each additional iteration, not delivering the exact answer but rather a sufficiently good approximation within given time constraints. The arguments for using Gauss-Seidel as the iterative method are as follows. Because the Gauss-Seidel solution improves smoothly with every iteration, it is possible to terminate the solver when time is up, and the method is also easy to implement in software. Even though other methods can converge faster in theory, PGS is the only known method which gives satisfactory results given the current techniques. For a thorough discussion and analysis, see [14, p. 318].

4.4.1 Projected Gauss-Seidel

Projected Gauss-Seidel is a modified version of the original that solves linear complementarity problems (LCP), or rather mixed linear complementarity problems (MLCP), which are a generalization of LCP that includes free variables. In the context of this thesis, the MLCP is connected to physical properties in the scene, which will be explained below. An LCP is used to model a discontinuous relationship between two variables, such as the one arising during contacts between granules in the simulations done in this thesis. In the following, an inequality relationship (≤) between vectors will be used, where the inequality holds pairwise for each element of the vectors.

Precisely, u = (u_1, u_2, ..., u_N)^T ≤ (v_1, v_2, ..., v_N)^T = v holds if and only if u_i ≤ v_i for 1 ≤ i ≤ N. Given a symmetric matrix A ∈ R^{n×n} and a vector b ∈ R^n, the problem is to find x ∈ R^n and w ∈ R^n such that

w = Ax − b ≥ 0,    (4.18)
x ≥ 0,    (4.19)
x^T w = 0.    (4.20)

The idea used to reach a solution is to use the splitting mentioned in (4.9) together with a projection operation. Given a vector x ∈ R^n, a vector of lower limits x_lo ≤ 0, and a vector of upper limits x_hi ≥ 0, where x_lo, x_hi ∈ R^n, the projection operation on x is written (x)^+ and means that, for each j = 0 to n − 1,

$$x_j^+ = \max(\min(x_j, x_{hi_j}), x_{lo_j}),    (4.21)$$

as shown by [8, p. 629].

4.4.2 Parallel Gauss-Seidel

The Parallel Gauss-Seidel method is an extended version of the standard Gauss-Seidel method with the capability to be run in parallel, thus solving problems faster. For many mechanical systems, and especially for systems of granules, the Jacobians in G (4.2) have a simple block structure

$$G^T = \begin{bmatrix} G_1^T & G_2^T & \dots & G_{n_c}^T \end{bmatrix},    (4.22)$$

where each block row G_i^T is of size n_i × n, with Σ_i n_i = m, and n_c is the total number of constraints. In every block row G_i^T there is only a small number of non-zero column blocks, usually one or two, as is common for contact constraints. Most commonly, each block column corresponds to a single physical object, which makes the system parallelizable over unrelated objects and constraints. Figure 4.3 gives a descriptive view of this: a bipartite graph containing nodes which are either objects or constraints, where groups of nodes not connected through edges are independent of each other. This is also what makes it possible to bypass the computation of the Schur complement in (4.17) mentioned above [13].

4.5 Spatial partitioning

When simulating many thousands of granules, as in our simulations, there is a need for further improvements beyond identifying independent groups of constrained objects as stated above. The following describes how the simulation software divides the simulation volume spatially into boxes, which ensures that constraints in one box only reach neighbouring boxes, establishing a method for parallel execution. Consider the right picture in Figure 4.2, which depicts a spatial partitioning of a volume containing particles, where the main partitioning generates equally sized cubes. The constraints between particles within each box are solved locally by different threads, which enables parallel execution. In places where particles share a constraint between boxes, the volume is split up in a more fine-grained way. Doing this creates new boxes that also contain constraints independent of information outside their enclosures. When calculations are carried out, the results for all boxes that are not directly adjacent to one another can be computed in parallel.
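As an illustration of this box-parallel execution, and of the staged ordering it leads to (cf. the dependency tree in Figure 4.4), the sketch below processes the boxes stage by stage with a parallel loop inside each stage. The Box type, the stage layout, and the use of tbb::parallel_for are illustrative assumptions and not the thesis implementation.

#include <cstddef>
#include <vector>
#include <tbb/parallel_for.h>

// A partition box holding the indices of the constraints it owns.
struct Box {
    std::vector<int> constraintIndices;
};

// Placeholder for the per-box work (a projected Gauss-Seidel sweep over the
// constraints in the box); left empty to keep the sketch self-contained.
void solveBox(Box& /*box*/) {}

// Boxes within one stage share no bodies, so they can be solved concurrently;
// the blocking parallel_for acts as a barrier between consecutive stages.
void solveStages(std::vector<std::vector<Box*>>& stages) {
    for (std::vector<Box*>& stage : stages) {   // e.g. internal, edge, then corner boxes
        tbb::parallel_for(std::size_t(0), stage.size(),
                          [&](std::size_t i) { solveBox(*stage[i]); });
    }
}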

Figure 4.2: Spatial partitioning of a scene in order to solve the motions of particles in parallel, where each region corresponds to a job for one thread.

Each of the boxes in the picture is eight particle diameters wide, which allows the partitioning algorithm to be optimized, knowing that there can only be one contact constraint along a straight path between two blue sub-boxes. In other words, each box is eight units wide in a simulation where the particles have a diameter of one unit. Having this box size restricts the number of contact constraints that fit in some dimension(s) of the inner sub-boxes, depending on their type, which grants an opportunity for optimization. The left picture in Figure 4.2 shows the partitioning in 2D, giving a clearer view than the 3D view to the right. It should be noted that the splitting in 3D differs from the 2D case in that it contains an additional type of sub-box. The result of this splitting is a tree of dependences, where the calculations for some boxes need to be done before others can be handled. This dependency graph is displayed in Figure 4.4, where the calculations for each type of box are done in stages. Relating the partitioning to the software, each box of calculations represents an independent task which is enqueued in the task scheduler, which in turn has the responsibility of handing out work to unoccupied threads.

4.6 Algorithm description

When performing physics simulations, there are a number of tasks which must be performed for every object that is in contact with another, in order to resolve the motion. These tasks are given in Algorithm 1. The variables used in the algorithm are the ones described in Table 4.1, where the values of v_0 and λ_0 are those of v and λ from the previous time step, respectively.

4.7 Hardware: The Intel Xeon Phi Coprocessor

The targeted hardware for the computations is the Intel Xeon Phi coprocessor [4, 5], which is a so-called Many Integrated Core architecture product for large-scale parallel processing. The Intel Xeon Phi coprocessor is connected to a hosting Intel Xeon processor as a PCI Express (PCIe) add-on card.

Figure 4.3: A bipartite graph where objects in U are connected through constraints in V.

Figure 4.4: A tree of dependence for the calculations to be done within each box generated by the spatial partitioning (stages: Start, internal boxes, edge boxes, corner boxes, Stop).

Algorithm: Projected Gauss-Seidel
  Given b, M, G, ε, h, f_0
  Initialize v ← v^(0) + hM^(-1) f_0, λ ← λ_0
  Compute the blocks d_ii ← Σ_b G_ib M_bb^(-1) G_ib^T, for i = 1, 2, ..., n_c
  repeat
    for i = 1, 2, ..., n_c do
      r ← b_i + n_i · v_b1 + ε λ_i^(ν)
      if b_2 ≠ 0 then
        r ← r − n_i · v_b2
      end
      z ← max(0, λ_i^(ν) − r / d_ii)
      Δλ_i^(ν+1) ← z − λ_i^(ν)
      λ_i^(ν+1) ← z
      v_b1 ← v_b1 + M_b1b1^(-1) G_ib1^T Δλ_i
      if b_2 ≠ 0 then
        v_b2 ← v_b2 + M_b2b2^(-1) G_ib2^T Δλ_i
      end
    end
  until time is up

Algorithm 1: Algorithm used when simulating particle interactions. Running this algorithm is equivalent to performing GS iterations on the Schur complement stated in (4.17).

In Table 4.2, a summary of the main properties of the Intel Xeon Phi coprocessor compared to the properties of the Intel Xeon host processor is presented. This hardware is aimed at achieving high throughput performance where the available space and power are constrained. An advantage of this hardware is that it is programmed very similarly to the standard x86 architecture, while still offering a high degree of parallelization. Standard C, C++, and Fortran source code can, without any modifications, be compiled and run on the coprocessor. This is in contrast to other architectures, such as various GPUs, where existing implementations need to be completely rewritten because of significant differences in how the programming is done. The coprocessor is also supported by a rich development environment which includes compilers, numerous libraries for tasks such as threading and math computation, and tools for performance tuning and debugging. However, there is no real IDE available on the coprocessor itself, which puts constraints on the code development. The coprocessor is suitable for highly parallel applications where the ratio of computation to data access is high. Applications can either be run natively on the coprocessor or, by using instructions in the source code, run some parts on the host processor and send the highly parallelizable parts to the coprocessor. The gain of doing this is faster total execution, since the clock frequency of the host processor is more than two times higher than that of the coprocessor. The Xeon Phi coprocessor combines 57 cores on a single chip, where the cores are connected to each other by the bidirectional Interprocessor Network (IPN) ring.

Each core has 512-bit registers for Single Instruction, Multiple Data (SIMD) instructions. The L2 cache memory is shared among the group of four hardware threads residing in each core, but the L2 memory as a whole is kept coherent among all cores. This means that the total L2 cache memory size sums up to almost 30 MB. If threads in different cores perform work on the same data, each core will have its own local copy of the data. The cores have dual in-order execution pipelines, which are simpler than out-of-order pipelines; as a result, threads stall when data is fetched from memory, in contrast to the host cores, where it is possible to execute other instructions while waiting [15].

                               Coprocessor (model: 3120)    Host (model: E5-2670)
  Core frequency               1.1 GHz                      2.6 GHz
  Number of cores              57                           8
  Hardware threads per core    4                            2
  L1 cache size                32 kB per core               32 kB per core
  L2 cache size                31 MB (512 kB × 62)          2 MB
  Max. memory bandwidth        240 GB/s                     51 GB/s

Table 4.2: Specification of the Intel Xeon Phi coprocessor architecture compared to the host architecture [11].

Figure 4.5: Overview of the Intel Xeon Phi architecture.

Figure 4.5 shows a graphical description of the coprocessor architecture and how it relates to the rest of the computer system. To the left is the Intel Xeon host processor, to which the coprocessor is connected through a PCIe bus. A lightweight Linux distribution is installed on the card, which is accessed through Secure Shell (SSH) from the host processor. There is also a TCP/IP stack implemented over the PCIe bus, which allows users to access the coprocessor as a network node. To the right, the main parts of the coprocessor can be seen: the processing cores, caches, GDDR memory controllers, and a very high-bandwidth bidirectional ring which interconnects the system. The PCIe client logic and the memory controllers provide direct access to the PCIe bus and to the memory, respectively, without any intervention from the host processor. Each core possesses its own L2 cache, which is kept coherent by a globally distributed tag directory (TD). The bidirectional ring used for communication actually consists of five independent rings in each direction. The widest and most expensive is the data block ring, which is 64 bytes wide to support the high bandwidth requirement of the many connected cores. The other four rings are smaller; two are used for sending read/write commands and memory addresses, and the other two for acknowledgement messages. For the cores to process work as fast as possible, the delays that occur when new data is fetched into the cache must be worked around. In the coprocessor, this is done by allowing four hardware threads to run simultaneously, which reduces the wait time [6].
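As an illustration of the offload execution model mentioned above, the Intel compiler's Language Extensions for Offload can mark a region of code to run on the coprocessor. The snippet below is a generic sketch with invented function and buffer names; it is not taken from the thesis code and requires the Intel C/C++ compiler with offload support.

#include <cstddef>

// Scale a buffer on the coprocessor using the Intel offload pragma. The inout
// clause copies the buffer over PCIe to the card before the region runs and
// back afterwards; without a coprocessor the region falls back to the host.
void scale_on_mic(double* data, std::size_t n, double factor) {
#pragma offload target(mic : 0) inout(data : length(n))
    {
#pragma omp parallel for
        for (long i = 0; i < static_cast<long>(n); ++i)
            data[i] *= factor;
    }
}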

4.8 Current implementation

Using the Xeon Phi architecture, parts of an implementation can be run on the host chip, with its higher single-thread performance, while other parts that are more suitable for parallelization can be executed on the coprocessor. The main factors influencing performance are the scalability of the algorithms, vectorization of instructions, and memory utilization. These properties are of particular interest in the following analysis and discussion of the initial implementation and the future improvements of the PPGS method used for particle simulations. What follows are explanations of the different areas within the PPGS algorithm where improvements have been proposed.

4.8.1 Cache prefetching

There are possibilities to improve performance by applying cache prefetching for certain data in the code. This hides the latencies arising when data is read from memory. Prefetching means that explicit instructions are given to the processor to fetch data from main memory into the cache just before it is needed. Higher performance may result from this because threads are not forced to stall due to memory delays. By default, hardware prefetching is used, meaning that mechanisms in the hardware are responsible for prefetching. This functionality is triggered when the software tries to access data not found in the cache, in which case multiple cache lines are fetched from main memory. Instead of letting the hardware do the prefetching automatically, instructions in the software code can be used to base prefetching on knowledge of future memory accesses instead of on cache misses.

4.8.2 Vectorization

To get good performance out of the Intel Xeon Phi coprocessor it is necessary to use vectorization instructions. This means taking advantage of the 512-bit wide SIMD registers residing in the cores of the coprocessor. Ways of doing this range from using optimized library functions and writing assembly code to calling intrinsic functions that mimic assembly; the last method is the one used in this work. In the code, this corresponds to calling functions that are executed on specialized hardware, instead of using the usual mathematical operators such as * and +. The input arguments to these functions are 512 bits of aligned data, which corresponds to 8 doubles or 16 floats. Operations are then carried out pairwise on the values of the input, i.e., for the arguments A = {a_1, ..., a_n} and B = {b_1, ..., b_n} and the return value C = {c_1, ..., c_n}, the n operations a_i ∘ b_i = c_i are done simultaneously, where n is 8 for doubles or 16 for floats. Since these operations are executed on specialized hardware, fewer clock cycles are required, which in this case equals faster execution. To give the reader an understanding of and insight into what intrinsics are, a few examples out of many are given in Listings 4.1 to 4.3.

Listing 4.1: Multiply instruction
__m512d _mm512_mul_pd(__m512d a, __m512d b);

Listing 4.2: Multiply-add instruction
__m512d _mm512_fmadd_pd(__m512d a, __m512d b, __m512d c);

Listing 4.3: Reduce-add instruction
double _mm512_reduce_add_pd(__m512d a);

The functions in Listings 4.1 to 4.3 operate on packed double-precision vectors taken as arguments. In Listing 4.1, the function multiplies the elements of a and b and returns the result. The function in Listing 4.2 multiplies the elements of a and b, adds the elements of c to the products, and returns the result. In Listing 4.3, the function sums all the elements of a and returns the result.
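To show how these intrinsics compose, the sketch below computes a dot product of two 64-byte aligned arrays eight doubles at a time. The function name and the alignment and length assumptions are illustrative; this is a minimal example, not the vectorized kernel used in the thesis.

#include <immintrin.h>
#include <cstddef>

// Dot product of two 64-byte aligned arrays whose length is a multiple of 8,
// using the intrinsics from Listings 4.1-4.3: fused multiply-add accumulates
// eight partial products per step, and reduce-add collapses the lanes at the end.
double dot_product_512(const double* a, const double* b, std::size_t n) {
    __m512d acc = _mm512_setzero_pd();
    for (std::size_t i = 0; i < n; i += 8) {
        __m512d va = _mm512_load_pd(a + i);   // aligned 8-wide load
        __m512d vb = _mm512_load_pd(b + i);
        acc = _mm512_fmadd_pd(va, vb, acc);   // acc += va * vb, element-wise
    }
    return _mm512_reduce_add_pd(acc);         // horizontal sum of the 8 lanes
}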

Chapter 5

Results

What follows in this chapter are the results of what has been accomplished during the work on this thesis. A discussion of the results, the Gauss-Seidel method, and iterative methods follows in the next chapter. For analysis and testing of the modifications to the Gauss-Seidel iteration code, a reference scene consisting of a drum containing 50,000 to 500,000 particles has been used. Performance graphs are presented with both time and speedup measurements. For the speedup, Equation 5.1 is used, where S is the speedup and T_s and T_p are the times it takes to run the simulation on the coprocessor on a single thread and in parallel, respectively:

S = T_s / T_p.    (5.1)

5.1 Initial implementation

The initial implementation of the granular simulation had already been done by Algoryx Simulation. For later comparison with the improvements made in this thesis, the initial implementation was first run on the coprocessor without any modifications. In Figure 5.1 the initial time measurements can be seen for simulations containing various numbers of particles. As seen, the coprocessor gets saturated at 30 threads for the scene containing 500k particles and at 25 threads for the other scenes. Figure 5.2 shows the speedup graph for the initial simulation. For all three test scenes, linear speedup is present up to 25 threads, but then it levels off. A speedup graph like this was expected, and similar results are seen in previous work by R. Meyer [16]. Most likely, the reason why no improvement is seen with an increased number of threads is the increased number of cache misses and the excessive amount of simultaneous data transfers.

5.2 Improvements

This section presents the examination of the improvements suggested in Section 4.8, some of which have led to great results, while others did not give any significant improvement.

Figure 5.1: Wall clock time graph of the initial implementation of the GS solver run on the Intel Xeon Phi coprocessor.

Figure 5.2: Speedup graph of the initial implementation of the GS solver run on the Intel Xeon Phi coprocessor.

Figure 5.3: Wall clock time graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler.

5.2.1 Task Scheduling

The initial implementation of the granular matter simulation used a centralized task scheduler which, on request from any of the many worker threads, sends additional work to be computed by that thread. During the analysis of the simulation process it was discovered that this approach created a bottleneck, because many threads requested work at a centralized point, causing thread locks. By utilizing the Threading Building Blocks library developed by Intel, the problem could be solved. This was done by promoting all threads to task schedulers. In this manner, as soon as a thread has finished all its work, a global request for more work is made and any of the other threads having excess work can respond with new tasks. As seen in Figures 5.3 and 5.4, this change gave great improvements. The speedup graph in Figure 5.4 shows almost linear speedup up to 40 threads, but then a clear degradation. It is likely that the reason for this is that the maximum throughput between the processing cores is reached.

5.2.2 Vectorization

In Figures 5.5 and 5.6, measurements of the performance using vectorization instructions are presented. Comparing these results with the previous ones in Figures 5.3 and 5.4 shows almost no improvement. The explanation is that too few instructions were vectorized and that there was too much logic disrupting the vectorization pipeline.
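As an illustration of the decentralised scheduling described in Section 5.2.1, the sketch below submits every job to a TBB work-stealing task group, so that idle threads steal work from the local queues of busy threads instead of contending for a central scheduler. The Job type and the function names are illustrative assumptions; this is not the Algoryx implementation.

#include <vector>
#include <tbb/task_group.h>

// A job corresponds to one partition box of constraints (see Section 4.5).
struct Job {
    std::vector<int> constraintIndices;
};

// Placeholder for the per-job projected Gauss-Seidel sweep (Algorithm 1).
void solve(Job& /*job*/) {}

// Submit all jobs to a task group. TBB keeps one work deque per thread; a
// thread that runs out of local work steals tasks from the other threads'
// deques, so no central scheduler has to hand out work.
void runJobs(std::vector<Job>& jobs) {
    tbb::task_group group;
    for (Job& job : jobs)
        group.run([&job] { solve(job); });   // enqueue without blocking
    group.wait();                            // returns when all jobs have finished
}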

Figure 5.4: Speedup graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler.

Figure 5.5: Wall clock time graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler and vectorization instructions.

Figure 5.6: Speedup graph of the GS solver run on the Intel Xeon Phi coprocessor, modified to use the Intel Threading Building Blocks distributed task scheduler and vectorization instructions.


Chapter 6

Discussion

6.1 Results in relation to previous work

The results show that the simulation follows linear speedup up to around 40 threads when simulating 500k particles. Compared to the work done in [19], mentioned in Section 1.1, where the speedup in tests with friction coefficients deviated from linearity before 5 processors were used, the results here are good. Admittedly, it is not possible to blindly compare the speedup graphs of the two results like this, because of the big differences in the initial conditions. The simulations in [19] only used 1000 physical objects, compared to 500,000 here, meaning that the amount of computation in relation to communication is much smaller with fewer objects.

6.2 Iterative methods in relation to hardware

In previous work [21], it has been shown that the bottleneck occurring when solving non-trivially parallel tasks on the Xeon Phi architecture is caused by the constraints on memory bandwidth. The problem occurs because a large amount of data must be transferred between the cores and storage, while the operations performed on the data take little time in comparison. This is especially true when running PGS, where the solutions of tasks must continuously be passed from one thread to other threads performing related tasks. When designing solutions to problems, great care must be taken to balance the computation-to-data-access ratio, to have the data set up in a Structure of Arrays (SoA) pattern, and preferably to have data accesses aligned; fulfilling all of these concepts is needed to get as high performance as possible. Furthermore, the data access bottleneck indicates that the system could afford a more computationally intensive solver given the same data access rate.

6.3 Convergence of the projected Gauss-Seidel method

A modification of the GS method that increases the computational work and lessens the communication between threads would fit the Intel Xeon Phi coprocessor hardware better. More research has to be done but, in the best case, this may also improve the convergence rate; currently, however, that is just speculation. Figure 6.1 shows the convergence rate of the currently used GS method, measured for the scene containing 500k particles. The x- and y-axes indicate the number of iterations and the residual of the contact forces, respectively, and it

is clear that a stagnation of the convergence rate has occurred.

Figure 6.1: Convergence rate for the Gauss-Seidel method (500k particles), showing the residuals in the normal and the two tangential directions.

If ways are found to increase the rate of convergence, more accurate solutions for the simulations can be calculated within the given time constraints.

6.4 Examined methods not giving any improvements

Presented here are investigated ways to improve the simulation software that either did not give any significant improvement, or only gave sporadic performance increases without any clear explanation.

Cache prefetching. Some simulation tests were made where cache prefetching was applied to the particle properties (velocity, mass) and the constraints (Jacobians). Tests were made where all of the mentioned values were prefetched, and also where only some of them were. Speed increases of up to 5% were seen in some tests, but no consistency could be found, since a given combination of prefetches caused a speedup in some cases but not in others; for that reason, no results of this are presented.

Huge page size. According to [2], performance may be increased by using huge page sizes for memory allocations on the coprocessor. When an operating system allocates memory, a page is the smallest unit of data that may be allocated. The page size is usually determined by the architecture. For the x86-64 architectures, the standard page size is 4 KB and a huge page size is 2 MB. By using huge page sizes, performance may improve, since variables and buffers are sometimes handled more efficiently. Some simulation tests were run to determine whether using huge page sizes gives any improvement, but no difference in performance was observed.
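For completeness, the kind of software prefetch hint that was tested in the cache prefetching experiments can be sketched as follows. The loop body, the function name, and the look-ahead distance are arbitrary placeholders, and, as reported above, such hints gave no consistent gain in the thesis experiments.

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

// Sum an array while asking the hardware to pull the data for iteration
// i + LOOKAHEAD into the cache ahead of time. The hint is only advisory;
// it may hide memory latency but never changes the result.
double sum_values(const double* values, std::size_t n) {
    constexpr std::size_t LOOKAHEAD = 8;   // placeholder prefetch distance
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + LOOKAHEAD < n)
            _mm_prefetch(reinterpret_cast<const char*>(values + i + LOOKAHEAD),
                         _MM_HINT_T0);
        sum += values[i];
    }
    return sum;
}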


More information

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid

More information

Performance Analysis of the Lattice Boltzmann Method on x86-64 Architectures

Performance Analysis of the Lattice Boltzmann Method on x86-64 Architectures Performance Analysis of the Lattice Boltzmann Method on x86-64 Architectures Jan Treibig, Simon Hausmann, Ulrich Ruede Zusammenfassung The Lattice Boltzmann method (LBM) is a well established algorithm

More information

Accelerating Implicit LS-DYNA with GPU

Accelerating Implicit LS-DYNA with GPU Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,

More information

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency 1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming

More information

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational

More information

Thesis proposal - förslag på examensarbeten

Thesis proposal - förslag på examensarbeten - förslag på examensarbeten 1. Algorithms and software for co simulation 2. Simulation of mining vehicles and granular crash berms 3. Nonsmooth, analytical models for electric machinery for multidomain

More information

CSCE 5160 Parallel Processing. CSCE 5160 Parallel Processing

CSCE 5160 Parallel Processing. CSCE 5160 Parallel Processing HW #9 10., 10.3, 10.7 Due April 17 { } Review Completing Graph Algorithms Maximal Independent Set Johnson s shortest path algorithm using adjacency lists Q= V; for all v in Q l[v] = infinity; l[s] = 0;

More information

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization ECE669: Parallel Computer Architecture Fall 2 Handout #2 Homework # 2 Due: October 6 Programming Multiprocessors: Parallelism, Communication, and Synchronization 1 Introduction When developing multiprocessor

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

Chapter 17 - Parallel Processing

Chapter 17 - Parallel Processing Chapter 17 - Parallel Processing Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 17 - Parallel Processing 1 / 71 Table of Contents I 1 Motivation 2 Parallel Processing Categories

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC

More information

Parallel Performance Studies for an Elliptic Test Problem on the Cluster maya

Parallel Performance Studies for an Elliptic Test Problem on the Cluster maya Parallel Performance Studies for an Elliptic Test Problem on the Cluster maya Samuel Khuvis and Matthias K. Gobbert (gobbert@umbc.edu) Department of Mathematics and Statistics, University of Maryland,

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay!

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay! Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Lecture 2: Memory Systems

Lecture 2: Memory Systems Lecture 2: Memory Systems Basic components Memory hierarchy Cache memory Virtual Memory Zebo Peng, IDA, LiTH Many Different Technologies Zebo Peng, IDA, LiTH 2 Internal and External Memories CPU Date transfer

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

High-Performance and Parallel Computing

High-Performance and Parallel Computing 9 High-Performance and Parallel Computing 9.1 Code optimization To use resources efficiently, the time saved through optimizing code has to be weighed against the human resources required to implement

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Iterative Methods for Linear Systems

Iterative Methods for Linear Systems Iterative Methods for Linear Systems 1 the method of Jacobi derivation of the formulas cost and convergence of the algorithm a Julia function 2 Gauss-Seidel Relaxation an iterative method for solving linear

More information

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0;

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0; How will execution time grow with SIZE? int array[size]; int A = ; for (int i = ; i < ; i++) { for (int j = ; j < SIZE ; j++) { A += array[j]; } TIME } Plot SIZE Actual Data 45 4 5 5 Series 5 5 4 6 8 Memory

More information

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Parallel and Distributed Sparse Optimization Algorithms

Parallel and Distributed Sparse Optimization Algorithms Parallel and Distributed Sparse Optimization Algorithms Part I Ruoyu Li 1 1 Department of Computer Science and Engineering University of Texas at Arlington March 19, 2015 Ruoyu Li (UTA) Parallel and Distributed

More information

Lecture 3: Camera Calibration, DLT, SVD

Lecture 3: Camera Calibration, DLT, SVD Computer Vision Lecture 3 23--28 Lecture 3: Camera Calibration, DL, SVD he Inner Parameters In this section we will introduce the inner parameters of the cameras Recall from the camera equations λx = P

More information

The determination of the correct

The determination of the correct SPECIAL High-performance SECTION: H i gh-performance computing computing MARK NOBLE, Mines ParisTech PHILIPPE THIERRY, Intel CEDRIC TAILLANDIER, CGGVeritas (formerly Mines ParisTech) HENRI CALANDRA, Total

More information

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013 GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»

More information

10th August Part One: Introduction to Parallel Computing

10th August Part One: Introduction to Parallel Computing Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer

More information

Cache Memories. Topics. Next time. Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance

Cache Memories. Topics. Next time. Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Cache Memories Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Next time Dynamic memory allocation and memory bugs Fabián E. Bustamante,

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed

More information

A MATLAB Interface to the GPU

A MATLAB Interface to the GPU Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further

More information

OpenCL implementation of PSO: a comparison between multi-core CPU and GPU performances

OpenCL implementation of PSO: a comparison between multi-core CPU and GPU performances OpenCL implementation of PSO: a comparison between multi-core CPU and GPU performances Stefano Cagnoni 1, Alessandro Bacchini 1,2, Luca Mussi 1 1 Dept. of Information Engineering, University of Parma,

More information

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem

More information

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet Contents 2 F10: Parallel Sparse Matrix Computations Figures mainly from Kumar et. al. Introduction to Parallel Computing, 1st ed Chap. 11 Bo Kågström et al (RG, EE, MR) 2011-05-10 Sparse matrices and storage

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

AMath 483/583 Lecture 11. Notes: Notes: Comments on Homework. Notes: AMath 483/583 Lecture 11

AMath 483/583 Lecture 11. Notes: Notes: Comments on Homework. Notes: AMath 483/583 Lecture 11 AMath 483/583 Lecture 11 Outline: Computer architecture Cache considerations Fortran optimization Reading: S. Goedecker and A. Hoisie, Performance Optimization of Numerically Intensive Codes, SIAM, 2001.

More information

CS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010

CS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010 CS4961 Parallel Programming Lecture 14: Reasoning about Performance Administrative: What s Coming Programming assignment 2 due Friday, 11:59PM Homework assignment out on Tuesday, Oct. 19 and due Monday,

More information

AMath 483/583 Lecture 11

AMath 483/583 Lecture 11 AMath 483/583 Lecture 11 Outline: Computer architecture Cache considerations Fortran optimization Reading: S. Goedecker and A. Hoisie, Performance Optimization of Numerically Intensive Codes, SIAM, 2001.

More information

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform

More information

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

T6: Position-Based Simulation Methods in Computer Graphics. Jan Bender Miles Macklin Matthias Müller

T6: Position-Based Simulation Methods in Computer Graphics. Jan Bender Miles Macklin Matthias Müller T6: Position-Based Simulation Methods in Computer Graphics Jan Bender Miles Macklin Matthias Müller Jan Bender Organizer Professor at the Visual Computing Institute at Aachen University Research topics

More information

smooth coefficients H. Köstler, U. Rüde

smooth coefficients H. Köstler, U. Rüde A robust multigrid solver for the optical flow problem with non- smooth coefficients H. Köstler, U. Rüde Overview Optical Flow Problem Data term and various regularizers A Robust Multigrid Solver Galerkin

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,

More information

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory

More information

Accelerating Double Precision FEM Simulations with GPUs

Accelerating Double Precision FEM Simulations with GPUs Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University

More information

Parallel Computer Architecture and Programming Written Assignment 3

Parallel Computer Architecture and Programming Written Assignment 3 Parallel Computer Architecture and Programming Written Assignment 3 50 points total. Due Monday, July 17 at the start of class. Problem 1: Message Passing (6 pts) A. (3 pts) You and your friend liked the

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

arxiv: v1 [physics.comp-ph] 4 Nov 2013

arxiv: v1 [physics.comp-ph] 4 Nov 2013 arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department

More information

Numerical Algorithms

Numerical Algorithms Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts Computer Architectures Advance CPU Design Tien-Fu Chen National Chung Cheng Univ. Adv CPU-0 MMX technology! Basic concepts " small native data types " compute-intensive operations " a lot of inherent parallelism

More information

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs 3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs H. Knibbe, C. W. Oosterlee, C. Vuik Abstract We are focusing on an iterative solver for the three-dimensional

More information

Data mining with sparse grids

Data mining with sparse grids Data mining with sparse grids Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Data mining with sparse grids p.1/40 Overview What is Data mining? Regularization networks

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

Algebraic Graph Theory- Adjacency Matrix and Spectrum

Algebraic Graph Theory- Adjacency Matrix and Spectrum Algebraic Graph Theory- Adjacency Matrix and Spectrum Michael Levet December 24, 2013 Introduction This tutorial will introduce the adjacency matrix, as well as spectral graph theory. For those familiar

More information

Simulation in Computer Graphics. Deformable Objects. Matthias Teschner. Computer Science Department University of Freiburg

Simulation in Computer Graphics. Deformable Objects. Matthias Teschner. Computer Science Department University of Freiburg Simulation in Computer Graphics Deformable Objects Matthias Teschner Computer Science Department University of Freiburg Outline introduction forces performance collision handling visualization University

More information

Effect of memory latency

Effect of memory latency CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable

More information

Lecture 15: More Iterative Ideas

Lecture 15: More Iterative Ideas Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!

More information

Figure 6.1: Truss topology optimization diagram.

Figure 6.1: Truss topology optimization diagram. 6 Implementation 6.1 Outline This chapter shows the implementation details to optimize the truss, obtained in the ground structure approach, according to the formulation presented in previous chapters.

More information

Advanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm

Advanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm Advanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm Andreas Sandberg 1 Introduction The purpose of this lab is to: apply what you have learned so

More information

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information