Performance Analysis of a Matrix Diagonalization Algorithm with Jacobi Method on a Multicore Architecture


Victoria Sanz 1,2, Armando De Giusti 1,3,4, Marcelo Naiouf 1,4

1 III-LIDI, Facultad de Informática, Universidad Nacional de La Plata, La Plata, Buenos Aires, Argentina. 2 CONICET PhD Scholar. 3 CONICET Main Researcher. 4 Full-time Chair Professor.

Abstract - In this paper, a performance analysis of a matrix diagonalization algorithm with the Jacobi method on a multicore architecture is presented. First, a block-based sequential algorithm was implemented using the CBLAS library, which improves on the classic sequential algorithms; then, an algorithm for the parallel resolution of the problem with the shared-memory programming tool OpenMP is studied. Next, the experimental work is shown and an improvement in response time is observed as more threads/cores are used. Finally, a performance analysis (speedup and efficiency) as the dimensions of the input matrix and the number of threads/cores increase is presented.

Keywords: multicore architecture, parallel programming, matrix diagonalization, Jacobi, OpenMP.

1 Introduction

The demand for computational power by a large number of scientific applications has increased so much that the use of parallel platforms for their resolution has become essential. Even though single-core cluster architectures have become usual due to their cost/performance ratio versus large shared-memory multiprocessor systems, nowadays machines with more than one processor are quite common. Multicore architectures appear as a response to the limitations for increasing speed in single-core processors due to thermal and energy issues. A multicore processor integrates two or more cores in a single chip, which means that applications have to be adapted to exploit the thread-level parallelism provided by this architecture [1]. Similarly, several multicore machines can be connected through a network, which opens the possibility of having clusters with a large number of processors.

The Jacobi method to diagonalize symmetric matrices has applications in fields such as biometrics, artificial vision, digital signal processing, and so on [2][3][4]. As the volume of input data increases, the required computation time increases significantly. The combination of linear algebra libraries optimized for the underlying architecture, the power provided by a multicore machine, and a parallel programming tool suited to such an architecture allows execution time to be reduced.

Two of the main aspects in the performance analysis of a parallel system are the speedup factor (Sp) [5][6] and the efficiency (E), which relates the speedup to the number of processors (P) used [7]. Scalability is a third, very significant factor in parallel applications: problems usually scale, i.e., the volume of work to be done increases, and the architectures used can also scale by increasing the number of processors. The effect of scaling the workload and/or the processors on the performance of parallel algorithms, considering Sp and E, is of interest. A system is said to be scalable if it can maintain a constant efficiency with increasing work and processors [5][8].

The purpose of this work is to implement a parallel algorithm to diagonalize matrices with the Jacobi method, using the shared-memory programming tool OpenMP [9] and the CBLAS linear algebra library [10], in order to exploit the computation power of a multicore machine.
Lastly, experimental tests are presented and an analysis of the performance obtained as the size of the matrix and the number of cores scale is carried out.

This paper is organized as follows: Section 2 includes a description of the Jacobi method for real symmetric matrices; Section 3 presents various implementations of the classic sequential algorithm and two implementations of the block-based algorithm (one using the CBLAS library), and proposes the parallelization of the block-based algorithm using OpenMP; Section 4 includes a comparative analysis of the execution times of the various sequential implementations with matrices of different sizes, concluding that the block-based algorithm that uses CBLAS performs better, and it presents experimental evidence and an analysis of the performance (speedup, efficiency) obtained with the parallel algorithm as the input matrix size and the number of threads/cores used are increased.

2 Description of Jacobi Diagonalization Method

The problem of diagonalizing a symmetric square matrix S consists in finding an orthogonal matrix X that reduces S to diagonal form. The Jacobi method allows finding a matrix X such that X^T*S*X = D. The elements in the diagonal of matrix D are the eigenvalues of S, and the columns of matrix X are the eigenvectors of S.

2.1 Specific Case: Diagonalization of a 2x2 Symmetric Matrix

If S is a 2x2 symmetric matrix (Fig. 1.a), it can be diagonalized by means of the orthogonal matrix X shown in Fig. 1.b. This matrix is known as a rotation matrix.

Fig. 1. a) 2x2 symmetric matrix S. b) Orthogonal (rotation) matrix X.

Since the purpose of the method is obtaining D, the values of cosα and sinα must be such that the elements outside the diagonal of D are cancelled out. These values are calculated as shown in Fig. 2.

Fig. 2. Calculation of the values sinα and cosα.
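
For reference, one standard, numerically stable way of obtaining these values for the 2x2 case is sketched below; the function name and interface are illustrative assumptions and not necessarily the exact expressions of Fig. 2:

    #include <math.h>

    /* Sketch: rotation angle that cancels S_{p,q} in a 2x2 symmetric block.
       One common formulation; not necessarily the exact expression of Fig. 2. */
    void rotation_angle(double spp, double spq, double sqq, double *c, double *s)
    {
        if (spq == 0.0) {               /* already diagonal: identity rotation */
            *c = 1.0;
            *s = 0.0;
            return;
        }
        double theta = (sqq - spp) / (2.0 * spq);
        double t = (theta >= 0.0 ? 1.0 : -1.0)
                   / (fabs(theta) + sqrt(theta * theta + 1.0));
        *c = 1.0 / sqrt(t * t + 1.0);   /* cos(alpha) */
        *s = t * (*c);                  /* sin(alpha) */
    }

With these values, the rotation built from cosα and sinα cancels the off-diagonal element when applied as X^T*S*X.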

2.2 General Case: Diagonalization of an nxn Symmetric Matrix

The general method [11] keeps a matrix X, initialized as the identity matrix, where the eigenvectors of S will remain, and in each stage performs a series of rotations that update S and X until the maximum element in the upper triangle of S is lower than a selected threshold. In each stage, for each element S_{p,q} in the upper triangle, a rotation that cancels out that element is performed. The rotation matrix J (Fig. 3) has the values cosα, sinα, -sinα and cosα at positions J_{p,p}, J_{p,q}, J_{q,p} and J_{q,q}, respectively; it also has a value of 1 on the rest of its diagonal, and 0 in all remaining elements.

Fig. 3. Rotation matrix J for the general diagonalization case.

The steps to follow for each rotation are:

1. Calculate the elements S_{p,p} and S_{q,q} as follows:
   S_{p,p} = S_{p,p}*cos²α - 2*S_{p,q}*sinα*cosα + S_{q,q}*sin²α
   S_{q,q} = S_{p,p}*sin²α + 2*S_{p,q}*sinα*cosα + S_{q,q}*cos²α
   Notice that S_{p,p}, S_{p,q} and S_{q,q} on the right-hand side are the values in S before this step. This notation will be used from here on.
2. Cancel out S_{p,q} and S_{q,p}.
3. Replace the values in matrix S with the result of the operation J^T*S*J (omitting the positions updated in the previous steps).
4. Replace the values in matrix X with the result of the operation X*J.

Thus, matrices J_i^T and J_i are multiplied by S until the diagonal form is reached. Matrix X is the result of the product of the J_i matrices, and X^T is the result of the product of the J_i^T matrices (Fig. 4).

Fig. 4. Multiplication of S and the rotation matrices to obtain the diagonal matrix.

3 Implementations of the Algorithm for Jacobi Diagonalization Method

3.1 Classic Sequential Algorithm

The algorithm that solves the diagonalization problem does not need to store matrix J. The updates carried out on S and X during steps 3 and 4 of the method described in Section 2.2 are replaced by the following operations:

   S_{i,p} = S_{p,i} = S_{i,p}*cosα - S_{i,q}*sinα   for i=1..n, i≠p, i≠q
   S_{i,q} = S_{q,i} = S_{i,q}*cosα + S_{i,p}*sinα   for i=1..n, i≠p, i≠q
   X_{i,p} = X_{i,p}*cosα - X_{i,q}*sinα             for i=1..n
   X_{i,q} = X_{i,q}*cosα + X_{i,p}*sinα             for i=1..n

That is, during step 3, rows p and q and columns p and q in matrix S are updated, while during step 4, columns p and q in matrix X are updated.
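
A minimal sketch of these in-place updates is shown below, assuming full row-major storage of S and X and 0-based indexes (the function name is illustrative, not the paper's code):

    /* Sketch of one classic Jacobi rotation over full (row-major) n x n matrices.
       c = cos(alpha), s = sin(alpha) for the pair (p, q); indexes are 0-based. */
    void apply_rotation(double *S, double *X, int n, int p, int q, double c, double s)
    {
        for (int i = 0; i < n; i++) {
            if (i != p && i != q) {
                double sip = S[i * n + p], siq = S[i * n + q];   /* values before the step */
                S[i * n + p] = S[p * n + i] = sip * c - siq * s;
                S[i * n + q] = S[q * n + i] = siq * c + sip * s;
            }
            double xip = X[i * n + p], xiq = X[i * n + q];
            X[i * n + p] = xip * c - xiq * s;
            X[i * n + q] = xiq * c + xip * s;
        }
        /* The diagonal elements S_{p,p}, S_{q,q} and the cancelled S_{p,q}, S_{q,p}
           are updated separately (steps 1 and 2 of Section 2.2). */
    }

Temporary variables keep the values prior to the update, so that both formulas of each pair use the old values, as required by the notation introduced in Section 2.2.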

Two optimizations can be carried out in relation to the storage of matrices S and X in memory:

- Since S is a symmetric matrix, it is possible to store only n*(n+1)/2 elements, n being the row and column dimension of S, which allows avoiding unnecessary operations due to symmetry.
- Since X is always accessed through its columns, a better performance is obtained if its elements are stored in memory in column-major order. This is because every time memory is accessed to look for an element of X, a block of elements that will be used in the short term is moved to cache memory. This is known as optimization by spatial locality.

3.2 Block-Based Sequential Algorithm

The development of new architectures requires the implementation of new techniques so that algorithms adapt to the underlying platform and thus obtain a better performance. The classic sequential algorithm described in Section 3.1 can be adapted to use optimized linear algebra libraries. In order to use the matrix operations available in the Level 3 CBLAS library, a sequential algorithm was implemented for the diagonalization of matrices using the block-based Jacobi method, which will be the basis for the parallel algorithm; for this reason, matrix S is stored in full.

This algorithm considers that the dimension of matrix S is n = N*r (Fig. 5), where N is the number of blocks in each dimension and r is the dimension of each block. The algorithm is similar to the classic one. Each element S_{P,Q}, with P,Q = 0..N-1, denotes a block. Within a block, the elements are referenced with indexes (p,q), where p,q = 0..r-1 [12].

Fig. 5. Matrix S, whose dimension is n = N*r.

The method keeps a matrix X, initialized as the identity matrix and organized by blocks the same as S, where the eigenvectors of S will remain. Each stage of the algorithm performs a series of rotations that update S and X, until the maximum element in the upper triangle of S is lower than a previously selected threshold. The operations in any given stage consist in cancelling out each block S_{P,Q} in the upper block triangle of S. The steps to follow for each rotation are:

1. Calculate Jacobi with the classic method over the 2x2-block matrix formed by S_{P,P}, S_{P,Q}, S_{Q,P} and S_{Q,Q}. The eigenvectors are left in a 2x2-block matrix called auxX.
2. Calculate the transpose of auxX, transpX.
3. Apply rotations to matrix S:
   a. S_{I,P} = S_{I,P}*auxX_{0,0} + S_{I,Q}*auxX_{1,0}         for I=0..N-1, I≠P, I≠Q
   b. S_{I,Q} = S_{I,P}*auxX_{0,1} + S_{I,Q}*auxX_{1,1}         for I=0..N-1, I≠P, I≠Q
   c. S_{P,I} = transpX_{0,0}*S_{P,I} + transpX_{0,1}*S_{Q,I}   for I=0..N-1, I≠P, I≠Q
   d. S_{Q,I} = transpX_{1,0}*S_{P,I} + transpX_{1,1}*S_{Q,I}   for I=0..N-1, I≠P, I≠Q
4. Apply rotations to matrix X:
   a. X_{I,P} = X_{I,P}*auxX_{0,0} + X_{I,Q}*auxX_{1,0}         for I=0..N-1
   b. X_{I,Q} = X_{I,P}*auxX_{0,1} + X_{I,Q}*auxX_{1,1}         for I=0..N-1

Implementation with CBLAS. In the block-based algorithm, the updates to matrices S and X (steps 3.a, 3.b, 3.c, 3.d, 4.a and 4.b) involve a sequence of algebraic operations over rxr blocks, as well as copies of blocks to keep auxiliary data, operations that are not carried out in the classic algorithm. The use of existing linear algebra libraries allows performing these matrix operations (Level 3 CBLAS) with the best performance for a given architecture [10]. The function cblas_dgemm was used to reduce block multiplication and addition time, and cblas_dcopy was used to optimize block copying.
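
As an illustration, the sketch below shows how one of these block updates (steps 3.a and 3.b for a single block row I) can be expressed with cblas_dgemm and cblas_dcopy; the block layout assumed here (each rxr block stored contiguously in row-major order) and the function name are assumptions, not the paper's actual code:

    #include <cblas.h>

    /* Sketch: update the pair of blocks S_{I,P}, S_{I,Q} with the rxr blocks of auxX. */
    void update_block_pair(double *S_IP, double *S_IQ,
                           const double *aux00, const double *aux01,
                           const double *aux10, const double *aux11,
                           double *tmpP, double *tmpQ, int r)
    {
        /* Keep the old values of both blocks, since each new block uses both. */
        cblas_dcopy(r * r, S_IP, 1, tmpP, 1);
        cblas_dcopy(r * r, S_IQ, 1, tmpQ, 1);

        /* S_{I,P} = tmpP * aux00 + tmpQ * aux10 */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, r, r, r,
                    1.0, tmpP, r, aux00, r, 0.0, S_IP, r);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, r, r, r,
                    1.0, tmpQ, r, aux10, r, 1.0, S_IP, r);

        /* S_{I,Q} = tmpP * aux01 + tmpQ * aux11 */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, r, r, r,
                    1.0, tmpP, r, aux01, r, 0.0, S_IQ, r);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, r, r, r,
                    1.0, tmpQ, r, aux11, r, 1.0, S_IQ, r);
    }

The copies are needed because the new S_{I,P} and S_{I,Q} both depend on the old values of the two blocks.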

3.3 Block-Based Parallel Algorithm with CBLAS

From Section 3.2, it can be observed that a rotation carried out to cancel out a block S_{P,Q} updates the block rows and columns with index P and Q in S, and the block columns P and Q in X. Also, if we have two coordinate pairs (P,Q) and (P',Q') whose indexes are all different, then J_{P,Q}*J_{P',Q'} = J_{P',Q'}*J_{P,Q}. Based on these two premises, it is concluded that two or more positions can be cancelled out in parallel, provided that their indexes are disjoint [13][14].

The parallelization strategy consists in carrying out the rotations of each algorithm stage simultaneously. The number of parallel tasks to be run in each stage is N/2, since their coordinates must be disjoint, and their indexes are calculated following the Chess Tournament strategy [15]. If there are more tasks than threads, the tasks are distributed equally among the threads. A task whose coordinate is (P,Q) consists in calculating, independently, the eigenvectors and eigenvalues of the 2x2-block submatrix formed by S_{P,P}, S_{P,Q}, S_{Q,P} and S_{Q,Q} with the classic sequential algorithm. Then, matrices S and X are updated as follows:

- Rotations in S: since there may be several threads in this update stage, they must wait before starting (barrier synchronization); then, each thread updates its columns (possible because each thread holds coordinates whose indexes are disjoint). Then, after a second barrier synchronization, the rows of S are updated. The first synchronization prevents a clash between delayed threads that are still updating rows of S and other threads that are about to update the columns corresponding to the next tasks. The second synchronization prevents conflicts in case one thread is still modifying its columns while another thread is already updating its rows.
- Rotations in X: these rotations modify only columns and, since all tasks have different coordinates, they can be done in parallel.

After carrying out these rotations, the threads can move on to their following task, and so forth until the current stage is finished. Next, the maximum of the upper triangle of matrix S is calculated in parallel, and then one of the threads calculates the set of indexes for the parallel tasks of the next stage (following the Chess Tournament strategy). This last step must be done sequentially because of data dependencies. Finally, each thread independently verifies the termination condition using the maximum value of S obtained before and, based on the result, the algorithm ends or a new stage is started.
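
The Chess Tournament (round-robin) strategy can be sketched as follows; this is an illustrative implementation for even N, not the authors' code. Each call produces one group of N/2 pairwise-disjoint coordinates, and successive rounds cover all block pairs:

    /* Sketch of the round-robin ("Chess Tournament") pairing used to pick
       N/2 disjoint (P,Q) block coordinates for one round. Assumes N is even. */
    void chess_tournament_round(int N, int round, int *P, int *Q)
    {
        for (int i = 0; i < N / 2; i++) {
            /* Player 0 stays fixed; the remaining N-1 players rotate by 'round'. */
            int a = (i == 0) ? 0 : 1 + (i - 1 + round) % (N - 1);
            int b = 1 + (N - 2 - i + round) % (N - 1);
            if (a > b) { int t = a; a = b; b = t; }
            P[i] = a;
            Q[i] = b;
        }
    }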

Implementation with OpenMP. OpenMP [9] is an API for the C, C++ and Fortran languages that allows writing and running shared-memory parallel programs, and offers the possibility of creating a set of threads that work concurrently to exploit the advantages of multicore architectures. The implementation of the parallel Jacobi algorithm with OpenMP loads the input matrix S and initializes matrix X sequentially. Next, a set of threads is created, as many as there are cores available in the multicore architecture, and they initialize in parallel the structure with the indexes of the tasks to be run in the first stage. In each stage of the algorithm, each thread takes a consecutive set of parallel task indexes and follows the steps described in Section 3.3. When all the tasks in a stage have been executed, the threads calculate in parallel the maximum of the upper triangle of S by applying a reduction operation. Then, one of them updates the set of parallel task indexes for the next stage, following the Chess Tournament strategy.
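
A minimal OpenMP sketch of this structure is shown below. The helper functions (solve_2x2_block, rotate_cols_S, rotate_rows_S, rotate_cols_X, chess_tournament_round, max_upper_triangle_row) are hypothetical names rather than the paper's code, and a stage is modelled here, for simplicity, as N-1 successive groups of N/2 disjoint tasks:

    #include <omp.h>

    /* Hypothetical helpers, not part of the paper's code: */
    void solve_2x2_block(double *S, double *X, int P, int Q);  /* classic Jacobi on the 2x2-block submatrix */
    void rotate_cols_S(double *S, int P, int Q);
    void rotate_rows_S(double *S, int P, int Q);
    void rotate_cols_X(double *X, int P, int Q);
    void chess_tournament_round(int N, int round, int *P, int *Q);
    double max_upper_triangle_row(const double *S, int n, int i);

    /* One stage of the parallel block-based Jacobi: groups of N/2 disjoint
       tasks, barriers between the column and row updates of S, a parallel
       reduction for the maximum of the upper triangle, and a sequential
       update of the task indexes. Returns the maximum off-diagonal value. */
    double jacobi_stage(double *S, double *X, int n, int N, int *Pv, int *Qv)
    {
        double max_off = 0.0;
        #pragma omp parallel shared(S, X, Pv, Qv)
        {
            for (int round = 0; round < N - 1; round++) {
                #pragma omp for
                for (int t = 0; t < N / 2; t++)
                    solve_2x2_block(S, X, Pv[t], Qv[t]);
                #pragma omp for   /* implicit barrier: previous phase done before columns start */
                for (int t = 0; t < N / 2; t++)
                    rotate_cols_S(S, Pv[t], Qv[t]);
                #pragma omp for   /* implicit barrier: columns done before rows start */
                for (int t = 0; t < N / 2; t++)
                    rotate_rows_S(S, Pv[t], Qv[t]);
                #pragma omp for   /* column rotations in X touch disjoint columns */
                for (int t = 0; t < N / 2; t++)
                    rotate_cols_X(X, Pv[t], Qv[t]);
                #pragma omp single   /* sequential: indexes of the next group of tasks */
                chess_tournament_round(N, round + 1, Pv, Qv);
            }
            #pragma omp for reduction(max : max_off)   /* parallel maximum of the upper triangle */
            for (int i = 0; i < n; i++) {
                double m = max_upper_triangle_row(S, n, i);
                if (m > max_off) max_off = m;
            }
        }
        return max_off;   /* compared against the threshold to decide whether to start a new stage */
    }

The implicit barriers at the end of the omp for constructs play the role of the two synchronizations described in Section 3.3, and the single construct isolates the sequential update of the task indexes; the reduction(max:...) clause requires OpenMP 3.1 or later.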

4 Results

Tests were carried out on a machine with two Intel quad-core Xeon E5405 processors at 2.0 GHz. Each core has a 32-KB L1 data cache and a 32-KB L1 instruction cache, and each pair of cores shares a 6-MB L2 cache. The available memory is 10 GB of RAM. All times were measured with the function omp_get_wtime from the omp.h library. The tests compared in Sections 4.1, 4.2 and 4.3 were carried out over matrices with identical data and identical precision (0.0001), since convergence times depend on the input data and the precision selected. The same holds for Section 4.4, which assesses the scalability of the parallel algorithm.

4.1 Classic Sequential Algorithm

Four versions with different methods for storing matrix S and matrix X were implemented:

- Version 1: stores the entire S by rows and X by rows.
- Version 2: stores the entire S by rows and X by columns (optimization by spatial locality in X).
- Version 3: stores only the upper triangle of S, and X by rows (storage optimization for S and a lower number of operations due to symmetry).
- Version 4: stores only the upper triangle of S, and X by columns.

Table 1 shows the execution times for each version of the classic algorithm, for matrices with n = 100, 200, 300, 400, 500, 600, 700, 800, 900 and 1000. Versions 2 and 3 improve the times of version 1, the optimization implemented in version 3 being the one with the greatest impact. For this reason, when both optimizations are incorporated in version 4, execution times are reduced even more.

Table 1. Sequential algorithm times with classic Jacobi for various optimizations (columns: Dimension / Version 1 / Version 2 / Version 3 / Version 4).

4.2 Block-Based Sequential Algorithm (without CBLAS)

For the tests with the block-based sequential algorithms, the same matrices from Section 4.1 were used, but their data were organized in consecutive blocks, whose possible sizes depend on the size of the matrix.

The block-based algorithm that does not use CBLAS did not improve the times obtained by version 4 of the classic algorithm. As the tests included in Table 2 show, the best response time is obtained with the maximum possible block size. In those cases, the block-based algorithm behaves very much like the classic algorithm, since it carries out only one stage: it obtains the eigenvalues and eigenvectors of the 2x2-block matrix formed by S_{0,0}, S_{0,1}, S_{1,0} and S_{1,1} by applying the classic algorithm, thereby obtaining the eigenvalues and eigenvectors of the entire matrix. The reason the block-based algorithm without CBLAS did not improve response times is the extra time spent multiplying and copying blocks; as the block size decreases, a larger number of these operations is required.

Table 2. Times obtained with the block-based sequential algorithm without CBLAS (columns: Dimension / Block Size).

4.3 Block-Based Sequential Algorithm with CBLAS

The block-based sequential algorithm that uses the CBLAS library improved the response times of the sequential algorithms assessed in the previous sections. In most cases, the best time was obtained with a block size of 10, as shown in Table 3.

To analyze the cause of this improvement, the number of stages carried out by the two versions of the block-based algorithm and the average time needed to perform a rotation were measured. These tests were done for the matrix of size 1000 with block sizes 10 and 500. Both algorithms performed a single stage for block size 500, and 10 stages when using block size 10. This shows that, when the block size is reduced, the algorithm needs more iterations to converge. However, the average time needed for each rotation is different. Taking into account that in each stage N*(N+1)/2 rotations are done, it is observed that:

- In the case of the block-based algorithm without CBLAS with block size 10, the average rotation time multiplied by the 5050 rotations of each stage gives approximately 56.4 seconds per stage, which explains the time measured during the test. With block size 500, on the other hand, there is only one stage with a single rotation, so the average rotation time is similar to the final time obtained, and also to the time obtained with version 1 of the classic algorithm.
- In the case of the block-based algorithm with CBLAS and block size 10, the average rotation time is much lower; multiplied by the 5050 rotations of each stage, it again explains the time measured during the test. When using block size 500, there is again a single stage with one rotation.

The improvement achieved with the block-based algorithm that uses CBLAS is due to the use of the functions cblas_dgemm and cblas_dcopy when calculating the rotations. The block size that optimizes times must be such that calculating the classic Jacobi over a 2x2-block matrix is not too expensive; otherwise, times similar to those of version 1 of the classic algorithm are obtained. The right balance must be found, since very small blocks reduce the average rotation time but increase the number of rotations. For example, for the block-based algorithm with CBLAS, the test with the matrix of size 1000 and block size 5 carried out 10 stages with N*(N+1)/2 = 20100 rotations in each stage; each stage lasted 24.3 seconds on average, which ended up increasing the total time.

Table 3. Times obtained with the block-based sequential algorithm with CBLAS (columns: Dimension / Block Size).

4.4 Block-Based Parallel Algorithm with CBLAS

Since the scalability analysis was carried out with 2, 4 and 8 threads/cores, matrices whose size is a power of 2 were used, so that the tasks were distributed proportionally among the threads. Table 4 shows the execution times of the block-based sequential algorithm that uses CBLAS for those matrices, varying the block size within the possible values; the test that optimized the execution time for each matrix size is highlighted.

Table 4. Times obtained with the block-based sequential algorithm that uses CBLAS for matrices whose dimension n is a power of 2 (columns: Matrix Dimension / Block Size).

Table 5 shows the speedup (Sp = Sequential Time / Parallel Time) and the efficiency (E = Sp / Number of processors) obtained in the parallel tests. These results show that, if the matrix size is kept the same and the number of processors is increased, the speedup obtained is better, that is, the problem is solved in less time. This improvement, however, does not keep a constant efficiency. The decrease in efficiency is due to the overhead of thread creation, synchronization (barriers), and the sequential portion of the problem (the structure with the indexes of the parallel tasks must be updated in each stage, following the Chess Tournament strategy, in a sequential manner). When scaling the problem and keeping the same number of processors, as shown in Table 5 and Fig. 6, both speedup and efficiency improve in general, since the overhead mentioned above becomes less significant within the total processing time.

Table 5. Speedup and efficiency values as matrix size and the number of threads/cores are increased (rows: Dimension (r) / Parallel Tasks).
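
As a small illustration of how these figures are derived from the measured times (using omp_get_wtime, as in the tests above), assuming hypothetical entry points run_sequential() and run_parallel():

    #include <stdio.h>
    #include <omp.h>

    /* Hypothetical entry points for the CBLAS block-based solvers: */
    void run_sequential(void);
    void run_parallel(int threads);

    int main(void)
    {
        int threads = 8;                     /* e.g. 2, 4 or 8 cores */
        double t0 = omp_get_wtime();
        run_sequential();
        double t_seq = omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        run_parallel(threads);
        double t_par = omp_get_wtime() - t0;

        double sp = t_seq / t_par;           /* Sp = Sequential Time / Parallel Time */
        double e  = sp / threads;            /* E  = Sp / Number of processors */
        printf("Sp = %.2f  E = %.2f\n", sp, e);
        return 0;
    }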

Fig. 6. a) Speedup as the matrix size and the number of threads/cores are increased. b) Efficiency as the matrix size and the number of threads/cores are increased.

The behavior described is typical of a scalable parallel system, where efficiency can be maintained at a constant value by simultaneously increasing the number of cores/processors and the size of the problem.

5 Conclusions and Future Work

Various sequential algorithms for the resolution of the matrix diagonalization problem were presented, and a parallel implementation that exploits the power offered by multicore architectures was introduced. The resulting execution times were analyzed for each implementation, and it was observed that the best performance corresponds to the algorithms that use libraries optimized for linear algebra (Level 3 CBLAS). It was also observed that performance improves when the algorithm is parallelized on a multicore architecture.

In recent years, GPUs (Graphics Processing Units) [16] have gained significance due to the high performance achieved in general-purpose applications. One future line of work is the migration of the Jacobi diagonalization algorithm to run on a GPU, followed by a systematic study of the performance achieved as the size of the problem and the number of threads are increased. An energy consumption analysis for the execution of this parallel algorithm on various multicore architectures is also proposed [17].

6 References

[1] Olukotun K., Hammond L., Laudon J. Chip Multiprocessor Architecture. Synthesis Lectures on Computer Science, Morgan & Claypool.
[2] Turk M., Pentland A. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, Vol. 3, No. 1.
[3] Bravo Muñoz I. Arquitectura basada en FPGAs para la detección de objetos en movimiento, utilizando visión computacional y técnicas PCA. Doctoral Dissertation, Universidad de Alcalá.
[4] Brent R. Parallel Algorithms for Digital Signal Processing. In: Numerical Linear Algebra, Digital Signal Processing and Parallel Algorithms. Springer, Heidelberg.
[5] Grama A., Gupta A., Karypis G., Kumar V. Introduction to Parallel Computing. Second Edition. Addison Wesley.
[6] Leopold C. Parallel and Distributed Computing: A Survey of Models, Paradigms, and Approaches. Wiley.
[7] Quinn M. J. Parallel Computing: Theory and Practice. McGraw-Hill.
[8] Hwang K. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill.
[9] The OpenMP API specification.
[10] BLAS (Basic Linear Algebra Subprograms).
[11] Rutishauser H. The Jacobi Method for Real Symmetric Matrices. Numer. Math. 9, 1-10.
[12] Hansen E. On Jacobi Methods and Block-Jacobi Methods for Computing Matrix Eigenvalues. Doctoral Dissertation, Stanford; 1960.
[13] Sameh A. On Jacobi and Jacobi-like Algorithms for a Parallel Computer. Mathematics of Computation, Vol. 25, No. 115.
[14] Shroff G. A Parallel Algorithm for the Eigenvalues and Eigenvectors of a General Complex Matrix. Research Institute for Advanced Computer Science, NASA Ames Research Center, Technical Report; 1989.
[15] Bischof C., Van Loan C. Computing the Singular Value Decomposition on a Ring of Array Processors. Proceedings of the IBM Europe Institute Workshop on Large Scale Eigenvalue Problems, Vol. 127, pp. 51-66.
[16] General-Purpose Computation on Graphics Hardware.
[17] Wu-chun Feng, Xizhou Feng, Rong Ge. Green Supercomputing Comes of Age. IT Professional, Vol. 10, No. 1, pp. 17-23; 2008.
