
On Improving the Performance of Sparse Matrix-Vector Multiplication

James B. White, III, Ohio Supercomputer Center, Columbus, OH
P. Sadayappan, Ohio State University, Columbus, OH
(Supported in part by the National Science Foundation under grants DMR and CDA.)

Abstract

We analyze single-node performance of sparse matrix-vector multiplication by investigating issues of data locality and fine-grained parallelism. We examine the data-locality characteristics of the compressed-sparse-row representation and consider improvements in locality through matrix permutation. Motivated by potential improvements in fine-grained parallelism, we evaluate modified sparse-matrix representations. The results lead to general conclusions about improving single-node performance of sparse matrix-vector multiplication in parallel libraries of sparse iterative solvers.

1 Introduction

One of the core operations of iterative sparse solvers is sparse matrix-vector multiplication. To achieve high performance, a parallel implementation of sparse matrix-vector multiplication must maintain scalability. This scalability comes from a balanced mapping of the matrix and vectors among the distributed processors, a mapping that minimizes interprocessor communication. Load balancing and minimizing communication do not guarantee high performance, however; high single-node performance is also essential.

In this work, we extend the analysis of the mapping of matrices and vectors beyond the global multiple-processor distribution and consider the memory hierarchy and architecture of a single processor. The local representation of the matrix and vectors can affect performance in two significant ways: through its impact on data-cache locality and through its effect on fine-grained parallelism for pipelined, superscalar execution.

The de-facto-standard storage format for unstructured sparse matrices is the Compressed-Sparse-Row format (CSR). For a parallel implementation, the rows of a sparse matrix can be distributed among the processors, with the rows local to a processor stored in CSR. Matrix-vector multiplication is then performed using an "owner-computes" strategy. We therefore use CSR as the starting point for our analysis.

2 Compressed-Sparse-Row Format

The idea behind CSR is to pack each row, storing only the nonzero elements (Figure 1). Each row may have a different structure, so the column index of each nonzero element must also be stored. Since each row may have an arbitrary number of nonzero elements, a 2-D array is not an appropriate container for the compressed matrix. Instead, the nonzero elements and the column indices are stored in 1-D arrays, and a third array stores the index of the start of each "row" within these two arrays.

[Figure 1: Compressed-sparse-row format (CSR). Three arrays hold the start index of each row (i), the column index of each nonzero (j), and the nonzero matrix values (a).]
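As a concrete illustration of these three arrays, the following sketch (not from the paper; the 4x4 matrix is invented for illustration, and the array names follow the i, j, a naming used in Figure 2) prints a small matrix from its CSR representation.

    #include <stdio.h>

    /* A 4x4 example matrix (invented for illustration):
     *     [ 1 0 0 2 ]
     *     [ 0 3 0 0 ]
     *     [ 4 5 6 0 ]
     *     [ 0 0 0 7 ]
     * CSR stores only the nonzeros, row by row. */
    int main(void)
    {
        int    n   = 4;
        int    i[] = { 0, 2, 3, 6, 7 };       /* start of each row in a[] and j[]; i[n] = nnz */
        int    j[] = { 0, 3, 1, 0, 1, 2, 3 }; /* column index of each nonzero */
        double a[] = { 1, 2, 3, 4, 5, 6, 7 }; /* the nonzero values themselves */

        for (int k = 0; k < n; k++) {         /* walk the packed rows */
            printf("row %d:", k);
            for (int l = i[k]; l < i[k+1]; l++)
                printf("  %g (col %d)", a[l], j[l]);
            printf("\n");
        }
        return 0;
    }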

With a sparse matrix in CSR, matrix-vector multiplication can be implemented with a simple nested pair of loops; a C implementation is shown in Figure 2.

    for (k = 0; k < n; k++) {
        temp = 0.0;
        for (l = i[k]; l < i[k+1]; l++)
            temp += a[l] * x[j[l]];
        y[k] = temp;
    }

Figure 2: Sample C code for CSR sparse matrix-vector multiplication.

The outer loop is over the rows of the matrix, and the inner loop is over the elements of a particular row, as defined by the row-index array, i[0..n]. The target vector is y[0..n-1], and the source vector is x[0..n-1]. The nonzero elements of the matrix are stored in the value array, a[0..i[n]-1], and the column indices of the nonzero elements are stored in the column-index array, j[0..i[n]-1].

3 Data Locality

To analyze the single-node performance characteristics of CSR, we first consider data-locality characteristics. Start with the value array, a[l], and the column-index array, j[l]. For a given matrix-vector multiplication, each element of these arrays is used only once, so only spatial locality is possible. Since all the elements of each array are used consecutively, this spatial locality is achieved.

Next consider the target vector, y[k]. Here, each element is used a number of times, so both spatial locality and temporal locality are possible. Again, all the elements are used consecutively, so spatial locality is achieved. In addition, all of the multiple accesses of a particular element are performed consecutively, so temporal locality is also achieved, potentially in the form of CPU-register locality.

Problems with locality finally appear in the source vector, x[j[l]]. Since the vector is accessed through an index array, the access pattern is noncontiguous. Though there is potential for spatial and temporal locality, whether this locality is achieved depends on the particular structure of the sparse matrix. A sparse matrix with a relatively random structure, for example, would yield relatively poor memory-access performance.

For a particular sparse matrix, it may be possible to improve performance by modifying the structure of the matrix. For parallel sparse matrix-vector multiplication on distributed-memory multicomputers, the amount of communication is significantly affected by the way the rows of the matrix are partitioned among the processors. It may be possible to use the same partitioning strategy to more finely partition a sparse matrix on a single node to enhance data locality: the matrix can be renumbered to make elements within a cache-sized partition contiguous (a sketch of such a renumbering appears below). With a partitioning scheme that minimizes references between partitions, most references would be to elements nearby in memory, resulting in enhanced locality.

We tested the effectiveness of reordering via partitioning on a number of sample sparse matrices using METIS [3], a popular partitioning program from the University of Minnesota. The tests were performed on a single node of the Convex Exemplar SPP-1200 at the Ohio Supercomputer Center (Table 1). The SPP-1200 includes hardware that collects cache-usage statistics, and these statistics can be analyzed with the Convex Performance Analyzer. Cache hit rates and execution times were recorded for each matrix before and after partitioning by METIS.

Table 1: System specifications.

    System             Convex Exemplar SPP-1200
    Processor          120 MHz HP PA-RISC 7200
    Data cache         256 KB
    Operating system   SPP-UX 4.2
    C compiler         Convex C 6.5
    Compiler options   -O2

The performance results for matrix-vector multiplication for a grid-based test matrix are given in Table 2. The test matrix is much larger than the SPP-1200's 256 KB data cache. The results show that reordering by METIS did not improve the cache hit rate or the execution time over a wide range of partition sizes.

Table 2: The effects of METIS partitions on performance for a 2-D-grid, 5-point-stencil matrix on a single node of the SPP-1200, giving performance (MFLOPS) and cache hit rate (%) for the original and the randomly reordered matrix over a range of partition counts.
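The renumbering described above is a symmetric permutation of the rows and columns of the matrix. The following is a minimal sketch, not the authors' code, of how such a permutation might be applied to a CSR matrix; it assumes a caller-supplied array perm[] (for instance, obtained by listing the rows partition by partition) with perm[k] giving the original index of the row that becomes row k, and output arrays preallocated by the caller.

    #include <stdlib.h>

    /* Symmetrically permute an n x n CSR matrix (i, j, a) into (pi, pj, pa).
     * perm[k] = original row/column index that becomes index k.
     * Column indices within a permuted row are not re-sorted; the CSR
     * matrix-vector product does not require sorted columns. */
    void permute_csr(int n, const int *i, const int *j, const double *a,
                     const int *perm, int *pi, int *pj, double *pa)
    {
        int *inv = malloc(n * sizeof(int));   /* inv[old] = new index of 'old' */
        for (int k = 0; k < n; k++)
            inv[perm[k]] = k;

        pi[0] = 0;
        for (int k = 0; k < n; k++) {
            int old = perm[k];                /* original row that becomes row k */
            int len = i[old+1] - i[old];
            for (int l = 0; l < len; l++) {
                pa[pi[k] + l] = a[i[old] + l];
                pj[pi[k] + l] = inv[j[i[old] + l]];  /* renumber column indices */
            }
            pi[k+1] = pi[k] + len;
        }
        free(inv);
    }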
We also tested large matrices from the Harwell-Boeing collection [1]; the results are summarized in Table 3. The original matrices gave cache hit rates between 90% and 98%, and for no matrix did reordering by METIS improve the hit rate or the execution time.

Table 3: Large Harwell-Boeing matrices tested on a single node of the SPP-1200: performance and cache hit rates before and after partitioning with METIS.

    Matrix          MFLOPS (before / after)    Cache hit rate (before / after)
    bcsstk32.psa    8.24 / 8.08                90.6% / 90.5%
    lns_3937.rua    7.87 / 7.72                96.6% / 96.4%
    psmigr_2.rua    8.64 / 8.64                96.6% / 96.4%
    saylr4.rua      8.73 / 8.73                97.9% / 97.5%

A possible explanation for why reordering by METIS did not improve performance is that the tested matrices already had a significant level of structure. To test this explanation, we randomly renumbered the grid of the original test matrix and then used METIS to partition and reorder this randomly renumbered matrix. The results are given in Table 2. They confirm the explanation: given a matrix with a random structure, a METIS reordering was indeed able to improve performance.

The improvement, however, did not bring performance back to the level of the original matrix. We conclude from this that the sparse matrices arising from real-world problems usually have a significant level of structure. Toledo [4] reports similar results for partitioning schemes on an IBM RS/6000.

4 Fine-Grained Parallelism

In the previous section, we analyzed the data-locality characteristics of CSR matrix-vector multiplication. Here, we discuss the other primary issue in single-node performance, fine-grained parallelism. For modern superscalar, pipelined processors to operate efficiently, they must receive an adequate mix of instructions that can be executed concurrently. For looping algorithms such as sparse matrix-vector multiplication, this requirement means there must be an adequate number of arithmetic operations between the branch statements that perform the loops. The original CSR code falls short in this respect; only a small number of operations are performed at each iteration.

The well-established solution to this problem is loop unrolling: each iteration is modified to include operations on multiple elements instead of just one, and the number of iterations is consequently decreased. However, CSR can make loop unrolling relatively difficult and inefficient. If the number of nonzero elements per row is small, the inner loop of a CSR code runs over only a few iterations, which may marginalize the benefits of inner-loop unrolling. In addition, CSR matrices can have rows of variable length. This variability makes outer-loop unrolling almost impossible, and it forces the overhead of inner-loop unrolling to be incurred for every row, marginalizing the benefits of unrolling even further.

Improving fine-grained parallelism for CSR then amounts to improving the effectiveness of loop unrolling. A first step is improving the structure of the matrix by rearranging the rows according to length (Figure 3; a sketch of this sorting step appears below). The resulting matrix has a block structure, where each block has equal-length rows. Note that the reordering may have detrimental side effects: if the original matrix has a structure giving good data locality, the reordering may move neighboring rows far apart, eliminating some of this locality. Improvements in fine-grained parallelism must overcome any such loss of locality in order for performance to improve.

[Figure 3: Sorting rows by length. The 1-D value array is represented pictorially by a matrix with jagged rows.]
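A minimal sketch of this sorting step (not the authors' code): a counting sort over row lengths produces a permutation that makes equal-length rows contiguous, which could then be applied with a routine like the permute_csr sketch above.

    #include <stdlib.h>

    /* Compute perm[] such that applying it orders the rows of a CSR matrix by
     * increasing length, so that equal-length rows form contiguous blocks.
     * perm[k] = original index of the row placed at position k.
     * The counting sort is stable, preserving original order within a block. */
    void sort_rows_by_length(int n, const int *i, int *perm)
    {
        int maxlen = 0;
        for (int k = 0; k < n; k++)
            if (i[k+1] - i[k] > maxlen)
                maxlen = i[k+1] - i[k];

        int *count = calloc(maxlen + 2, sizeof(int));
        for (int k = 0; k < n; k++)
            count[i[k+1] - i[k] + 1]++;       /* histogram of row lengths */
        for (int len = 1; len <= maxlen + 1; len++)
            count[len] += count[len - 1];     /* prefix sums: block start offsets */
        for (int k = 0; k < n; k++)
            perm[count[i[k+1] - i[k]]++] = k; /* place each row in its block */
        free(count);
    }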
The next step is to store the reordered matrix in a form that facilitates loop unrolling. We analyze three options for data representations that may enhance the effectiveness of unrolling.

The first option is column-major CSR (CM-CSR), where the rows are compressed but the resulting jagged-row matrix is stored in column-major form (Figure 4). The array that stores the beginning of each row in original CSR instead stores the beginning of each column in CM-CSR.

[Figure 4: CM-CSR stores the row-sorted matrix in column-major form.]
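A minimal sketch of the resulting matrix-vector product (the field names ncols and cs are assumptions for illustration, not the paper's notation): with rows sorted by decreasing length, the rows that have a c-th nonzero are exactly the first cs[c+1]-cs[c] rows, so each jagged column is one long contiguous sweep.

    /* Matrix-vector product for a CM-CSR (jagged-diagonal-style) matrix whose
     * rows are sorted by decreasing length and stored column-major.
     * Assumed layout (illustrative names):
     *   ncols         length of the longest row (number of jagged columns)
     *   cs[0..ncols]  start of each jagged column in a[] and j[]
     *   a[], j[]      values and column indices, stored column-major
     * y[] is indexed in the row-sorted order and must be zeroed by the caller;
     * the result is permuted back to the original ordering afterwards. */
    void cmcsr_matvec(int ncols, const int *cs, const int *j, const double *a,
                      const double *x, double *y)
    {
        for (int c = 0; c < ncols; c++) {
            int nrows = cs[c+1] - cs[c];      /* rows that have a c-th nonzero */
            for (int r = 0; r < nrows; r++)   /* long inner loop, easy to unroll */
                y[r] += a[cs[c] + r] * x[j[cs[c] + r]];
        }
    }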

CM-CSR may make inner-loop unrolling more effective because the inner loop runs over much longer subsections of the value array. This form is equivalent to the Jagged Diagonal storage format [5], which was designed for vector machines, such as the Cray Y-MP, where short vector operations are relatively inefficient. For cache-based architectures, however, the advantage in unrolling may not be able to overcome the disadvantage in data locality. In column-major form, the temporal locality of the target vector may be lost; each element is reused only after all other elements have been accessed.

This locality problem is a motivation for the second modified form, blocked column-major CSR (BCM-CSR). After reordering, the matrix is stored in blocks of equal row length, where each block is stored in column-major form (Figure 5). The columns of a block may be short enough that some temporal locality of the target vector can be exploited. Conversely, the columns are likely to be longer than the rows, making inner-loop unrolling more efficient. The blocking also gives other bonuses. Since all the columns of a block are the same length, outer-loop unrolling is practical. For the same reason, inner-loop unrolling is more efficient, because the overhead of the unrolling can be moved to an outer loop.

The final option for modifying CSR simply takes the blocking of BCM-CSR and applies it to row-major CSR, giving BRM-CSR (Figure 5). BRM-CSR maintains all the locality of the original CSR except what is lost in reordering, and it allows the full range of inner-loop and outer-loop unrolling, including moving unrolling overhead to outer loops.

[Figure 5: Blocked matrix formats: BCM-CSR and BRM-CSR.]

5 Performance Analysis of Alternative Implementations

The various modifications of CSR have different strengths and weaknesses, and the relative performance of a particular form can vary with problem domain and machine type. We therefore tested each representation using matrices of various sizes from different application domains, and we tested them on a number of machines.

We started with a number of sample sparse matrices from the Matrix Market [1] and the University of Florida collection [2]. The matrices used are listed in Table 4; they were chosen to span a wide range of average row length and total size.

Table 4: Sparse matrices tested: bcsstk20, bcsstk32, blckhole, crystk02, ct20stif, e05r0000, memplus, msc10848, onetone1, and watson5.

For each of these matrices, we timed multiplication using multiple implementations of each representation, with varying levels of inner-loop and outer-loop unrolling (Table 5). In each case, a number of successive matrix-vector multiplications were performed using the same matrix, so that the measured performance would be representative of use in a context such as the iterative solution of sparse linear systems, where repeated matrix-vector products are performed with the same matrix.
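As an illustration of why the blocking helps, here is a minimal sketch (assumed argument names, not the authors' implementation) of the product for one BRM-CSR block of equal-length rows, with the loop across rows unrolled by four and the loop along each row left rolled, corresponding to 4x1 unrolling in the notation of Table 5.

    /* Matrix-vector product for one BRM-CSR block: 'nrows' consecutive rows of
     * equal length 'len', stored row-major starting at offset 'off' in a[]/j[].
     * row0 is the index of the block's first row in the sorted ordering. */
    void brm_block_matvec_4x1(int row0, int nrows, int len, int off,
                              const int *j, const double *a,
                              const double *x, double *y)
    {
        int r = 0;
        for (; r + 4 <= nrows; r += 4) {      /* four rows per iteration */
            double t0 = 0.0, t1 = 0.0, t2 = 0.0, t3 = 0.0;
            const double *a0 = a + off + (r    ) * len, *a1 = a0 + len;
            const double *a2 = a1 + len,                *a3 = a2 + len;
            const int    *j0 = j + off + (r    ) * len, *j1 = j0 + len;
            const int    *j2 = j1 + len,                *j3 = j2 + len;
            for (int l = 0; l < len; l++) {   /* four independent multiply-adds */
                t0 += a0[l] * x[j0[l]];
                t1 += a1[l] * x[j1[l]];
                t2 += a2[l] * x[j2[l]];
                t3 += a3[l] * x[j3[l]];
            }
            y[row0 + r]     = t0;
            y[row0 + r + 1] = t1;
            y[row0 + r + 2] = t2;
            y[row0 + r + 3] = t3;
        }
        for (; r < nrows; r++) {              /* cleanup for leftover rows */
            double t = 0.0;
            for (int l = 0; l < len; l++)
                t += a[off + r * len + l] * x[j[off + r * len + l]];
            y[row0 + r] = t;
        }
    }

Because every row in a block has the same length, the unrolling cleanup is needed only once per block rather than once per row, and the four accumulators give the pipelined functional units independent operations to overlap.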

Table 5: Levels of unrolling tested for each form of CSR (original, CM, BRM, and BCM), displayed as "column-loop unrolling" x "row-loop unrolling": 1x1, 1x2, 1x4, 1x8, 4x1, 8x1, and 2x8. For original and BRM, the row loop is the inner loop, while the column loop is the inner loop for CM and BCM.

We measured double-precision (double) performance on single nodes of four parallel machines at the Ohio Supercomputer Center: a Convex Exemplar SPP-1200, an IBM SP2, an SGI Power Challenge, and an SGI Onyx. In addition, we measured performance on a single-processor DEC 3000/300 AXP workstation. Because of space constraints, we only discuss results for the SPP-1200 in this paper.

Figure 6 has performance plots for execution on the SPP-1200. Along with the results for original CSR, only the best-performing alternative representations are shown: 1x8 BRM-CSR, 4x1 BRM-CSR, and 1x4 BCM-CSR. Four performance regimes emerge from these figures: small systems with short rows, small systems with long rows, large systems with short rows, and large systems with long rows.

[Figure 6: Performance of sample sparse matrix-vector multiplication for a single node of the SPP-1200. The two charts divide the sample systems into those that fit into the data cache (< 256 KB: bcsstk20, blckhole, e05r0000, watson5) and those that do not (bcsstk32, crystk02, ct20stif, memplus, msc10848, onetone1).]

To evaluate each of these regimes more carefully and systematically, we used synthetic matrices based on nearest-neighbor grid structures. Their size and average row length were varied, as shown in Figure 7; the stencil sizes roughly translate into average row sizes. Again, only the results of the best-performing implementations are shown.

The figures illustrate differences among the four regimes defined above. For systems that fit into cache, the alternative representations result in higher absolute performance and greater improvements in performance over CSR. The higher absolute performance comes as no surprise, since cache access is much less expensive than main-memory access. For systems that do not fit in cache, the benefits of improved fine-grained parallelism do not yield as much performance improvement as with the systems that fit within cache; this is due to functional-unit pipeline stalls for main-memory accesses in the case of the large systems.

The figures also illustrate more subtle differences between long-row and short-row systems. The opportunities for fine-grained parallelism differ for long and short rows, so different implementations have better characteristics in the two regimes. The results from the other computer systems we tested reveal that the four-regime picture of performance is generally applicable. The split between large and small clearly depends upon cache size, while the split between long and short is consistently around 10 elements.
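A library could select among these implementations from the same two quantities. The sketch below is entirely illustrative: the function name, the thresholds, and the working-set estimate are assumptions, not part of the paper; it simply classifies a CSR matrix into one of the four regimes discussed above.

    #include <stddef.h>

    enum regime { SMALL_SHORT, SMALL_LONG, LARGE_SHORT, LARGE_LONG };

    /* Classify an n x n CSR matrix with nnz nonzeros into one of the four
     * performance regimes.  cache_bytes is the data-cache size of the target
     * machine (256 KB in the measurements above) and row_threshold the
     * long/short split (around 10 nonzeros per row).  The working-set estimate
     * counts the matrix arrays plus both vectors. */
    enum regime classify_matrix(int n, int nnz, size_t cache_bytes,
                                double row_threshold)
    {
        size_t working_set = (size_t)nnz * (sizeof(double) + sizeof(int)) /* a[], j[] */
                           + (size_t)(n + 1) * sizeof(int)                /* i[]      */
                           + 2 * (size_t)n * sizeof(double);              /* x[], y[] */
        int is_small = working_set < cache_bytes;
        int is_short = (double)nnz / n < row_threshold;
        if (is_small)
            return is_short ? SMALL_SHORT : SMALL_LONG;
        return is_short ? LARGE_SHORT : LARGE_LONG;
    }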

For the SPP-1200, the results show that a single implementation is most effective across each regime. For small-short, 1x8 BRM-CSR consistently performs best; 1x4 BCM-CSR is best for large-short; and 4x1 BRM-CSR is best for small-long and large-long. The results for the other machines show that this holds across architectures: each regime tends to have a single best implementation. Which implementation proves best for a particular regime depends on the architecture, however; the best choices for the Convex SPP-1200 differ from those for the SGI Power Challenge or the IBM SP2.

[Figure 7: Performance of grid-based sparse matrix-vector multiplication for a single node of the SPP-1200, plotted against the number of nonzero elements. The two graphs ("Performance on 2-D Grid, 5-pt. Stencil" and "Performance on 3-D Grid, 27-pt. Stencil") illustrate the differences between short and long rows.]

6 Conclusions

Single-node performance of sparse matrix-vector multiplication depends on two primary factors: data locality and fine-grained parallelism. Typical sparse matrices, such as those from grid problems or from the Harwell-Boeing collection, already have structures that support data locality. By improving fine-grained parallelism, however, a change in the local representation of a matrix can lead to significant performance improvements. The optimal choice of representation and implementation depends on three general characteristics: the size of the matrix relative to the machine's data cache, the average number of nonzero elements per row of the matrix, and the particular architecture of the machine.

Since the alternative representations we describe involve only transformations of local data, they can be used in parallel libraries with little modification. The choice of the most efficient representation is dictated by the easily determined characteristics of the input matrix described above, and no change in the existing interfaces of libraries is required. Our work can therefore be seen as complementary to other efforts in parallel-iterative-library development. The integration of these performance-enhancing techniques within existing parallel libraries appears to be straightforward, and the performance improvements promise to be significant.

References

[1] R. Boisvert, R. Pozo, K. Remington, B. Miller, and R. Lipman. Matrix Market.
[2] Tim Davis. Sparse Matrix Project: Directory Index, ~davis/sparse/hb-menu.html.
[3] George Karypis and Vipin Kumar. METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 2.0. University of Minnesota, Minneapolis, MN.
[4] Sivan Toledo. "Improving Instruction-Level Parallelism in Sparse Matrix-Vector Multiplication Using Reordering, Blocking, and Prefetching." Proc. 8th SIAM Conference on Parallel Processing for Scientific Computing.
[5] Yousef Saad. "Krylov Subspace Methods on Supercomputers." SIAM Journal on Scientific and Statistical Computing, 10 (1989).
