
1 Auto Tuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS
*Yuta SAWA and Reiji SUDA, The University of Tokyo
iWAPT 2009, October 1-2
*Now in Central Research Laboratory, Hitachi, Ltd.

2 Background: Multi-Core CPUs in personal computers

3 CPUs on personal computers
[Table: the fastest Intel CPU for PCs by year, with core (thread) counts and GFLOPS: Pentium, Pentium 4, Pentium D, Core i7 (8 threads)]
How large a matrix multiplication can be solved in one second? Over this period the dimension grew by about 5 times, and the FLOPS by about 125 times.

4 Users' paradigm shift in the usage of computers
A decade ago, users ran applications sequentially; otherwise the performance of each application decreased.
[Figure: App1 and App2 sharing one CPU, either run one after the other or disturbing each other]
Today, users can run a few applications concurrently without much overhead, because each application can run on its own core.
[Figure: App1 and App2 running concurrently on two cores of one CPU]

5 BLAS (Basic Linear Algebra Subprograms)
BLAS is an interface for multiplication routines of vectors and matrices
BLAS is a basic building block of numerical calculations
BLAS is called from many other libraries such as LAPACK, ScaLAPACK, and ARPACK
[Figure: software stack with ARPACK, LAPACK, ScaLAPACK, and PBLAS layered on top of BLAS]

6 DGEMM routine in BLAS
DGEMM is the simplest double-precision matrix multiplication, written as follows:
C := αAB + βC, where A is m-by-k, B is k-by-n, and C is m-by-n
[Figure: block diagram of C := αAB + βC with the dimensions m, n, and k]
The computational complexity of the DGEMM routine is O(mnk)
This DGEMM routine is our target problem in this paper
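
For reference, here is a minimal sketch of how this operation is typically invoked through the standard CBLAS interface; it assumes a CBLAS-providing library (ATLAS, GotoBLAS, etc.) is linked, and the matrix sizes are only illustrative:

    #include <cblas.h>    /* CBLAS prototypes, provided by ATLAS, GotoBLAS, ... */
    #include <stdlib.h>

    int main(void)
    {
        const int m = 512, n = 512, k = 512;                 /* illustrative sizes */
        const double alpha = 1.0, beta = 0.0;
        double *A = calloc((size_t)m * k, sizeof(double));   /* m-by-k */
        double *B = calloc((size_t)k * n, sizeof(double));   /* k-by-n */
        double *C = calloc((size_t)m * n, sizeof(double));   /* m-by-n */
        /* ... fill A and B with real data here ... */

        /* C := alpha*A*B + beta*C, row-major storage, no transposition */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, alpha, A, k, B, n, beta, C, n);

        free(A); free(B); free(C);
        return 0;
    }

The exact link flag depends on the BLAS implementation (for example -lcblas with ATLAS).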

7 BLAS on personal computers
The DGEMM routine has much parallelism
On personal computers, the number of available CPU cores changes depending on the other applications
[Figure: a 2-core CPU where App 1 occupies one core over part of the timeline, leaving one or two cores available for BLAS]
Two parameters describe the number of CPU cores: the number of physical cores p (here p = 2) and the number of cores currently available for BLAS

8 High-performance BLAS implementations
GotoBLAS is the fastest BLAS implementation in the world today; users set the number of threads themselves
[Figure: GotoBLAS with NUM_THREAD=4 running one BLAS sub-task on each of the four cores]
ATLAS is a BLAS implementation that uses auto tuning; ATLAS decides the number of threads automatically depending on the problem size
[Figure: for small problems ATLAS uses two cores, for large problems it uses all four cores]

9 Importance of Dynamic Load Balancing
In traditional BLAS implementations, all sub-tasks have almost the same size
[Figure: four equal-sized BLAS sub-tasks, one per core]
If other tasks are running concurrently, dynamic load balancing works better: the core that also runs App 1 is given less BLAS work, so all cores finish at about the same time
[Figure: with equal-sized sub-tasks the core shared with App 1 becomes a straggler; with dynamic load balancing the work is redistributed]

10 Dynamically Load-Balanced BLAS (DL-BLAS)

11 What is DL-BLAS?
DL-BLAS is a BLAS implementation
DL-BLAS is well parallelized even if other applications are running concurrently
[Figure: four cores; App 1 and App 2 occupy parts of two cores while BLAS sub-tasks fill the remaining time on all cores]

12 DL-BLAS parallelization algorithm
DL-BLAS splits a calculation into sub-tasks and assigns them to CPU cores using dynamic load balancing (a sketch of such a worker loop follows below)
Each core calculates its sub-tasks using an underlying BLAS implementation (e.g., ATLAS)
[Figure: the original calculation is split into sub-tasks, DL-BLAS assigns the sub-tasks to cores 1..n, and each core computes them with ATLAS]
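
As an illustration only (not the authors' actual code), a minimal sketch of this kind of dynamic load balancing: OpenMP threads pull tile indices from a shared counter, so a core that is also running another application simply fetches fewer tiles. tile_task and run_tile_dgemm are hypothetical names for the per-tile bookkeeping and the serial tile-DGEMM call.

    #include <omp.h>

    /* Hypothetical task descriptor: one (nb x nb) block of C. */
    typedef struct { int row, col; } tile_task;

    void dlblas_like_dgemm(int num_tasks, tile_task *tasks,
                           void (*run_tile_dgemm)(const tile_task *))
    {
        int next = 0;                        /* shared task counter */

        #pragma omp parallel
        {
            for (;;) {
                int my_task;
                #pragma omp atomic capture   /* fetch-and-increment = dynamic assignment */
                my_task = next++;

                if (my_task >= num_tasks)
                    break;                   /* nothing left for this thread */
                run_tile_dgemm(&tasks[my_task]);   /* serial DGEMM on one tile */
            }
        }
    }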

13 DL-BLAS overview
Problems and our solutions:
STEP 1. Problem: static scheduling is inefficient on personal computers. Solution: tile the matrices into sub-matrices of a given block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.

14 DL-BLAS overview
Problems and our solutions:
STEP 1. Problem: static scheduling is inefficient on personal computers. Solution: tile the matrices into sub-matrices of a given block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.

15 DL-BLAS task splitting
The DGEMM calculation can be written as C = αAB + βC
C is partitioned into (n_b, n_b) blocks, A into (n_b, k_b) blocks, and B into (k_b, n_b) blocks
[Figure: the m-by-n matrix C, the m-by-k matrix A, and the k-by-n matrix B tiled with block size (n_b, k_b); one task updates one (n_b, n_b) block of C with an (n_b, k_b) times (k_b, n_b) product]
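
A small hypothetical helper showing how the number of tasks follows from this tiling: one task per (n_b, n_b) block of C, with each task accumulating over the k_b-wide panels of A and B, so the task count depends only on m, n, and n_b. Rounding up is assumed here so that boundary tiles are counted; the slides simply write m/n_b and n/n_b.

    /* Round-up integer division, so sizes that are not multiples of the
       block size still get a (smaller) boundary tile. */
    static int ceil_div(int x, int y) { return (x + y - 1) / y; }

    /* One task per (nb x nb) block of C; the k dimension is swept inside a task. */
    int count_tasks(int m, int n, int nb)
    {
        return ceil_div(m, nb) * ceil_div(n, nb);
    }

For example, a 1000-by-1000 problem with n_b = 56 yields 18 * 18 = 324 tasks, comfortably above the 17-task threshold introduced later.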

16 Parameter Tuning Algorithms in DL-BLAS

17 DL-BLAS overview
Problems and our solutions:
STEP 1. Problem: static scheduling is inefficient on personal computers. Solution: tile the matrices into sub-matrices of a given block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.

18 DL-BLAS performance modeling
DL-BLAS performance = single-thread performance multiplied by the multi-thread speed-up
We modeled the lower bound of the multi-thread speed-up
We created an algorithm to find a quasi-maximum value of the single-thread performance

19 Multi-thread speed-up and the number of sub-tasks
Example 1: sub-tasks are large. With 3 available CPU cores and 4 sub-tasks, they need 2 units of time, so the speed-up rate is 4 / 2 = 2 and the parallel efficiency is 2 / 3 = 0.667.
Example 2: sub-tasks are small. With 3 available CPU cores and 10 sub-tasks, they need 4 units of time, so the speed-up rate is 10 / 4 = 2.5 and the parallel efficiency is 2.5 / 3 = 0.833.
If there are only a small number of tasks, the speed-up rate can be small.
[Figure: execution timelines of the two examples on a 4-core CPU where App 1 occupies one core]

20 Trade-off between multi-thread speed-up and single-thread performance
Generally there are the following trends:
A larger block size gives higher single-thread calculation performance
A larger block size makes the number of tasks smaller, and then the load-balancing mechanism does not work well
[Figure: as the block size grows (and the number of sub-tasks shrinks), the single-thread performance curve rises while the multi-thread speed-up curve falls]

21 Multi-thread speed-up rate (1/2)
We denote two parameters, p and i:
p: the number of physical cores (does NOT change on a given machine)
i: the number of CPU cores available to BLAS (may be changed by other applications)
Clearly, i <= p
[Figure: timeline of a 4-core CPU where App1, App2, and App3 occupy cores over time, so i varies while p stays fixed]
Letting the number of sub-tasks be s, we define the speed-up rate h(i, s) and the efficiency h'(i, s) by
h(i, s) = i * h'(i, s) = (GFLOPS of the multi-threaded calculation using i cores) / (GFLOPS of the single-threaded calculation)

22 Multi-thread speed-up rate (2/2)
h(i, s) = i * h'(i, s) = (GFLOPS of the multi-threaded calculation using i cores) / (GFLOPS of the single-threaded calculation)
[Figure: h'(i, s) plotted against the number of tasks s for i = 2, 3, 4]
Can we find a lower bound of these curves?

23 Theorem about speed-up rate
We proved the following theorem: for every 1 < i <= p,
h'(i, s) >= f(s) = s / (s + p - 1)
where
p: the number of physical CPU cores
s: the number of tasks
i: the number of CPU cores that can be used at that time

24 Example of speed-up rate
Case: p = 4 (in the evaluations we used machines that have 4 physical CPU cores or 4 CPU threads).
[Figure: the lower bound f(s) = s / (s + p - 1) for p = 4, plotted against the number of tasks s; it reaches 0.85 at s = 17]
p: the number of physical CPU cores; s: the number of tasks; i: the number of CPU cores that can be used at that time; the bound holds for every 1 < i <= p.
We therefore created at least 17 tasks (we considered 85% to be well parallelized).
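
A quick check of that 17-task rule, assuming the lower bound f(s) = s / (s + p - 1) modeled above; min_tasks_for_efficiency is an illustrative helper, not part of DL-BLAS:

    #include <stdio.h>

    /* Smallest task count s whose worst-case parallel efficiency
       s / (s + p - 1) reaches the given target. */
    static int min_tasks_for_efficiency(int p, double target)
    {
        int s = 1;
        while ((double)s / (s + p - 1) < target)
            s++;
        return s;
    }

    int main(void)
    {
        /* For p = 4 cores and an 85% efficiency target this prints 17,
           matching the "at least 17 tasks" rule on the slide. */
        printf("%d\n", min_tasks_for_efficiency(4, 0.85));
        return 0;
    }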

25 DL-BLAS overview
Problems and our solutions:
STEP 1. Problem: static scheduling is inefficient on personal computers. Solution: tile the matrices into sub-matrices of a given block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.

26 Searching space of block size
The basic idea is as follows:
[Figure: single-thread performance and multi-thread speed-up rate versus block size; within the well-parallelized area (at least 17 tasks) the speed-up rate is not so different]
So we search the block-size parameters in the range from s_min = 0.6 * s_max to s_max.
How do we search within this range?

27 Relation between block sizes and single-thread performance
Exhaustive search results for a 1000-by-1000 square matrix multiplication on a Q6600 (Intel Core 2 Quad):
[Figure: single-thread performance of the (n_b, n_b) += (n_b, k_b) * (k_b, n_b) kernel over the (n_b, k_b) plane, with the highest-performance block size marked; the actual measurements follow the general trend from small block sizes (many tasks) toward large block sizes (few tasks)]

28 Diagonal Searching Algorithm
Evaluating all parameters with s_min <= n_b, k_b <= s_max takes too much time; the Diagonal Searching Algorithm decreases the calculation cost.
Algorithm:
1. Evaluate all parameters on the diagonal s_min <= n_b = k_b <= s_max, and let j be the block size with the highest performance
2. Search the parameters with n_b = j or k_b = j
3. Choose the highest-performance block size found in steps 1 and 2
[Figure: the searched cells form a cross through (j, j) in the (n_b, k_b) plane]
This algorithm returned the parameters that gave us the highest performance on many architectures (Intel Core 2 Extreme, Intel Core i7, AMD Phenom, AMD Phenom II).
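
An illustrative sketch of the diagonal search. benchmark_single_thread stands for timing the real single-threaded blocked kernel at a given (n_b, k_b); here it is only a synthetic placeholder surface so that the sketch runs on its own.

    #include <stdio.h>

    /* Placeholder for a real timing run of the blocked kernel at (nb, kb). */
    static double benchmark_single_thread(int nb, int kb)
    {
        double dn = nb - 200, dk = kb - 190;          /* synthetic smooth peak */
        return 10.0 - 0.0005 * (dn * dn + dk * dk);
    }

    /* Diagonal Searching Algorithm: probe the diagonal nb == kb first,
       then the row and the column through the best diagonal point. */
    static void diagonal_search(int smin, int smax, int *best_nb, int *best_kb)
    {
        double best = -1.0;
        int j = smin;

        for (int b = smin; b <= smax; b++) {          /* step 1: diagonal */
            double g = benchmark_single_thread(b, b);
            if (g > best) { best = g; j = b; *best_nb = b; *best_kb = b; }
        }
        for (int b = smin; b <= smax; b++) {          /* step 2: cross through (j, j) */
            double g = benchmark_single_thread(b, j); /* vary nb, fix kb = j */
            if (g > best) { best = g; *best_nb = b; *best_kb = j; }
            g = benchmark_single_thread(j, b);        /* fix nb = j, vary kb */
            if (g > best) { best = g; *best_nb = j; *best_kb = b; }
        }
        /* step 3: (*best_nb, *best_kb) is the best block size found */
    }

    int main(void)
    {
        int nb, kb;
        diagonal_search(150, 250, &nb, &kb);
        printf("best block size: (%d, %d)\n", nb, kb);
        return 0;
    }

With a range of w candidate block sizes this evaluates roughly 3w points instead of w * w.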

29 DL-BLAS overview
Problems and our solutions:
STEP 1. Problem: static scheduling is inefficient on personal computers. Solution: tile the matrices into sub-matrices of a given block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.

30 Relation between block size and problem size
The multi-thread speed-up rate changes when the problem size changes.
[Figure: single-thread performance versus block size, together with the multi-thread speed-up curves for a larger and a smaller problem; the well-parallelized area for the smaller problem ends at a smaller block size than that for the larger problem]

31 Reductive Searching Algorithm
We want to obtain multiple (n_b, k_b) pairs.
Algorithm:
1. Initialize [s_min, s_max] to [150, 250], and a = 1
2. In the range [s_min, s_max], run the Diagonal Searching Algorithm, and let the solution be (n_a, k_a)
3. If the single-threaded calculation is faster than the multi-threaded calculation, terminate the algorithm
4. Let [s_min, s_max] = [0.5 * n_a, 0.9 * n_a] and add (n_a, k_a) to the result-list
5. a := a + 1 and go to step 2
[Figure: the nested search ranges in the (n_b, k_b) plane, and the result-list ordered from the largest parameter (n_1, k_1) down to the smallest]
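
Continuing the same illustration, a sketch of the reductive search built on the diagonal_search sketch above. is_single_thread_faster is a placeholder for benchmarking one thread against the dynamically load-balanced run at a given block size; block_pair and the constants are illustrative.

    typedef struct { int nb, kb; } block_pair;

    /* Placeholder: real code would time both variants for this block size. */
    static int is_single_thread_faster(int nb, int kb)
    {
        (void)kb;
        return nb < 12;
    }

    /* Reductive Searching Algorithm: repeatedly run the diagonal search on a
       shrinking range, collecting one (nb, kb) pair per range, large to small. */
    static int reductive_search(block_pair *result_list, int max_pairs)
    {
        int smin = 150, smax = 250;          /* initial range from the slide */
        int count = 0;

        while (count < max_pairs) {
            int nb, kb;
            diagonal_search(smin, smax, &nb, &kb);
            if (is_single_thread_faster(nb, kb))
                break;                       /* blocks too small: stop refining */
            result_list[count].nb = nb;
            result_list[count].kb = kb;
            count++;
            smin = (int)(0.5 * nb);          /* next range: [0.5*nb, 0.9*nb] */
            smax = (int)(0.9 * nb);
        }
        return count;                        /* number of pairs in the result-list */
    }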

32 Parameter Selection
For each problem C := αAB + βC (C is m-by-n), we select one pair (n_a, k_a) from the result-list using the following algorithm:
for i = 1 to length(result-list):
    num_tasks = (m / n_i) * (n / n_i)
    if num_tasks > 16: return (n_i, k_i)
end for
return  // single-threaded calculation
Because the result-list is ordered from the largest parameter to the smallest, this selects the largest pair of parameters that makes more than 16 tasks.
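
And a matching sketch of the selection step, reusing the block_pair type from the sketch above; it returns the index of the chosen pair, or -1 to mean "fall back to a single-threaded calculation":

    /* Parameter Selection: pick the largest (nb, kb) pair that yields more than
       16 tasks for an m-by-n result matrix C (the result-list is ordered large
       to small, so the first hit is the largest suitable pair). */
    static int select_parameters(const block_pair *result_list, int count,
                                 int m, int n)
    {
        for (int i = 0; i < count; i++) {
            int nb = result_list[i].nb;
            long num_tasks = (long)(m / nb) * (n / nb);   /* as on the slide */
            if (num_tasks > 16)
                return i;
        }
        return -1;                           /* use a single thread instead */
    }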

33 Algorithms work-flow
Installation time: prepare the hardware and the underlying BLAS routine, tune DL-BLAS using the Diagonal Searching Algorithm and the Reductive Searching Algorithm, and store the result-list on the disk.
Calculation time: when a DL-BLAS routine is called, load the result-list from the disk, decide the block size with the Parameter Selection Algorithm, and calculate.
[Figure: result-list of (n_b, k_b) pairs passed from installation time to calculation time]

34 Performance Evaluation Results

35 Algorithms work-flow
Installation time: prepare the hardware and the underlying BLAS routine, tune DL-BLAS using the Diagonal Searching Algorithm and the Reductive Searching Algorithm, and store the result-list on the disk.
Calculation time: when a DL-BLAS routine is called, load the result-list from the disk, decide the block size with the Parameter Selection Algorithm, and calculate.
[Figure: result-list of (n_b, k_b) pairs passed from installation time to calculation time]

36 Hardware and Problems
We used the following CPU models:
Intel Core 2 Extreme QX9650
Intel Core i7 965
Intel Atom 330
AMD Phenom 9600
AMD Phenom II X4 940
First, we evaluate the Diagonal Searching Algorithm using 1000-by-1000 square matrix multiplications.

37 Results of Diagonal Searching Algorithm
Intel Core 2 Extreme (QX9650): [heat map of single-thread GFLOPS over (n_b, k_b)] the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
Intel Atom 330: [heat map of single-thread GFLOPS over (n_b, k_b)] the result of the Diagonal Searching Algorithm reaches 1.69 GFLOPS, versus 1.7 GFLOPS for the highest-performance parameters.

38 Results of Diagonal Searching Algorithm
Intel Core i7 965 with hyper-threading: [heat map of single-thread GFLOPS over (n_b, k_b)] the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
Intel Core i7 965 without hyper-threading: [heat map] the result of the Diagonal Searching Algorithm equals the highest-performance parameters.

39 Results of Diagonal Searching Algorithm
AMD Phenom 9600: [heat map of single-thread GFLOPS over (n_b, k_b)] the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
AMD Phenom II X4 940: [heat map] the result of the Diagonal Searching Algorithm equals the highest-performance parameters.

40 Results of Reductive Searching Algorithm
Intel Core 2 Extreme (QX9650). The result-list:
(12, 13), (16, 2), (24, 24), (48, 5), (56, 56), (112, 98), (168, 168), (224, 168)
[Figure: the pairs plotted in the (n_b, k_b) plane]
This calculation needed less than half an hour.

41 Results of Parameter Selection Algorithm
Intel Core 2 Extreme (QX9650). In square matrix multiplication, the following parameters are used (dimension of the square matrix: used parameter (n_b, k_b)):
~48: single thread
49~64: (12, 13)
65~96: (16, 2)
97~192: (24, 24)
193~224: (48, 5)
225~448: (56, 56)
449~672: (112, 98)
673~896: (168, 168)
897~: (224, 168)
[Figure: the selected pairs marked in the (n_b, k_b) plane]

42 Results of Reductive Searching Algorithm
AMD Phenom II X4 940. The result-list:
(41, 4), (64, 4), (56, 56), (14, 8), (92, 8), (132, 12), (172, 12), (26, 2)
[Figure: the pairs plotted in the (n_b, k_b) plane]
This calculation needed less than half an hour.

43 Results of Parameter Selection Algorithm
AMD Phenom II X4 940. In square matrix multiplication, the following parameters are used (dimension of the square matrix: used parameter (n_b, k_b)):
~164: single thread
165~224: (41, 4)
225~256: (56, 56)
257~368: (64, 4)
369~416: (92, 8)
417~528: (14, 8)
529~684: (132, 12)
685~892: (172, 12)
825~: (26, 2)
[Figure: the selected pairs marked in the (n_b, k_b) plane]

44 Evaluation Machines
We solved square matrix multiplication problems with the following hardware and software.
CPU models:
Intel Core 2 Extreme QX9650
Intel Core i7 965
Intel Atom 330
AMD Phenom 9600
AMD Phenom II X4 940
BLAS implementations:
DL-BLAS
ATLAS
GotoBLAS (version 1.26)

45 Evaluation Circumstances
We set up the following three circumstances:
No other tasks are running (no-task case)
A busy-loop program is running concurrently (busy-loop case):
    while (true) { /* busy loop */ }
An inner-product (dot-product) program, which accesses memory, is running concurrently (inner-product case):
    while (true) {
        double ip = 0.0;
        for (i = 1; i <= n; i++) ip += a[i] * b[i];
    }

46 DGEMM calculation on QX9650 (Intel Core 2 Extreme)
GotoBLAS is fastest in the no-task case, but slowest in the busy-loop and inner-product cases
DL-BLAS is faster than ATLAS in all cases
DL-BLAS is fastest in the busy-loop and inner-product cases
[Figure: performance (GFLOPS) versus matrix size for GotoBLAS, ATLAS, and DL-BLAS with ATLAS, in three panels: no-task, busy-loop, inner-product]

47 DGEMM calculation on 965 (Intel Core i7) with hyper-threading
GotoBLAS is not so fast on this architecture; K. Goto, the developer of GotoBLAS, said that hyper-threading is harmful
ATLAS is fastest in some cases: middle-sized problems in the busy-loop and inner-product cases
[Figure: performance (GFLOPS) versus matrix size for GotoBLAS, ATLAS, and DL-BLAS with ATLAS, in three panels: no-task, busy-loop, inner-product]

48 High-performance BLAS implementations
GotoBLAS is the fastest BLAS implementation in the world today; users set the number of threads themselves
[Figure: GotoBLAS with NUM_THREAD=4 running one BLAS sub-task on each of the four cores]
ATLAS is a BLAS implementation that uses auto tuning; ATLAS decides the number of threads automatically depending on the problem size
[Figure: for small problems ATLAS uses two cores, for large problems it uses all four cores]

49 DGEMM calculation on 965 (Intel Core i7) without hyper-threading
GotoBLAS is fastest in the no-task case
In the other cases, the performance of GotoBLAS is unstable: other tasks disturb GotoBLAS
[Figure: performance (GFLOPS) versus matrix size for GotoBLAS, ATLAS, and DL-BLAS with ATLAS, in three panels: no-task, busy-loop, inner-product]

50 DGEMM calculation on Intel Atom 330 with hyper-threading
DL-BLAS is faster in the busy-loop and inner-product cases
The performance of DL-BLAS is not better than that of ATLAS in the no-task case; we consider that DL-BLAS is disturbed by hyper-threading
[Figure: performance (GFLOPS) versus matrix size for ATLAS and DL-BLAS with ATLAS, in three panels: no-task, busy-loop, inner-product]

51 DGEMM calculation on AMD Phenom 9600
GotoBLAS is fastest in the no-task case
DL-BLAS is fastest for small problems in the busy-loop and inner-product cases
For large problems, the performances are not so different
[Figure: performance (GFLOPS) versus matrix size for GotoBLAS, ATLAS, and DL-BLAS with ATLAS, in three panels: no-task, busy-loop, inner-product]

52 DGEMM calculation on X4 940 (AMD Phenom II)
GotoBLAS is a little faster than ATLAS and DL-BLAS in the no-task case
The performance of DL-BLAS is similar to that of ATLAS
[Figure: performance (GFLOPS) versus matrix size for GotoBLAS, ATLAS, and DL-BLAS with ATLAS, in three panels: no-task, busy-loop, inner-product]

53 Conclusion
We implemented a BLAS implementation that uses dynamic load balancing; we call this implementation DL-BLAS
Our implementation needs auto-tuning techniques to obtain the block size parameters; we call these techniques the Diagonal Searching Algorithm, the Reductive Searching Algorithm, and Parameter Selection
In some cases, the performance of DL-BLAS was better than that of ATLAS and GotoBLAS
The performance of DL-BLAS is consistently not so bad
