1 Auto Tuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS
*Yuta SAWA and Reiji SUDA, The University of Tokyo
iWAPT 2009, October 1-2
*Now in Central Research Laboratory, Hitachi, Ltd.
2 Background: multi-core CPUs in personal computers
3 CPUs on personal computers
[Table: year, fastest Intel CPU for PC (Pentium, Pentium 4, Pentium D, Core i7), cores (threads, up to 4 (8)), and GFLOPS; plus the dimension of matrix multiplication that can be solved in a second]
The solvable dimension grew about 5 times, which is about 125 times in FLOPS.
4 Users' paradigm shift in computer usage
A decade ago, users ran applications sequentially, or the performance of each application decreased when they shared one CPU. Today, users can run a few applications concurrently, each on its own core, without much overhead or disturbing each other.
5 BLAS (Basic Linear Algebra Subprograms)
BLAS is the interface for multiplication routines on vectors and matrices. BLAS routines are basic building blocks of numerical calculations and are called from many other libraries, such as LAPACK, ARPACK, ScaLAPACK, and PBLAS.
6 DGEMM routine in BLAS
DGEMM is the simplest double-precision matrix multiplication routine, written as C := αAB + βC, where A is m-by-k, B is k-by-n, and C is m-by-n. The computational complexity of the DGEMM routine is O(mnk). This DGEMM routine is our target problem in this paper.
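As a reference for the definition above, here is a minimal, unoptimized sketch of what DGEMM computes (plain Python with list-of-lists matrices; illustrative only, not the paper's implementation):

```python
def dgemm(alpha, A, B, beta, C):
    """Compute C := alpha*A*B + beta*C for A (m x k), B (k x n), C (m x n)."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for l in range(k):          # O(m*n*k) multiply-adds in total
                acc += A[i][l] * B[l][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

Optimized implementations such as GotoBLAS and ATLAS compute exactly this, but reorganize the loops into cache-friendly blocks.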
7 BLAS on personal computers
The DGEMM routine has much parallelism. On personal computers, however, the number of available CPU cores changes depending on the other applications. This gives two parameters about the number of CPU cores: the number of physical cores (p = 2 in the figure) and the number of cores available for BLAS (1 or 2, changing over time as App 2 starts and stops).
8 High-performance BLAS implementations
GotoBLAS is the fastest BLAS implementation in the world today; users set the number of threads by hand (e.g., four BLAS sub-tasks on four cores with NUM_THREAD=4). ATLAS is a BLAS implementation that uses auto-tuning; it decides the number of threads automatically depending on the problem size (few threads for small problems, all cores for large problems).
9 Importance of dynamic load balancing
In traditional BLAS implementations, every sub-task has almost the same size. If other tasks are running concurrently, the core they occupy finishes its equal-sized sub-task late, so using dynamic load balancing is better.
10 Dynamically Load-Balanced BLAS (DL-BLAS)
11 What is DL-BLAS?
DL-BLAS is a BLAS implementation that stays well parallelized even if other applications are running concurrently.
12 DL-BLAS parallelization algorithm
DL-BLAS splits a calculation into sub-tasks and assigns them to CPU cores using dynamic load balancing (this is the part DL-BLAS provides). Each core then calculates its sub-tasks using an existing BLAS implementation such as ATLAS.
13 DL-BLAS overview
STEP 1. Problem: static scheduling is inefficient on personal computers. Our solution: tile the matrices into sub-matrices with some block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Our solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Our solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.
15 DL-BLAS task splitting
The DGEMM calculation can be written as C := αAB + βC. C is partitioned into (n_b, n_b) sub-matrices, A into (n_b, k_b) sub-matrices, and B into (k_b, n_b) sub-matrices; the pair (n_b, k_b) is the block size. One task updates one (n_b, n_b) block of C with an (n_b, k_b) by (k_b, n_b) product.
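The tiling above can be sketched as a task generator (a hypothetical helper for illustration; the real DL-BLAS task structure may differ):

```python
def split_tasks(m, n, n_b):
    """Enumerate DL-BLAS-style tasks: one task per (n_b x n_b) block of C.

    Each task later loops over the k dimension in slices of size k_b,
    multiplying an (n_b x k_b) block of A by a (k_b x n_b) block of B
    and accumulating into its own block of C (so no two tasks write the
    same block).
    """
    return [(i0, j0) for i0 in range(0, m, n_b)
                     for j0 in range(0, n, n_b)]
```

The number of tasks is ⌈m/n_b⌉ times ⌈n/n_b⌉, which is exactly the count the Parameter Selection step later compares against its threshold.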
16 Parameter Tuning Algorithms in DL-BLAS
17 DL-BLAS overview
STEP 1. Problem: static scheduling is inefficient on personal computers. Our solution: tile the matrices into sub-matrices with some block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Our solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Our solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.
18 DL-BLAS performance modeling
DL-BLAS performance = single-thread performance × multi-thread speed-up. We modeled the lower bound of the multi-thread speed-up, and we created an algorithm to find a quasi-maximum value of the single-thread performance.
19 Multi-thread speed-up and the number of sub-tasks
Example 1 (sub-tasks are large): if there are 3 available CPU cores and 4 sub-tasks, they need 2 unit times. The speed-up rate is 4 / 2 = 2, and the parallel efficiency is 2 / 3 = .667.
Example 2 (sub-tasks are small): if there are 3 available CPU cores and 10 sub-tasks, they need 4 unit times. The speed-up rate is 10 / 4 = 2.5, and the parallel efficiency is 2.5 / 3 = .833.
If there is only a small number of tasks, the speed-up rate can be small.
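The unit-time model in these examples can be written down directly (a sketch: s equal sub-tasks on i available cores finish in ⌈s/i⌉ unit times):

```python
from math import ceil

def speedup(s, i):
    # s sub-tasks of one unit time each, i available cores:
    # the makespan is ceil(s / i) unit times.
    return s / ceil(s / i)

def efficiency(s, i):
    # parallel efficiency = speed-up per available core
    return speedup(s, i) / i
```

With s = 4 and i = 3 this reproduces Example 1 (speed-up 2, efficiency .667); with s = 10 it reproduces Example 2 (speed-up 2.5, efficiency .833).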
20 Trade-off between multi-thread speed-up and single-thread performance
Generally there are the following trends: a larger block size gives higher single-thread calculation performance, but a larger block size also makes the number of tasks smaller, and then the load-balancing mechanism does not work well. So single-thread performance favors a large block size with few sub-tasks, while the multi-thread speed-up rate favors a small block size with many sub-tasks.
21 Multi-thread speed-up rate (1/2)
We denote two parameters, p and i: p is the number of physical cores (it does not change within a machine), and i is the number of available CPU cores (it may be changed by other applications). Clearly, i ≤ p. Letting the number of sub-tasks be s, we define the speed-up rate h(i, s) and the parallel efficiency h'(i, s) by
h(i, s) = i · h'(i, s) = (GFLOPS of the multi-thread calculation using i cores) / (GFLOPS of the single-thread calculation).
22 Multi-thread speed-up rate (2/2)
h(i, s) = i · h'(i, s) = (GFLOPS of the multi-thread calculation using i cores) / (GFLOPS of the single-thread calculation).
[Graph: h'(i, s) as a function of s, for i = 2, 3, 4] Can we find the lower bound of these graphs?
23 Theorem about speed-up rate
We proved the following theorem: for every 1 < i ≤ p,
h'(i, s) ≥ f(s) = t / (t + (p − 1)/p), where t = ⌈s/p⌉,
p is the number of physical CPU cores, s is the number of tasks, and i is the number of CPU cores which can be used at that time.
24 Example of speed-up rate
Case p = 4 (in the evaluations we used machines which have 4 physical CPU cores or 4 CPU threads). The bound h'(i, s) ≥ f(s) = t / (t + (p − 1)/p), t = ⌈s/p⌉, holds for every 1 < i ≤ p; for p = 4 it reaches .87 at s = 17. We therefore created 17 or more tasks (we considered 85% to be well parallelized).
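The bound can be evaluated numerically; this is a sketch assuming the reconstruction f(s) = t / (t + (p − 1)/p) with t = ⌈s/p⌉ (the exact form on the original slide may differ slightly):

```python
from math import ceil

def f(s, p=4):
    # Lower bound on the parallel efficiency h'(i, s) for 1 < i <= p,
    # with t = ceil(s / p) tasks per core in the worst case.
    t = ceil(s / p)
    return t / (t + (p - 1) / p)
```

For p = 4 this gives f(16) ≈ 0.84 but f(17) ≈ 0.87, so 17 is the smallest task count that clears the 85% threshold used in the slides.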
25 DL-BLAS overview
STEP 1. Problem: static scheduling is inefficient on personal computers. Our solution: tile the matrices into sub-matrices with some block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Our solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Our solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.
26 Searching space of block size
The basic idea is as follows: inside the well-parallelized area (17 tasks or more), the multi-thread speed-up rates are not so different, so we only search block size parameters in the range s_min ≤ n_b, k_b ≤ s_max, taking s_min = 0.6 × s_max. How do we search within this range?
27 Relation between block sizes and single-thread performance
[Heat map: exhaustive search results of a 1000 by 1000 square matrix multiplication on a Q6600 (Intel Core 2 Quad), single-thread performance over (n_b, k_b), with the highest-performance block size marked]
The actual trend matches the expectation: large block sizes (few tasks) give higher single-thread performance than small block sizes (many tasks).
28 Diagonal Searching Algorithm
Evaluating all parameters s_min ≤ n_b, k_b ≤ s_max needs much time. The Diagonal Searching Algorithm decreases the calculation complexity:
1. Evaluate all parameters on the diagonal s_min ≤ n_b = k_b ≤ s_max, and let j be the highest-performance block size.
2. Search the parameters with n_b = j or k_b = j.
3. Choose the highest-performance block size found in steps 1 and 2.
This algorithm returned the parameters which gave us the highest performance on many architectures (Intel Core 2 Extreme, Intel Core i7, AMD Phenom, AMD Phenom II).
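The three steps can be sketched like this (benchmark is a stand-in for timing a single-thread DGEMM with a given block size; this is an illustration, not the paper's code):

```python
def diagonal_search(s_min, s_max, benchmark):
    # Step 1: evaluate only the diagonal n_b == k_b and take the best j.
    j = max(range(s_min, s_max + 1), key=lambda b: benchmark(b, b))
    best, best_score = (j, j), benchmark(j, j)
    # Step 2: evaluate the cross n_b == j or k_b == j.
    for b in range(s_min, s_max + 1):
        for cand in ((j, b), (b, j)):
            score = benchmark(*cand)
            if score > best_score:
                best, best_score = cand, score
    # Step 3: return the best block size seen in steps 1 and 2.
    return best
```

This needs O(s_max − s_min) benchmark runs instead of the O((s_max − s_min)²) runs of an exhaustive search over the square.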
29 DL-BLAS overview
STEP 1. Problem: static scheduling is inefficient on personal computers. Our solution: tile the matrices into sub-matrices with some block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Our solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Our solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.
30 Relation between block size and problem size
The multi-thread speed-up rate changes when the problem size changes: the well-parallelized area for a larger problem extends to larger block sizes than the well-parallelized area for a smaller problem, while the single-thread performance curve stays the same.
31 Reductive Searching Algorithm
We want to get multiple [n_b, k_b] pairs, from a large parameter pair [n_1, k_1] down to small ones. Algorithm:
1. Initialize [s_min, s_max] to [150, 250], and a = 1.
2. In the range [s_min, s_max], run the Diagonal Searching Algorithm, and let the solution be [n_a, k_a].
3. If the single-thread calculation is faster than the multi-thread calculation, halt the algorithm.
4. Let [s_min, s_max] = [0.5 × n_a, 0.9 × n_a] and add [n_a, k_a] to the result-list.
5. a := a + 1 and go to step 2.
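The loop above can be sketched as follows (run_diagonal and single_thread_is_faster are stand-ins for the measurements described in the steps; the initial range [150, 250] follows the slide):

```python
def reductive_search(run_diagonal, single_thread_is_faster,
                     s_min=150, s_max=250):
    # run_diagonal(s_min, s_max) -> (n_a, k_a) is a stand-in for the
    # Diagonal Searching Algorithm; single_thread_is_faster(n_a, k_a)
    # is a stand-in for the stopping test in step 3.
    result_list = []
    while True:
        n_a, k_a = run_diagonal(s_min, s_max)
        if single_thread_is_faster(n_a, k_a):
            return result_list          # largest block sizes first
        result_list.append((n_a, k_a))
        s_min, s_max = int(0.5 * n_a), int(0.9 * n_a)
```

Because each new search window is [0.5 n_a, 0.9 n_a], the recorded block sizes shrink geometrically, so the loop terminates after a modest number of rounds.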
32 Parameter Selection
For each problem C := αAB + βC (C is m-by-n), we select one pair of parameters (n_a, k_a) from the result-list, which is sorted from the largest pair to the smallest, with the following algorithm:
for i = 1 to length(result-list):
  num-tasks = ⌈n / n_i⌉ × ⌈m / n_i⌉
  if num-tasks > 16: return (n_i, k_i)
return single-thread calculation
That is, we select the largest pair of parameters which makes more than 16 tasks.
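The selection loop, as a runnable sketch (result_list is assumed sorted largest-first, as produced by Reductive Searching; the names are illustrative):

```python
from math import ceil

def select_parameters(result_list, m, n):
    # Walk from the largest block size down and pick the first one that
    # still yields more than 16 tasks; None means "fall back to a
    # single-threaded calculation".
    for n_i, k_i in result_list:
        num_tasks = ceil(n / n_i) * ceil(m / n_i)
        if num_tasks > 16:
            return (n_i, k_i)
    return None
```

With a QX9650-style result-list, a 1000 by 1000 problem keeps the largest pair, while a 300 by 300 problem drops to a much smaller block size, matching the per-dimension tables shown later.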
33 Algorithms work-flow
Installation time: prepare the hardware and the BLAS routine, tune DL-BLAS using the Diagonal Searching Algorithm and the Reductive Searching Algorithm, and store the result-list on the disk.
Calculation time: when the DL-BLAS routines are called, get the result-list from the disk, decide the block size using the Parameter Selection Algorithm, and calculate.
34 Performance Evaluation Results
35 Algorithms work-flow (recap)
Installation time: prepare the hardware and the BLAS routine, tune DL-BLAS using the Diagonal Searching Algorithm and the Reductive Searching Algorithm, and store the result-list on the disk.
Calculation time: when the DL-BLAS routines are called, get the result-list from the disk, decide the block size using the Parameter Selection Algorithm, and calculate.
36 Hardware and problems
We used these CPU models: Intel Core 2 Extreme QX9650, Intel Core i7 965, Intel Atom 330, AMD Phenom 9600, AMD Phenom II X4 940.
First, we evaluate the Diagonal Searching Algorithm using 1000 by 1000 square matrix multiplications.
37 Results of Diagonal Searching Algorithm
Intel Core 2 Extreme (QX9650): [heat map of single-thread GFLOPS over (n_b, k_b)] the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
Intel Atom 330: the result of the Diagonal Searching Algorithm gives 1.69 GFLOPS, while the highest-performance parameters give 1.7 GFLOPS.
38 Results of Diagonal Searching Algorithm
Intel Core i7 965 with hyper-threading: the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
Intel Core i7 965 without hyper-threading: the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
39 Results of Diagonal Searching Algorithm
AMD Phenom 9600: the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
AMD Phenom II X4 940: the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
40 Results of Reductive Searching Algorithm
Intel Core 2 Extreme (QX9650). The result-list, from large to small, is: (224, 168), (168, 168), (112, 98), (56, 56), (48, 50), (24, 24), (16, 20), (12, 13). This calculation needs less than half an hour.
41 Results of Parameter Selection Algorithm
Intel Core 2 Extreme (QX9650). In square matrix multiplication, the following parameters (n_b, k_b) are used per dimension:
up to 48: single thread; 49~64: (12, 13); 65~96: (16, 20); 97~192: (24, 24); 193~224: (48, 50); 225~448: (56, 56); 449~672: (112, 98); 673~896: (168, 168); 897 and up: (224, 168).
42 Results of Reductive Searching Algorithm
AMD Phenom II X4 940. The result-list, from large to small, is: (206, 200), (172, 120), (132, 120), (104, 80), (92, 80), (64, 40), (56, 56), (41, 40). This calculation needs less than half an hour.
43 Results of Parameter Selection Algorithm
AMD Phenom II X4 940. In square matrix multiplication, the following parameters (n_b, k_b) are used per dimension:
up to 164: single thread; 165~224: (41, 40); 225~256: (56, 56); 257~368: (64, 40); 369~416: (92, 80); 417~528: (104, 80); 529~684: (132, 120); 685~824: (172, 120); 825 and up: (206, 200).
44 Evaluation machines
We solved square matrix multiplication problems with the following hardware and software. CPU models: Intel Core 2 Extreme QX9650, Intel Core i7 965, Intel Atom 330, AMD Phenom 9600, AMD Phenom II X4 940. BLAS implementations: DL-BLAS, ATLAS, and GotoBLAS (version 1.26).
45 Evaluation circumstances
We make the following three circumstances:
1. No other tasks are running (no-task case).
2. A busy-loop program runs concurrently with the DL-BLAS calculation (busy-loop case): while(true) { busy loop }.
3. An inner-product (dot-product) program, which accesses memory, runs concurrently (inner-product case): while(true) { ip = 0; for i = 1 to n: ip += a[i] * b[i] }.
46 DGEMM calculation on QX9650 (Intel Core 2 Extreme)
GotoBLAS is the fastest in the no-task case, but the slowest in the busy-loop and inner-product cases. DL-BLAS (with ATLAS) is faster than ATLAS in all cases, and is the fastest in the busy-loop and inner-product cases. [Graphs: performance (GFLOPS) vs. matrix size for the no-task, busy-loop, and inner-product cases]
47 DGEMM calculation on 965 (Intel Core i7) with hyper-threading
GotoBLAS is not so fast on this architecture; K. Goto, the developer of GotoBLAS, said that hyper-threading is harmful. ATLAS is the fastest in some cases: middle-sized problems in the busy-loop and inner-product cases. [Graphs: performance (GFLOPS) vs. matrix size for the three cases]
49 DGEMM calculation on 965 (Intel Core i7) without hyper-threading
GotoBLAS is the fastest in the no-task case. In the other cases the performance of GotoBLAS is unstable: the other tasks disturb GotoBLAS. [Graphs: performance (GFLOPS) vs. matrix size for the three cases]
50 DGEMM calculation on Intel Atom 330 with hyper-threading
DL-BLAS is faster in the busy-loop and inner-product cases. The performance of DL-BLAS is not faster than ATLAS in the no-task case; we consider that DL-BLAS is disturbed by hyper-threading. [Graphs: performance (GFLOPS) vs. matrix size for the three cases]
51 DGEMM calculation on AMD Phenom 9600
GotoBLAS is the fastest in the no-task case. DL-BLAS is the fastest for small problems in the busy-loop and inner-product cases; for large problems, the performances are not so different. [Graphs: performance (GFLOPS) vs. matrix size for the three cases]
52 DGEMM calculation on X4 940 (AMD Phenom II)
GotoBLAS is a little faster than ATLAS and DL-BLAS in the no-task case. The performance of DL-BLAS is similar to that of ATLAS. [Graphs: performance (GFLOPS) vs. matrix size for the three cases]
53 Conclusion
We implemented a BLAS implementation which uses dynamic load balancing; we call it DL-BLAS. Our implementation needs auto-tuning techniques to get the block size parameters; we call these techniques the Diagonal Searching Algorithm, the Reductive Searching Algorithm, and Parameter Selection. In some cases the performance of DL-BLAS was better than that of ATLAS and GotoBLAS, and the performance of DL-BLAS is consistently reasonable.
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationParallel Algorithms. 6/12/2012 6:42 PM gdeepak.com 1
Parallel Algorithms 1 Deliverables Parallel Programming Paradigm Matrix Multiplication Merge Sort 2 Introduction It involves using the power of multiple processors in parallel to increase the performance
More informationA GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang
A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang University of Massachusetts Amherst Introduction Singular Value Decomposition (SVD) A: m n matrix (m n) U, V: orthogonal
More informationAchieve Better Performance with PEAK on XSEDE Resources
Achieve Better Performance with PEAK on XSEDE Resources Haihang You, Bilel Hadri, Shirley Moore XSEDE 12 July 18 th 2012 Motivations FACTS ALTD ( Automatic Tracking Library Database ) ref Fahey, Jones,
More informationIssues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM
Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.
More informationDynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection
Numerical Libraries in the DOE ACTS Collection The DOE ACTS Collection SIAM Parallel Processing for Scientific Computing, Savannah, Georgia Feb 15, 2012 Tony Drummond Computational Research Division Lawrence
More informationA MATLAB Interface to the GPU
Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further
More informationLevel-3 BLAS on the TI C6678 multi-core DSP
Level-3 BLAS on the TI C6678 multi-core DSP Murtaza Ali, Eric Stotzer Texas Instruments {mali,estotzer}@ti.com Francisco D. Igual Dept. Arquitectura de Computadores y Automática Univ. Complutense de Madrid
More informationTechnical Report, CALDGEMM and HPL
Technical Report, CALDGEMM and HPL David Rohr, Matthias Kretz, Matthias Bach Frankfurt Institute for Advanced Studies, University of Frankfurt, Germany December 9, 21 Abstract The LOEWE-CSC cluster at
More informationBlocked Schur Algorithms for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012.
Blocked Schur Algorithms for Computing the Matrix Square Root Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui 2013 MIMS EPrint: 2012.26 Manchester Institute for Mathematical Sciences School of Mathematics
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationLet s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow.
Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Big problems and Very Big problems in Science How do we live Protein
More informationPerformance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors
Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors Mostafa I. Soliman and Fatma S. Ahmed Computers and Systems Section, Electrical Engineering Department Aswan Faculty of Engineering,
More informationAdministrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve
Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationWorkshop on High Performance Computing (HPC08) School of Physics, IPM February 16-21, 2008 HPC tools: an overview
Workshop on High Performance Computing (HPC08) School of Physics, IPM February 16-21, 2008 HPC tools: an overview Stefano Cozzini CNR/INFM Democritos and SISSA/eLab cozzini@democritos.it Agenda Tools for
More informationA Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection
A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF
More informationHigh-Performance Libraries and Tools. HPC Fall 2012 Prof. Robert van Engelen
High-Performance Libraries and Tools HPC Fall 2012 Prof. Robert van Engelen Overview Dense matrix BLAS (serial) ATLAS (serial/threaded) LAPACK (serial) Vendor-tuned LAPACK (shared memory parallel) ScaLAPACK/PLAPACK
More informationSome notes on efficient computing and high performance computing environments
Some notes on efficient computing and high performance computing environments Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public
More informationDealing with Asymmetry for Performance and Energy Efficiency
Dealing with Asymmetryfor Performance and Energy Efficiency Enrique S. QUINTANA-ORTÍ Motivation Moore s law is alive, but Dennard s scaling is over Motivation Welcome dark silicon and asymmetric architectures
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationNEW ADVANCES IN GPU LINEAR ALGEBRA
GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear
More informationContinuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting
Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting Animesh Jain, Michael A. Laurenzano, Lingjia Tang and Jason Mars International Symposium on Microarchitecture
More informationAn Overview of High Performance Computing and Challenges for the Future
An Overview of High Performance Computing and Challenges for the Future Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 6/15/2009 1 H. Meuer, H. Simon, E. Strohmaier,
More informationA GPU Sparse Direct Solver for AX=B
1 / 25 A GPU Sparse Direct Solver for AX=B Jonathan Hogg, Evgueni Ovtchinnikov, Jennifer Scott* STFC Rutherford Appleton Laboratory 26 March 2014 GPU Technology Conference San Jose, California * Thanks
More informationDAG-Scheduled Linear Algebra Using Template-Based Building Blocks
DAG-Scheduled Linear Algebra Using Template-Based Building Blocks Jonathan Hogg STFC Rutherford Appleton Laboratory 1 / 20 19 March 2015 GPU Technology Conference San Jose, California * Thanks also to
More informationFaster Code for Free: Linear Algebra Libraries. Advanced Research Compu;ng 22 Feb 2017
Faster Code for Free: Linear Algebra Libraries Advanced Research Compu;ng 22 Feb 2017 Outline Introduc;on Implementa;ons Using them Use on ARC systems Hands on session Conclusions Introduc;on 3 BLAS Level
More informationThe Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations.
TITLE Basic Linear Algebra Subprograms BYLINE Robert A. van de Geijn Department of Computer Science The University of Texas at Austin Austin, TX USA rvdg@cs.utexas.edu Kazushige Goto Texas Advanced Computing
More informationImplementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS
Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS Jianyu Huang, Leslie Rice Joint work with Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn BLIS Retreat 2016 *Overlook of
More informationPerformance impact of dynamic parallelism on different clustering algorithms
Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu
More informationLecture 13: March 25
CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationMICROPROCESSOR ARCHITECTURE
MICROPROCESSOR ARCHITECTURE UOP S.E.COMP (SEM-I) MULTICORE DESIGN Prof.P.C.Patil Department of Computer Engg Matoshri College of Engg.Nasik pcpatil18@gmail.com. History 2 History The most important part
More information1. INTRODUCTION. ACM Transactions on Embedded Computing Systems, Vol. -, No. -, Article -, Publication date: 2011.
- Exploiting Parallelism in Matrix-Computation Kernels for Symmetric Multiprocessor Systems Matrix-Multiplication and Matrix-Addition Algorithm Optimizations by Software Pipelining and Threads Allocation
More informationCOSC6365. Introduction to HPC. Lecture 21. Lennart Johnsson Department of Computer Science
Introduction to HPC Lecture 21 Department of Computer Science Most slides from UC Berkeley CS 267 Spring 2011, Lecture 12, Dense Linear Algebra (part 2), Parallel Gaussian Elimination. Jim Demmel Dense
More informationSummer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics
Summer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics Moysey Brio & Paul Dostert July 4, 2009 1 / 18 Sparse Matrices In many areas of applied mathematics and modeling, one
More informationCache Oblivious Dense and Sparse Matrix Multiplication Based on Peano Curves
Cache Oblivious Dense and Sparse Matrix Multiplication Based on eano Curves Michael Bader and Alexander Heinecke Institut für Informatik, Technische Universitüt München, Germany Abstract. Cache oblivious
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationEE/CSCI 451 Midterm 1
EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming
More informationTowards ABFT for BLIS GEMM
Towards ABFT for BLIS GEMM FLAME Working Note #76 Tyler M. Smith Robert A. van de Geijn Mikhail Smelyanskiy Enrique S. Quintana-Ortí Originally published June 13, 2015 Revised November 5, 2015 Abstract
More informationResources for parallel computing
Resources for parallel computing BLAS Basic linear algebra subprograms. Originally published in ACM Toms (1979) (Linpack Blas + Lapack). Implement matrix operations upto matrix-matrix multiplication and
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationHypercubes. (Chapter Nine)
Hypercubes (Chapter Nine) Mesh Shortcomings: Due to its simplicity and regular structure, the mesh is attractive, both theoretically and practically. A problem with the mesh is that movement of data is
More informationConcurrent Programming with OpenMP
Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 11, 2012 CPD (DEI / IST) Parallel and Distributed
More informationAlgorithms for Dense Matrices II
Algorithms for Dense Matrices II Stefan Lang Interdisciplinary Center for Scientific Computing (IWR) University of Heidelberg INF 368, Room 532 D-69120 Heidelberg phone: 06221/54-8264 email:stefan.lang@iwr.uni-heidelberg.de
More information