Parallelization of DQMC Simulations for Strongly Correlated Electron Systems

Size: px
Start display at page:

Download "Parallelization of DQMC Simulations for Strongly Correlated Electron Systems"

Transcription

1 Parallelization of DQMC Simulations for Strongly Correlated Electron Systems Che-Rung Lee Dept. of Computer Science National Tsing-Hua University Taiwan joint work with I-Hsin Chung (IBM Research), Zhaojun Bai (UCDavis) IEEE International Parallel and Distributed Processing Symposium 00 Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

2 Outline DQMC simulations DQMC parallelization Algorithmic approaches System approaches Experiment results Conclusion Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

3 Computational Material Science Understanding and exploiting the properties of solid-state materials: magnetism, metal-insulator transition, high temperature superconductivity,... A. Density B. Density fluctuations C. Spin correlations D. Pairing correlations Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

4 Hubbard Model and DQMC Simulations Many body simulation on multi-layer lattices using Hubbard model and quantum monte carlo method. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

5 Hubbard Model and DQMC Simulations Many body simulation on multi-layer lattices using Hubbard model and quantum monte carlo method. QUEST (QUantum Electron Simulation Toolbox): Fortran 90 package for Determinant Quantum Monte Carlo (DQMC) simulations. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

6 DQMC Algorithm Two stages: Random HS field Warmup stage Sampling stage A DQMC step Propose a local change: h h. Throw a random number 0 < r <. Accept the change if r < det(e βh(h ) ) det(e βh(h) ). warmup sampling DQMC step thermalized yes DQMC step Measurements enough samples yes Aggregation no no Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 5 /

7 Computational Kernels The equal time Green s function G k = (I + B k B k+ B B L B k ) Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 6 /

8 Computational Kernels The equal time Green s function G k = (I + B k B k+ B B L B k ) The unequal time Green s function I B G τ B I = B L I Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 6 /

9 Computational Kernels The equal time Green s function G k = (I + B k B k+ B B L B k ) The unequal time Green s function I B G τ B I = B L I Physical measurements Operations on G k and G τ, Fourier Transform, etc. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 6 /

10 Computational Challenges For simulating strongly correlated electron systems The size of lattices need be large. A longer warmup stage is required. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 7 /

11 Computational Challenges For simulating strongly correlated electron systems The size of lattices need be large. A longer warmup stage is required. Numerical stability issues. Additional stabilizing steps are required. Most calculations need double precision. Many fast updating methods and parallel algorithms cannot be used. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 7 /

12 DQMC Parallelization Algorithmic approaches Parallel Markov chain Rolling feeder algorithm Parallel matrix computations System approaches Task decomposition Communication and computation overlapping Message compression Load balance Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 8 /

13 DQMC step Measurements DQMC step Measurements DQMC step Measurements Parallel Markov Chain The sampling stage can be parallelized embarrassingly. warmup sampling Random HS field DQMC step thermalized yes no Aggregation Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 9 /

14 DQMC step Measurements DQMC step Measurements DQMC step Measurements Parallel Markov Chain The sampling stage can be parallelized embarrassingly. The speedup of parallelization is limited by the time of the warmup stage. (Amdahl s law) ρ speedup = T warmup + T sampling T warmup + T sampling /N p < T warmup + T sampling T warmup warmup sampling Random HS field DQMC step thermalized yes no Aggregation Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 9 /

15 Green s Function Calculation Matrix G k need be computed cyclically with B k updated. G = (I + B B B L B L ). Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

16 Green s Function Calculation Matrix G k need be computed cyclically with B k updated. G = (I + B B B L B L ). G = (I + B B B L B ). Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

17 Green s Function Calculation Matrix G k need be computed cyclically with B k updated. G = (I + B B B L B L ). G = (I + B B B L B ). G = (I + B B B B ). Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

18 Green s Function Calculation Matrix G k need be computed cyclically with B k updated. G = (I + B B B L B L ). G = (I + B B B L B ). G = (I + B B B B ). Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

19 Green s Function Calculation Matrix G k need be computed cyclically with B k updated. G = (I + B B B L B L ). G = (I + B B B L B ). G = (I + B B B B ). Parallel reduction (takes O(N log L) time.)... DQMC step Compute G DQMC step Compute G DQMC step Compute G... Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

20 Green s Function Calculation Matrix G k need be computed cyclically with B k updated. G = (I + B B B L B L ). G = (I + B B B L B ). G = (I + B B B B ). Parallel reduction (takes O(N log L) time.)... DQMC step Compute G DQMC step Compute G DQMC step Compute G... Numerically unstable! Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

21 Rolling Feeder Algorithm The matrix product can be stably computed sequentially. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

22 Rolling Feeder Algorithm The matrix product can be stably computed sequentially.... DQMC step DQMC step DQMC step... Compute G Compute G Compute G Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

23 Rolling Feeder Algorithm The matrix product can be stably computed sequentially.... DQMC step DQMC step DQMC step... Compute G Compute G Compute G Tasks to get one G k Sequential Parallel reduction Rolling feeder. Matrix multiplication L log L. Stabilization step O(L) O(log L). Inverting (I + B... B L ). Data transmission N O(LN ) N Comparisons on resources and stability Processor O() O(L) O(L) Numerically stable Y N Y Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

24 Parallel Matrix Computations Two matrix computation kernels are parallelized. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

25 Parallel Matrix Computations Two matrix computation kernels are parallelized. The unequal time Green s function is computed by blocks in parallel (I +B k B B L B k+ ) B k B l+ k >l Gk,l τ = (I +B k B B L B k+ ) k =l (I +B k B k+ ) B k B B L B l+ k <l Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

26 Parallel Matrix Computations Two matrix computation kernels are parallelized. The unequal time Green s function is computed by blocks in parallel (I +B k B B L B k+ ) B k B l+ k >l Gk,l τ = (I +B k B B L B k+ ) k =l (I +B k B k+ ) B k B B L B l+ k <l The matrix-matrix multiplication of G k and each block matrix of G τ is speeded up using multicore. The matrix size of G k, , is too small such that the matrix computation cannot be benefited by using MPI-style parallelization. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

27 System Design The system contains several simulators for parallel Markov chain. Each simulator consists of a walker and a M-server. MC walker DQMC steps M-server Physical measurements Feeder Feeder Equal time measurator Feeder Iterator HS field GC GC GC Feeder Feeder Unequal time measurator Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

28 Implementation Techniques System is implemented for hybrid systems (cluster+multicore) Task MPI OpenMP Comm/comp Message Load overlapping compression balance Parallel Markov chain Rolling feeder algorithm Unequal time Green s fn Physical measure- ment Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

29 Communication/Computation Overlapping Iterator... HS field for time slice 0 G 0 Feeder Multiply B 0 and compute G 0 Iterator... Get G 0 by FUA HS field for time slice 0 MC iteration MC starts here iterates on G 0 G 0 Feeder Multiply B 0 and compute G 0 MC iterates on G 0 MC iteration starts here HS field for time slice Without overlapping Get G 0 and convert it to G by FUA MC iterates on G HS field for time slice Multiply B Using fast update algorithm (FUA) to reduce waiting time Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 5 /

30 Load Balance Iterators are fully occupied the bottleneck of speedup. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 6 /

31 Load Balance Iterators are fully occupied the bottleneck of speedup. Processor utilization can be enhanced by merging tasks. For example, when computing unequal time Green s function, each processor can take care of more than one block submatrix. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 6 /

32 Load Balance Iterators are fully occupied the bottleneck of speedup. Processor utilization can be enhanced by merging tasks. For example, when computing unequal time Green s function, each processor can take care of more than one block submatrix. The load balance problem: how many block submatrices should one processor compute? Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 6 /

33 Load Balance Iterators are fully occupied the bottleneck of speedup. Processor utilization can be enhanced by merging tasks. For example, when computing unequal time Green s function, each processor can take care of more than one block submatrix. The load balance problem: how many block submatrices should one processor compute? Using the queueing theory (Little s law) to estimate. P n C = max. P λt λt λ: arrival rate; T : processing time; P: processor utilization. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 6 /

34 System and Benchmark System Run on the IBM Blue Gene/P Each compute node is equipped with 850MHz PowerPC 50 quad-core processor and GB memory. IBM XL compilers with IBM BLAS and LAPACK libraries. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 7 /

35 System and Benchmark System Run on the IBM Blue Gene/P Each compute node is equipped with 850MHz PowerPC 50 quad-core processor and GB memory. IBM XL compilers with IBM BLAS and LAPACK libraries. Benchmark DQMC simulation on a two-dimensional periodic lattice. The lattice size is N = 6 6 = 56. The ratio of DQMC steps for the warmup stage and the sampling stages is : 0. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 7 /

36 Communication Pattern Iterator Feeder. Feeder. Feeder. Feeder. Feeder.5 Feeder.6 Measurator Iterator Feeder. Feeder. Feeder. Feeder. Feeder.5 Feeder.6 Measurator Green bands show the waiting time of MPI RECV. Iterators are fully occupied after started. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 8 /

37 Speedup for Different L Speedup np Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 9 /

38 Effect of Load Balance (L = 96) Speedup np nc: number of block submatrices computed per processor. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

39 Summary DQMC simulation for strongly correlated materials is a computationally intensive task, which is eager for parallelization. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

40 Summary DQMC simulation for strongly correlated materials is a computationally intensive task, which is eager for parallelization. We targeted the hybrid massive parallel systems, and explored the parallelism of DQMC simulations on different levels of granularity. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

41 Summary DQMC simulation for strongly correlated materials is a computationally intensive task, which is eager for parallelization. We targeted the hybrid massive parallel systems, and explored the parallelism of DQMC simulations on different levels of granularity. Our implementation shows over 80x speedup on thousand processors, which is much better than embarrassing parallelization (speedup < ). Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

42 Future Works More fine-grain parallel matrix computation kernels (pivoted QR, QR, matrix inversion) to fully utilize the computational power of multicores. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

43 Future Works More fine-grain parallel matrix computation kernels (pivoted QR, QR, matrix inversion) to fully utilize the computational power of multicores. Better system design to enhance the processor utilization. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

44 Future Works More fine-grain parallel matrix computation kernels (pivoted QR, QR, matrix inversion) to fully utilize the computational power of multicores. Better system design to enhance the processor utilization. Different physics models and methods. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

45 Future Works More fine-grain parallel matrix computation kernels (pivoted QR, QR, matrix inversion) to fully utilize the computational power of multicores. Better system design to enhance the processor utilization. Different physics models and methods. Code is still in the experimental stage. Further development is required for practical use. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

Parallelization of DQMC Simulation for Strongly Correlated Electron Systems

Parallelization of DQMC Simulation for Strongly Correlated Electron Systems Parallelization of DQMC Simulation for Strongly Correlated Electron Systems Che-Rung Lee Dept. of Computer Science National Tsing-Hua University Hsin-Chu, Taiwan cherung@cs.nthu.edu.tw I-Hsin Chung IBM

More information

Mixed Granularity Parallelization Scheme for Determinant. Quantum Monte Carlo Simulations

Mixed Granularity Parallelization Scheme for Determinant. Quantum Monte Carlo Simulations Mixed Granularity Parallelization Scheme for Determinant Quantum Monte Carlo Simulations Che-Rung Lee I-Hsin Chung Zhaojun Bai April 13, 2009 Abstract With the ability to reveal the macroscopic properties

More information

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M

More information

Dense LU Factorization

Dense LU Factorization Dense LU Factorization Dr.N.Sairam & Dr.R.Seethalakshmi School of Computing, SASTRA Univeristy, Thanjavur-613401. Joint Initiative of IITs and IISc Funded by MHRD Page 1 of 6 Contents 1. Dense LU Factorization...

More information

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

QUEST User s Manual. April, 2012

QUEST User s Manual. April, 2012 QUEST User s Manual April, 2012 Contents 1 Introduction 2 2 Building QUEST 2 2.1 Basic install...................................... 2 2.2 Multicore processor and GPU support....................... 4 2.3

More information

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report ESPRESO ExaScale PaRallel FETI Solver Hybrid FETI Solver Report Lubomir Riha, Tomas Brzobohaty IT4Innovations Outline HFETI theory from FETI to HFETI communication hiding and avoiding techniques our new

More information

Parallel Computing Why & How?

Parallel Computing Why & How? Parallel Computing Why & How? Xing Cai Simula Research Laboratory Dept. of Informatics, University of Oslo Winter School on Parallel Computing Geilo January 20 25, 2008 Outline 1 Motivation 2 Parallel

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Overpartioning with the Rice dhpf Compiler

Overpartioning with the Rice dhpf Compiler Overpartioning with the Rice dhpf Compiler Strategies for Achieving High Performance in High Performance Fortran Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/hug00overpartioning.pdf

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

Benchmark 1.a Investigate and Understand Designated Lab Techniques The student will investigate and understand designated lab techniques.

Benchmark 1.a Investigate and Understand Designated Lab Techniques The student will investigate and understand designated lab techniques. I. Course Title Parallel Computing 2 II. Course Description Students study parallel programming and visualization in a variety of contexts with an emphasis on underlying and experimental technologies.

More information

Patterns for! Parallel Programming!

Patterns for! Parallel Programming! Lecture 4! Patterns for! Parallel Programming! John Cavazos! Dept of Computer & Information Sciences! University of Delaware!! www.cis.udel.edu/~cavazos/cisc879! Lecture Overview Writing a Parallel Program

More information

Patterns for! Parallel Programming II!

Patterns for! Parallel Programming II! Lecture 4! Patterns for! Parallel Programming II! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! Task Decomposition Also known as functional

More information

Review of previous examinations TMA4280 Introduction to Supercomputing

Review of previous examinations TMA4280 Introduction to Supercomputing Review of previous examinations TMA4280 Introduction to Supercomputing NTNU, IMF April 24. 2017 1 Examination The examination is usually comprised of: one problem related to linear algebra operations with

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography 1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography

More information

On Constructing Lower Power and Robust Clock Tree via Slew Budgeting

On Constructing Lower Power and Robust Clock Tree via Slew Budgeting 1 On Constructing Lower Power and Robust Clock Tree via Slew Budgeting Yeh-Chi Chang, Chun-Kai Wang and Hung-Ming Chen Dept. of EE, National Chiao Tung University, Taiwan 2012 年 3 月 29 日 Outline 2 Motivation

More information

BIG CPU, BIG DATA. Solving the World s Toughest Computational Problems with Parallel Computing. Alan Kaminsky

BIG CPU, BIG DATA. Solving the World s Toughest Computational Problems with Parallel Computing. Alan Kaminsky BIG CPU, BIG DATA Solving the World s Toughest Computational Problems with Parallel Computing Alan Kaminsky Department of Computer Science B. Thomas Golisano College of Computing and Information Sciences

More information

Quantum ESPRESSO on GPU accelerated systems

Quantum ESPRESSO on GPU accelerated systems Quantum ESPRESSO on GPU accelerated systems Massimiliano Fatica, Everett Phillips, Josh Romero - NVIDIA Filippo Spiga - University of Cambridge/ARM (UK) MaX International Conference, Trieste, Italy, January

More information

Queuing Networks. Renato Lo Cigno. Simulation and Performance Evaluation Queuing Networks - Renato Lo Cigno 1

Queuing Networks. Renato Lo Cigno. Simulation and Performance Evaluation Queuing Networks - Renato Lo Cigno 1 Queuing Networks Renato Lo Cigno Simulation and Performance Evaluation 2014-15 Queuing Networks - Renato Lo Cigno 1 Moving between Queues Queuing Networks - Renato Lo Cigno - Interconnecting Queues 2 Moving

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

The Icosahedral Nonhydrostatic (ICON) Model

The Icosahedral Nonhydrostatic (ICON) Model The Icosahedral Nonhydrostatic (ICON) Model Scalability on Massively Parallel Computer Architectures Florian Prill, DWD + the ICON team 15th ECMWF Workshop on HPC in Meteorology October 2, 2012 ICON =

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. ethz.ch P. Arbenz Institute for Scientific Computing Department Informatik,

More information

Brief notes on setting up semi-high performance computing environments. July 25, 2014

Brief notes on setting up semi-high performance computing environments. July 25, 2014 Brief notes on setting up semi-high performance computing environments July 25, 2014 1 We have two different computing environments for fitting demanding models to large space and/or time data sets. 1

More information

Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance

Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Jonathan Lifflander*, Esteban Meneses, Harshitha Menon*, Phil Miller*, Sriram Krishnamoorthy, Laxmikant V. Kale* jliffl2@illinois.edu,

More information

Introduction to Matlab GPU Acceleration for. Computational Finance. Chuan- Hsiang Han 1. Section 1: Introduction

Introduction to Matlab GPU Acceleration for. Computational Finance. Chuan- Hsiang Han 1. Section 1: Introduction Introduction to Matlab GPU Acceleration for Computational Finance Chuan- Hsiang Han 1 Abstract: This note aims to introduce the concept of GPU computing in Matlab and demonstrates several numerical examples

More information

Optimizing and Accelerating Your MATLAB Code

Optimizing and Accelerating Your MATLAB Code Optimizing and Accelerating Your MATLAB Code Sofia Mosesson Senior Application Engineer 2016 The MathWorks, Inc. 1 Agenda Optimizing for loops and using vector and matrix operations Indexing in different

More information

Analysis of Matrix Multiplication Computational Methods

Analysis of Matrix Multiplication Computational Methods European Journal of Scientific Research ISSN 1450-216X / 1450-202X Vol.121 No.3, 2014, pp.258-266 http://www.europeanjournalofscientificresearch.com Analysis of Matrix Multiplication Computational Methods

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Parallel Programming. Presentation to Linux Users of Victoria, Inc. November 4th, 2015

Parallel Programming. Presentation to Linux Users of Victoria, Inc. November 4th, 2015 Parallel Programming Presentation to Linux Users of Victoria, Inc. November 4th, 2015 http://levlafayette.com 1.0 What Is Parallel Programming? 1.1 Historically, software has been written for serial computation

More information

Bayesian Spatiotemporal Modeling with Hierarchical Spatial Priors for fmri

Bayesian Spatiotemporal Modeling with Hierarchical Spatial Priors for fmri Bayesian Spatiotemporal Modeling with Hierarchical Spatial Priors for fmri Galin L. Jones 1 School of Statistics University of Minnesota March 2015 1 Joint with Martin Bezener and John Hughes Experiment

More information

Parallel Hybrid Monte Carlo Algorithms for Matrix Computations

Parallel Hybrid Monte Carlo Algorithms for Matrix Computations Parallel Hybrid Monte Carlo Algorithms for Matrix Computations V. Alexandrov 1, E. Atanassov 2, I. Dimov 2, S.Branford 1, A. Thandavan 1 and C. Weihrauch 1 1 Department of Computer Science, University

More information

Tools and Primitives for High Performance Graph Computation

Tools and Primitives for High Performance Graph Computation Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World

More information

GPU COMPUTING WITH MSC NASTRAN 2013

GPU COMPUTING WITH MSC NASTRAN 2013 SESSION TITLE WILL BE COMPLETED BY MSC SOFTWARE GPU COMPUTING WITH MSC NASTRAN 2013 Srinivas Kodiyalam, NVIDIA, Santa Clara, USA THEME Accelerated computing with GPUs SUMMARY Current trends in HPC (High

More information

Hierarchical DAG Scheduling for Hybrid Distributed Systems

Hierarchical DAG Scheduling for Hybrid Distributed Systems June 16, 2015 Hierarchical DAG Scheduling for Hybrid Distributed Systems Wei Wu, Aurelien Bouteiller, George Bosilca, Mathieu Faverge, Jack Dongarra IPDPS 2015 Outline! Introduction & Motivation! Hierarchical

More information

BIG CPU, BIG DATA. Solving the World s Toughest Computational Problems with Parallel Computing Second Edition. Alan Kaminsky

BIG CPU, BIG DATA. Solving the World s Toughest Computational Problems with Parallel Computing Second Edition. Alan Kaminsky Solving the World s Toughest Computational Problems with Parallel Computing Second Edition Alan Kaminsky Solving the World s Toughest Computational Problems with Parallel Computing Second Edition Alan

More information

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center 3/17/2015 2014 IBM Corporation Outline IBM OpenPower Platform Accelerating

More information

Highly Scalable Parallel Sorting. Edgar Solomonik and Laxmikant Kale University of Illinois at Urbana-Champaign April 20, 2010

Highly Scalable Parallel Sorting. Edgar Solomonik and Laxmikant Kale University of Illinois at Urbana-Champaign April 20, 2010 Highly Scalable Parallel Sorting Edgar Solomonik and Laxmikant Kale University of Illinois at Urbana-Champaign April 20, 2010 1 Outline Parallel sorting background Histogram Sort overview Histogram Sort

More information

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Rolf Rabenseifner rabenseifner@hlrs.de Gerhard Wellein gerhard.wellein@rrze.uni-erlangen.de University of Stuttgart

More information

Principles of Parallel Algorithm Design: Concurrency and Decomposition

Principles of Parallel Algorithm Design: Concurrency and Decomposition Principles of Parallel Algorithm Design: Concurrency and Decomposition John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 2 12 January 2017 Parallel

More information

Parallel Computing. Parallel Algorithm Design

Parallel Computing. Parallel Algorithm Design Parallel Computing Parallel Algorithm Design Task/Channel Model Parallel computation = set of tasks Task Program Local memory Collection of I/O ports Tasks interact by sending messages through channels

More information

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

A N-dimensional Stochastic Control Algorithm for Electricity Asset Management on PC cluster and Blue Gene Supercomputer

A N-dimensional Stochastic Control Algorithm for Electricity Asset Management on PC cluster and Blue Gene Supercomputer A N-dimensional Stochastic Control Algorithm for Electricity Asset Management on PC cluster and Blue Gene Supercomputer Stéphane Vialle, Xavier Warin, Patrick Mercier To cite this version: Stéphane Vialle,

More information

Parallel Programming Concepts. Parallel Algorithms. Peter Tröger

Parallel Programming Concepts. Parallel Algorithms. Peter Tröger Parallel Programming Concepts Parallel Algorithms Peter Tröger Sources: Ian Foster. Designing and Building Parallel Programs. Addison-Wesley. 1995. Mattson, Timothy G.; S, Beverly A.; ers,; Massingill,

More information

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Lifan Xu Wei Wang Marco A. Alvarez John Cavazos Dongping Zhang Department of Computer and Information Science University of Delaware

More information

Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors

Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures

More information

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency 1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming

More information

Scientific Programming in C XIV. Parallel programming

Scientific Programming in C XIV. Parallel programming Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence

More information

Direct Numerical Simulation of Turbulent Boundary Layers at High Reynolds Numbers.

Direct Numerical Simulation of Turbulent Boundary Layers at High Reynolds Numbers. Direct Numerical Simulation of Turbulent Boundary Layers at High Reynolds Numbers. G. Borrell, J.A. Sillero and J. Jiménez, Corresponding author: guillem@torroja.dmt.upm.es School of Aeronautics, Universidad

More information

Parallel Reduction from Block Hessenberg to Hessenberg using MPI

Parallel Reduction from Block Hessenberg to Hessenberg using MPI Parallel Reduction from Block Hessenberg to Hessenberg using MPI Viktor Jonsson May 24, 2013 Master s Thesis in Computing Science, 30 credits Supervisor at CS-UmU: Lars Karlsson Examiner: Fredrik Georgsson

More information

High Performance Computing (HPC) Introduction

High Performance Computing (HPC) Introduction High Performance Computing (HPC) Introduction Ontario Summer School on High Performance Computing Scott Northrup SciNet HPC Consortium Compute Canada June 25th, 2012 Outline 1 HPC Overview 2 Parallel Computing

More information

Hyperspectral Unmixing on GPUs and Multi-Core Processors: A Comparison

Hyperspectral Unmixing on GPUs and Multi-Core Processors: A Comparison Hyperspectral Unmixing on GPUs and Multi-Core Processors: A Comparison Dept. of Mechanical and Environmental Informatics Kimura-Nakao lab. Uehara Daiki Today s outline 1. Self-introduction 2. Basics of

More information

Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow.

Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Big problems and Very Big problems in Science How do we live Protein

More information

Workloads Programmierung Paralleler und Verteilter Systeme (PPV)

Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

CS377P Programming for Performance Multicore Performance Cache Coherence

CS377P Programming for Performance Multicore Performance Cache Coherence CS377P Programming for Performance Multicore Performance Cache Coherence Sreepathi Pai UTCS October 26, 2015 Outline 1 Cache Coherence 2 Cache Coherence Awareness 3 Scalable Lock Design 4 Transactional

More information

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture Chapter 2 Note: The slides being presented represent a mix. Some are created by Mark Franklin, Washington University in St. Louis, Dept. of CSE. Many are taken from the Patterson & Hennessy book, Computer

More information

Subtleties in the Monte Carlo simulation of lattice polymers

Subtleties in the Monte Carlo simulation of lattice polymers Subtleties in the Monte Carlo simulation of lattice polymers Nathan Clisby MASCOS, University of Melbourne July 9, 2010 Outline Critical exponents for SAWs. The pivot algorithm. Improving implementation

More information

High-Performance and Parallel Computing

High-Performance and Parallel Computing 9 High-Performance and Parallel Computing 9.1 Code optimization To use resources efficiently, the time saved through optimizing code has to be weighed against the human resources required to implement

More information

Introduction to Optimization

Introduction to Optimization Introduction to Optimization Second Order Optimization Methods Marc Toussaint U Stuttgart Planned Outline Gradient-based optimization (1st order methods) plain grad., steepest descent, conjugate grad.,

More information

The HiCFD Project A joint initiative from DLR, T-Systems SfR and TU Dresden with Airbus, MTU and IBM as associated partners

The HiCFD Project A joint initiative from DLR, T-Systems SfR and TU Dresden with Airbus, MTU and IBM as associated partners The HiCFD Project A joint initiative from DLR, T-Systems SfR and TU Dresden with Airbus, MTU and IBM as associated partners Thomas Alrutz, Christian Simmendinger T-Systems Solutions for Research GmbH 02.06.2009

More information

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,

More information

Parallel computation performances of Serpent and Serpent 2 on KTH Parallel Dator Centrum

Parallel computation performances of Serpent and Serpent 2 on KTH Parallel Dator Centrum KTH ROYAL INSTITUTE OF TECHNOLOGY, SH2704, 9 MAY 2018 1 Parallel computation performances of Serpent and Serpent 2 on KTH Parallel Dator Centrum Belle Andrea, Pourcelot Gregoire Abstract The aim of this

More information

Parallelization Strategy

Parallelization Strategy COSC 335 Software Design Parallel Design Patterns (II) Spring 2008 Parallelization Strategy Finding Concurrency Structure the problem to expose exploitable concurrency Algorithm Structure Supporting Structure

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

Key Technologies for 100 PFLOPS. Copyright 2014 FUJITSU LIMITED

Key Technologies for 100 PFLOPS. Copyright 2014 FUJITSU LIMITED Key Technologies for 100 PFLOPS How to keep the HPC-tree growing Molecular dynamics Computational materials Drug discovery Life-science Quantum chemistry Eigenvalue problem FFT Subatomic particle phys.

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013 GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»

More information

Parallel Computing Basics, Semantics

Parallel Computing Basics, Semantics 1 / 15 Parallel Computing Basics, Semantics Landau s 1st Rule of Education Rubin H Landau Sally Haerer, Producer-Director Based on A Survey of Computational Physics by Landau, Páez, & Bordeianu with Support

More information

A GPU Sparse Direct Solver for AX=B

A GPU Sparse Direct Solver for AX=B 1 / 25 A GPU Sparse Direct Solver for AX=B Jonathan Hogg, Evgueni Ovtchinnikov, Jennifer Scott* STFC Rutherford Appleton Laboratory 26 March 2014 GPU Technology Conference San Jose, California * Thanks

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction

More information

Data-intensive computing in radiative transfer modelling

Data-intensive computing in radiative transfer modelling German Aerospace Center (DLR) Remote Sensing Technology Institute (IMF) Data-intensive computing in radiative transfer modelling Dmitry Efremenko Diego Loyola Adrian Doicu Thomas Trautmann Dmitry.Efremenko@dlr.de

More information

Co-array Fortran Performance and Potential: an NPB Experimental Study. Department of Computer Science Rice University

Co-array Fortran Performance and Potential: an NPB Experimental Study. Department of Computer Science Rice University Co-array Fortran Performance and Potential: an NPB Experimental Study Cristian Coarfa Jason Lee Eckhardt Yuri Dotsenko John Mellor-Crummey Department of Computer Science Rice University Parallel Programming

More information

Computer Performance Evaluation: Cycles Per Instruction (CPI)

Computer Performance Evaluation: Cycles Per Instruction (CPI) Computer Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate: where: Clock rate = 1 / clock cycle A computer machine

More information

COSC 6374 Parallel Computation. Parallel Design Patterns. Edgar Gabriel. Fall Design patterns

COSC 6374 Parallel Computation. Parallel Design Patterns. Edgar Gabriel. Fall Design patterns COSC 6374 Parallel Computation Parallel Design Patterns Fall 2014 Design patterns A design pattern is a way of reusing abstract knowledge about a problem and its solution Patterns are devices that allow

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Subject Index. Journal of Discrete Algorithms 5 (2007)

Subject Index. Journal of Discrete Algorithms 5 (2007) Journal of Discrete Algorithms 5 (2007) 751 755 www.elsevier.com/locate/jda Subject Index Ad hoc and wireless networks Ad hoc networks Admission control Algorithm ; ; A simple fast hybrid pattern-matching

More information

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank Page rank computation HPC course project a.y. 2012-13 Compute efficient and scalable Pagerank 1 PageRank PageRank is a link analysis algorithm, named after Brin & Page [1], and used by the Google Internet

More information

Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes

Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes Wayne Pfeiffer (SDSC/UCSD) & Alexandros Stamatakis (TUM) February 25, 2010 What was done? Why is it important? Who cares? Hybrid MPI/OpenMP

More information

Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla

Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla SIAM PP 2016, April 13 th 2016 Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer,

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

1. INTRODUCTION. Keywords: multicore, openmp, parallel programming, speedup.

1. INTRODUCTION. Keywords: multicore, openmp, parallel programming, speedup. Parallel Algorithm Performance Analysis using OpenMP for Multicore Machines Mustafa B 1, Waseem Ahmed. 2 1 Department of CSE,BIT,Mangalore, mbasthik@gmail.com 2 Department of CSE, HKBKCE,Bangalore, waseem.pace@gmail.com

More information

Big Orange Bramble. August 09, 2016

Big Orange Bramble. August 09, 2016 Big Orange Bramble August 09, 2016 Overview HPL SPH PiBrot Numeric Integration Parallel Pi Monte Carlo FDS DANNA HPL High Performance Linpack is a benchmark for clusters Created here at the University

More information

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Andrés Tomás 1, Zhaojun Bai 1, and Vicente Hernández 2 1 Department of Computer

More information

The Fusion Distributed File System

The Fusion Distributed File System Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique

More information

Modern GPUs (Graphics Processing Units)

Modern GPUs (Graphics Processing Units) Modern GPUs (Graphics Processing Units) Powerful data parallel computation platform. High computation density, high memory bandwidth. Relatively low cost. NVIDIA GTX 580 512 cores 1.6 Tera FLOPs 1.5 GB

More information

Computational Methods in Statistics with Applications A Numerical Point of View. Large Data Sets. L. Eldén. March 2016

Computational Methods in Statistics with Applications A Numerical Point of View. Large Data Sets. L. Eldén. March 2016 Computational Methods in Statistics with Applications A Numerical Point of View L. Eldén SeSe March 2016 Large Data Sets IDA Machine Learning Seminars, September 17, 2014. Sequential Decision Making: Experiment

More information

Parallel Programming with MPI and OpenMP

Parallel Programming with MPI and OpenMP Parallel Programming with MPI and OpenMP Michael J. Quinn Chapter 6 Floyd s Algorithm Chapter Objectives Creating 2-D arrays Thinking about grain size Introducing point-to-point communications Reading

More information

Outline. CSC 447: Parallel Programming for Multi- Core and Cluster Systems

Outline. CSC 447: Parallel Programming for Multi- Core and Cluster Systems CSC 447: Parallel Programming for Multi- Core and Cluster Systems Performance Analysis Instructor: Haidar M. Harmanani Spring 2018 Outline Performance scalability Analytical performance measures Amdahl

More information

BIG CPU, BIG DATA. Solving the World s Toughest Computational Problems with Parallel Computing. Second Edition. Alan Kaminsky

BIG CPU, BIG DATA. Solving the World s Toughest Computational Problems with Parallel Computing. Second Edition. Alan Kaminsky Solving the World s Toughest Computational Problems with Parallel Computing Second Edition Alan Kaminsky Department of Computer Science B. Thomas Golisano College of Computing and Information Sciences

More information

MUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA

MUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA The MUMPS library Abdou Guermouche and MUMPS team, Univ. Bordeaux 1 and INRIA June 22-24, 2010 MUMPS Outline MUMPS status Recently added features MUMPS and multicores? Memory issues GPU computing Future

More information

Adaptive Transpose Algorithms for Distributed Multicore Processors

Adaptive Transpose Algorithms for Distributed Multicore Processors Adaptive Transpose Algorithms for Distributed Multicore Processors John C. Bowman and Malcolm Roberts University of Alberta and Université de Strasbourg April 15, 2016 www.math.ualberta.ca/ bowman/talks

More information