Parallelization of DQMC Simulations for Strongly Correlated Electron Systems
Parallelization of DQMC Simulations for Strongly Correlated Electron Systems
Che-Rung Lee, Dept. of Computer Science, National Tsing-Hua University, Taiwan
Joint work with I-Hsin Chung (IBM Research) and Zhaojun Bai (UC Davis)
IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2010
Outline
- DQMC simulations
- DQMC parallelization: algorithmic approaches and system approaches
- Experiment results
- Conclusion
Computational Material Science
Understanding and exploiting the properties of solid-state materials: magnetism, metal-insulator transition, high-temperature superconductivity, ...
[Figure panels: A. Density, B. Density fluctuations, C. Spin correlations, D. Pairing correlations]
Hubbard Model and DQMC Simulations
Many-body simulation on multi-layer lattices using the Hubbard model and the quantum Monte Carlo method.
QUEST (QUantum Electron Simulation Toolbox): a Fortran 90 package for Determinant Quantum Monte Carlo (DQMC) simulations.
DQMC Algorithm
Two stages, both built from DQMC steps over a random Hubbard-Stratonovich (HS) field: a warmup stage and a sampling stage.
A DQMC step:
- Propose a local change h -> h'.
- Draw a random number 0 < r < 1.
- Accept the change if r < det(e^{-βH(h')}) / det(e^{-βH(h)}).
The warmup stage repeats DQMC steps until the system is thermalized; the sampling stage alternates DQMC steps with measurements until enough samples are collected, then aggregates the results.
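The accept/reject rule above can be sketched as a toy Metropolis step. The `weight` function here is a stand-in for the configuration weight det(e^{-βH(h)}); a real DQMC code never forms that determinant explicitly, and the names below are illustrative only:

```python
import numpy as np

def metropolis_step(h, weight, rng):
    """One DQMC-style Metropolis step (toy sketch).

    h      : HS field configuration (1-D array of +/-1 entries)
    weight : function returning a positive configuration weight,
             standing in for det(exp(-beta*H(h)))
    """
    i = rng.integers(len(h))           # propose a local change: flip one site
    h_new = h.copy()
    h_new[i] = -h_new[i]
    r = rng.random()                   # random number 0 < r < 1
    if r < weight(h_new) / weight(h):  # accept if r < weight ratio
        return h_new, True
    return h, False

# Toy weight favoring aligned fields (not the real fermion determinant).
weight = lambda h: np.exp(0.1 * h.sum())

rng = np.random.default_rng(0)
h = rng.choice([-1.0, 1.0], size=16)
h, accepted = metropolis_step(h, weight, rng)
```

Repeating this step many times drives the field toward the target distribution, which is exactly what the warmup stage does before sampling begins.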
Computational Kernels
The equal-time Green's function:

    G_k = (I + B_k B_{k+1} ... B_L B_1 ... B_{k-1})^{-1}

The unequal-time Green's function G^τ is the inverse of an L x L block matrix with identity blocks I on the diagonal, the B_k on the first block subdiagonal, and B_1 in the upper-right corner:

    [ I           B_1 ]
    [ B_2  I          ]  G^τ = I
    [      ...  ...   ]
    [      B_L    I   ]

Physical measurements: operations on G_k and G^τ, Fourier transforms, etc.
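For a small system the equal-time Green's function can be formed directly from its definition. A dense numpy sketch (the naive cyclic product and inverse are exactly the numerically fragile operations the later slides address; this is for illustration, not QUEST's implementation):

```python
import numpy as np

def equal_time_green(Bs, k):
    """G_k = (I + B_k B_{k+1} ... B_L B_1 ... B_{k-1})^{-1}, dense direct form.

    Bs is a 0-indexed list [B_1, ..., B_L]; k is 0-indexed as well.
    """
    L = len(Bs)
    n = Bs[0].shape[0]
    order = [(k + j) % L for j in range(L)]  # cyclic order k, k+1, ..., k-1
    prod = np.eye(n)
    for idx in order:
        prod = prod @ Bs[idx]
    return np.linalg.inv(np.eye(n) + prod)

rng = np.random.default_rng(1)
Bs = [np.eye(3) + 0.1 * rng.standard_normal((3, 3)) for _ in range(4)]
G0 = equal_time_green(Bs, 0)
```

The cyclic reordering is why G_k must be recomputed as the simulation sweeps through time slices: each k rotates the product by one factor.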
Computational Challenges
For simulating strongly correlated electron systems:
- The lattice size needs to be large.
- A longer warmup stage is required.
- Numerical stability issues: additional stabilization steps are required, most calculations need double precision, and many fast updating methods and parallel algorithms cannot be used.
DQMC Parallelization
Algorithmic approaches: parallel Markov chain, rolling feeder algorithm, parallel matrix computations.
System approaches: task decomposition, communication and computation overlapping, message compression, load balance.
Parallel Markov Chain
The sampling stage can be parallelized embarrassingly: after warmup, each processor runs an independent chain of DQMC steps and measurements, and the samples are aggregated at the end.
The speedup of this parallelization is limited by the time of the warmup stage (Amdahl's law):

    speedup = (T_warmup + T_sampling) / (T_warmup + T_sampling / N_p)
            < (T_warmup + T_sampling) / T_warmup
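The Amdahl bound can be checked numerically; a minimal sketch:

```python
def dqmc_speedup(t_warmup, t_sampling, n_proc):
    """Speedup when only the sampling stage runs on n_proc processors."""
    return (t_warmup + t_sampling) / (t_warmup + t_sampling / n_proc)

# With a warmup : sampling ratio of 1 : 10, the speedup is bounded by
# (1 + 10) / 1 = 11 no matter how many independent chains are run.
bound = (1.0 + 10.0) / 1.0
s = dqmc_speedup(1.0, 10.0, 1024)
```

Even with 1024 processors the speedup stays just under 11, which is why the warmup stage itself must be parallelized for large machines.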
Green's Function Calculation
The matrix G_k needs to be recomputed cyclically as the B_k are updated:

    G_1 = (I + B_1 B_2 ... B_{L-1} B_L)^{-1}
    G_2 = (I + B_2 B_3 ... B_L B_1)^{-1}
    G_3 = (I + B_3 B_4 ... B_1 B_2)^{-1}
    ...

A parallel reduction can compute each product in O(N^3 log L) time, interleaving DQMC steps with the computation of successive G_k. But it is numerically unstable!
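The pairwise (tree) reduction of the B-matrix product can be sketched as follows; each round corresponds to one parallel step, so L matrices need about log2(L) rounds (function names are illustrative, not from QUEST):

```python
import numpy as np

def tree_product(mats):
    """Pairwise (tree) reduction of an ordered matrix product.

    Each round multiplies adjacent pairs; in a parallel implementation
    every pair is handled by a different processor, so the number of
    rounds, not the number of multiplications, sets the runtime.
    """
    mats = list(mats)
    rounds = 0
    while len(mats) > 1:
        mats = [mats[i] @ mats[i + 1] if i + 1 < len(mats) else mats[i]
                for i in range(0, len(mats), 2)]
        rounds += 1
    return mats[0], rounds

rng = np.random.default_rng(2)
Bs = [np.eye(3) + 0.1 * rng.standard_normal((3, 3)) for _ in range(8)]
P, rounds = tree_product(Bs)
```

The reduction keeps the left-to-right order of the factors, so the result equals the sequential product; the instability mentioned on the slide comes from multiplying matrices with widely separated scales without intermediate stabilization, not from the reordering.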
Rolling Feeder Algorithm
The matrix product can be stably computed sequentially; the rolling feeder pipelines DQMC steps with the computation of successive G_k across processors.

Tasks to get one G_k:
                              Sequential   Parallel reduction   Rolling feeder
Matrix multiplications        L            log L                1
Stabilization steps           O(L)         O(log L)             O(1)
Inversions of (I + B_1...B_L) 1            1                    1
Data transmission             N^2          O(L N^2)             N^2

Comparisons on resources and stability:
Processors                    O(1)         O(L)                 O(L)
Numerically stable            Yes          No                   Yes
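The stabilization behind the sequential evaluation is typically a graded decomposition: after each multiplication the running product is refactored, e.g. as Q * diag(d) * T using QR, so that widely separated magnitudes are kept in the diagonal scaling instead of mixing inside one matrix. A sketch of this UDT-style stabilization (the structure is standard DQMC practice; names and details are illustrative, not QUEST's actual routines):

```python
import numpy as np

def stabilized_product(Bs):
    """Accumulate B_L ... B_2 B_1 as Q @ diag(d) @ T, refactoring each step."""
    n = Bs[0].shape[0]
    Q, d, T = np.eye(n), np.ones(n), np.eye(n)
    for B in Bs:
        M = (B @ Q) * d             # B * Q * diag(d): columns scaled by d
        Q, R = np.linalg.qr(M)      # refactor before the scales mix
        d = np.abs(np.diag(R))      # pull the magnitudes into d
        d[d == 0] = 1.0
        T = (R / d[:, None]) @ T    # keep T well-scaled (rows divided by d)
    return Q, d, T

rng = np.random.default_rng(3)
Bs = [np.eye(3) + 0.2 * rng.standard_normal((3, 3)) for _ in range(6)]
Q, d, T = stabilized_product(Bs)
prod = (Q * d) @ T                  # reassemble the full product
```

Because Q stays orthogonal and T stays well-scaled, only the positive diagonal d carries the extreme magnitudes, which is what makes long products at large β computable in double precision.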
Parallel Matrix Computations
Two matrix computation kernels are parallelized.
The unequal-time Green's function is computed by blocks in parallel:

    G^τ_{k,l} = (I + B_k ... B_1 B_L ... B_{k+1})^{-1} B_k ... B_{l+1},   k > l
    G^τ_{k,l} = (I + B_k ... B_1 B_L ... B_{k+1})^{-1},                   k = l
    G^τ_{k,l} = (I + B_k ... B_{k+1})^{-1} B_k ... B_1 B_L ... B_{l+1},   k < l

The matrix-matrix multiplication of G_k with each block matrix of G^τ is sped up using multicore. The matrices involved (G_k is 256 x 256 here) are too small for the matrix computation to benefit from MPI-style parallelization.
System Design
The system contains several simulators for the parallel Markov chain. Each simulator consists of an MC walker and an M-server. The walker runs the DQMC steps: an iterator updates the HS field while feeders compute the Green's functions (GC). The M-server performs the physical measurements, with separate measurators (each with its own feeders) for equal-time and unequal-time quantities.
Implementation Techniques
The system is implemented for hybrid systems (cluster + multicore). Each task (parallel Markov chain, rolling feeder algorithm, unequal-time Green's function, physical measurement) is implemented with a combination of the techniques: MPI, OpenMP, communication/computation overlapping, message compression, and load balancing.
Communication/Computation Overlapping
Without overlapping, the MC iterator sends the HS field for a time slice to a feeder, then waits while the feeder multiplies B and computes G before the MC iteration can start. With the fast update algorithm (FUA), the iterator obtains the next Green's function by updating the previous one itself and starts iterating immediately, overlapping its work with the feeder's computation and reducing the waiting time.
Load Balance
Iterators are fully occupied and are the bottleneck of speedup. Processor utilization can be enhanced by merging tasks: for example, when computing the unequal-time Green's function, each processor can take care of more than one block submatrix.
The load balance problem: how many block submatrices should one processor compute?
Queueing theory (Little's law) gives an estimate:

    n_C = max(1, ceil(λT / P))

λ: arrival rate; T: processing time; P: processor utilization.
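Assuming the estimate takes the form n_C = ceil(λT / P) (a hedged reading of Little's law with a target utilization; the paper's exact expression may differ), the merge factor can be sketched as:

```python
import math

def blocks_per_processor(arrival_rate, service_time, utilization):
    """Estimate how many block submatrices one processor should handle.

    Little's law says arrival_rate * service_time jobs are in flight on
    average; dividing by the target utilization and rounding up gives a
    merge factor n_C. Assumed formula, not taken verbatim from the paper.
    """
    return max(1, math.ceil(arrival_rate * service_time / utilization))

# Example: blocks arrive at 0.5 per unit time, take 4 time units each,
# and we aim for 90% processor utilization.
n_c = blocks_per_processor(arrival_rate=0.5, service_time=4.0, utilization=0.9)
```

A larger n_C keeps each processor busier at the cost of longer per-block latency, which is the trade-off the load-balance experiment below explores.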
System and Benchmark
System: runs on the IBM Blue Gene/P. Each compute node is equipped with an 850 MHz PowerPC 450 quad-core processor and 2 GB of memory. IBM XL compilers with IBM BLAS and LAPACK libraries.
Benchmark: DQMC simulation on a two-dimensional periodic lattice. The lattice size is N = 16 x 16 = 256. The ratio of DQMC steps between the warmup stage and the sampling stage is 1 : 10.
Communication Pattern
Trace of two simulators, each with an iterator, six feeders, and a measurator. Green bands show the waiting time of MPI_RECV. Iterators are fully occupied after they start.
Speedup for Different L
[Figure: speedup vs. number of processors (np) for different values of L.]
Effect of Load Balance (L = 96)
[Figure: speedup vs. number of processors (np) for different nc.]
nc: number of block submatrices computed per processor.
Summary
DQMC simulation for strongly correlated materials is a computationally intensive task that calls for parallelization.
We targeted hybrid massively parallel systems and explored the parallelism of DQMC simulations at different levels of granularity.
Our implementation shows over 80x speedup on a thousand processors, much better than embarrassing parallelization (speedup < 11).
Future Work
- More fine-grained parallel matrix computation kernels (pivoted QR, QR, matrix inversion) to fully utilize the computational power of multicores.
- Better system design to enhance processor utilization.
- Different physics models and methods.
The code is still in the experimental stage; further development is required for practical use.
More informationThe HiCFD Project A joint initiative from DLR, T-Systems SfR and TU Dresden with Airbus, MTU and IBM as associated partners
The HiCFD Project A joint initiative from DLR, T-Systems SfR and TU Dresden with Airbus, MTU and IBM as associated partners Thomas Alrutz, Christian Simmendinger T-Systems Solutions for Research GmbH 02.06.2009
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationParallel computation performances of Serpent and Serpent 2 on KTH Parallel Dator Centrum
KTH ROYAL INSTITUTE OF TECHNOLOGY, SH2704, 9 MAY 2018 1 Parallel computation performances of Serpent and Serpent 2 on KTH Parallel Dator Centrum Belle Andrea, Pourcelot Gregoire Abstract The aim of this
More informationParallelization Strategy
COSC 335 Software Design Parallel Design Patterns (II) Spring 2008 Parallelization Strategy Finding Concurrency Structure the problem to expose exploitable concurrency Algorithm Structure Supporting Structure
More informationGenerating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory
Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationKey Technologies for 100 PFLOPS. Copyright 2014 FUJITSU LIMITED
Key Technologies for 100 PFLOPS How to keep the HPC-tree growing Molecular dynamics Computational materials Drug discovery Life-science Quantum chemistry Eigenvalue problem FFT Subatomic particle phys.
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationGTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013
GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»
More informationParallel Computing Basics, Semantics
1 / 15 Parallel Computing Basics, Semantics Landau s 1st Rule of Education Rubin H Landau Sally Haerer, Producer-Director Based on A Survey of Computational Physics by Landau, Páez, & Bordeianu with Support
More informationA GPU Sparse Direct Solver for AX=B
1 / 25 A GPU Sparse Direct Solver for AX=B Jonathan Hogg, Evgueni Ovtchinnikov, Jennifer Scott* STFC Rutherford Appleton Laboratory 26 March 2014 GPU Technology Conference San Jose, California * Thanks
More informationPrinciples of Parallel Algorithm Design: Concurrency and Mapping
Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction
More informationData-intensive computing in radiative transfer modelling
German Aerospace Center (DLR) Remote Sensing Technology Institute (IMF) Data-intensive computing in radiative transfer modelling Dmitry Efremenko Diego Loyola Adrian Doicu Thomas Trautmann Dmitry.Efremenko@dlr.de
More informationCo-array Fortran Performance and Potential: an NPB Experimental Study. Department of Computer Science Rice University
Co-array Fortran Performance and Potential: an NPB Experimental Study Cristian Coarfa Jason Lee Eckhardt Yuri Dotsenko John Mellor-Crummey Department of Computer Science Rice University Parallel Programming
More informationComputer Performance Evaluation: Cycles Per Instruction (CPI)
Computer Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate: where: Clock rate = 1 / clock cycle A computer machine
More informationCOSC 6374 Parallel Computation. Parallel Design Patterns. Edgar Gabriel. Fall Design patterns
COSC 6374 Parallel Computation Parallel Design Patterns Fall 2014 Design patterns A design pattern is a way of reusing abstract knowledge about a problem and its solution Patterns are devices that allow
More informationTHE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS
Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT
More informationSubject Index. Journal of Discrete Algorithms 5 (2007)
Journal of Discrete Algorithms 5 (2007) 751 755 www.elsevier.com/locate/jda Subject Index Ad hoc and wireless networks Ad hoc networks Admission control Algorithm ; ; A simple fast hybrid pattern-matching
More informationPage rank computation HPC course project a.y Compute efficient and scalable Pagerank
Page rank computation HPC course project a.y. 2012-13 Compute efficient and scalable Pagerank 1 PageRank PageRank is a link analysis algorithm, named after Brin & Page [1], and used by the Google Internet
More informationHybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes
Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes Wayne Pfeiffer (SDSC/UCSD) & Alexandros Stamatakis (TUM) February 25, 2010 What was done? Why is it important? Who cares? Hybrid MPI/OpenMP
More informationPerformance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla
Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla SIAM PP 2016, April 13 th 2016 Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer,
More informationTrends and Challenges in Multicore Programming
Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More information1. INTRODUCTION. Keywords: multicore, openmp, parallel programming, speedup.
Parallel Algorithm Performance Analysis using OpenMP for Multicore Machines Mustafa B 1, Waseem Ahmed. 2 1 Department of CSE,BIT,Mangalore, mbasthik@gmail.com 2 Department of CSE, HKBKCE,Bangalore, waseem.pace@gmail.com
More informationBig Orange Bramble. August 09, 2016
Big Orange Bramble August 09, 2016 Overview HPL SPH PiBrot Numeric Integration Parallel Pi Monte Carlo FDS DANNA HPL High Performance Linpack is a benchmark for clusters Created here at the University
More informationParallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors
Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Andrés Tomás 1, Zhaojun Bai 1, and Vicente Hernández 2 1 Department of Computer
More informationThe Fusion Distributed File System
Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique
More informationModern GPUs (Graphics Processing Units)
Modern GPUs (Graphics Processing Units) Powerful data parallel computation platform. High computation density, high memory bandwidth. Relatively low cost. NVIDIA GTX 580 512 cores 1.6 Tera FLOPs 1.5 GB
More informationComputational Methods in Statistics with Applications A Numerical Point of View. Large Data Sets. L. Eldén. March 2016
Computational Methods in Statistics with Applications A Numerical Point of View L. Eldén SeSe March 2016 Large Data Sets IDA Machine Learning Seminars, September 17, 2014. Sequential Decision Making: Experiment
More informationParallel Programming with MPI and OpenMP
Parallel Programming with MPI and OpenMP Michael J. Quinn Chapter 6 Floyd s Algorithm Chapter Objectives Creating 2-D arrays Thinking about grain size Introducing point-to-point communications Reading
More informationOutline. CSC 447: Parallel Programming for Multi- Core and Cluster Systems
CSC 447: Parallel Programming for Multi- Core and Cluster Systems Performance Analysis Instructor: Haidar M. Harmanani Spring 2018 Outline Performance scalability Analytical performance measures Amdahl
More informationBIG CPU, BIG DATA. Solving the World s Toughest Computational Problems with Parallel Computing. Second Edition. Alan Kaminsky
Solving the World s Toughest Computational Problems with Parallel Computing Second Edition Alan Kaminsky Department of Computer Science B. Thomas Golisano College of Computing and Information Sciences
More informationMUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA
The MUMPS library Abdou Guermouche and MUMPS team, Univ. Bordeaux 1 and INRIA June 22-24, 2010 MUMPS Outline MUMPS status Recently added features MUMPS and multicores? Memory issues GPU computing Future
More informationAdaptive Transpose Algorithms for Distributed Multicore Processors
Adaptive Transpose Algorithms for Distributed Multicore Processors John C. Bowman and Malcolm Roberts University of Alberta and Université de Strasbourg April 15, 2016 www.math.ualberta.ca/ bowman/talks
More information