Implementation and evaluation of 3D FFT parallel algorithms based on software component model
1 Master 2 - Visualisation Image Performance, University of Orléans
Implementation and evaluation of 3D FFT parallel algorithms based on software component model
Jérôme RICHARD, October 7th 2014
Under the supervision of Christian PEREZ and Vincent LANORE
University tutor: Sophie ROBERT
2 High Performance Computing
- Scientific applications require a lot of computing time (examples: Illustris Project, Human Brain Project)
- Parallelism is used to reduce the computation time
3 High Performance Computing: parallel programming paradigms
- Shared memory
  - One central memory readable/writable by multiple processing units
  - Synchronization using locks, semaphores, transactional memory, etc.
  - Examples: POSIX Threads, OpenMP, UPC
- Message passing
  - Messages sent between processing units
  - Synchronization using messages
  - Collective communications
  - Examples: MPI, PVM
4 High Performance Computing: parallel architectures
- Clusters and supercomputers
- Large variety of parallel hardware architectures in use
  - Processing nodes (processor, memory, hard drive, etc.)
  - Accelerators (GPU, FPGA, etc.)
  - Network topology (fat tree, 3D torus, hypercube, etc.)
- To maximize performance, adapt the software to the hardware
5 High Performance Computing: observations
- Hardware is frequently renewed
- Optimizing for a specific hardware is a time-consuming and costly process
- Adapting applications to hardware takes too much time
Internship
- 3D FFT computing
- Easy to handle/optimize variants
- Use of software components to handle variability
- Adaptation of 3D FFT algorithms using software components
6 Plan
- Context
- 3D FFT
- Component Models
- Designing 3D FFT Assemblies with L²C
- Evaluating Re-usability and Performance
- Conclusion and Discussion
8 Sequential FFTs
- Fast Fourier Transform: fast algorithm that converts a discretized signal (matrix) from the time domain to the frequency domain
- Used in molecular dynamics, astrophysics, meteorology, 3D tomography, seismology, etc. (e.g. GROMACS)
- Reference algorithm: Cooley-Tukey
- Time complexity: O(n log(n))
- An FFT over more than one dimension can be decomposed into multiple 1D FFTs: FFT along the rows, then FFT along the columns
- Expensive on large matrices
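The row/column decomposition mentioned on this slide can be sketched in a few lines. The following is a minimal illustration, assuming a recursive radix-2 Cooley-Tukey kernel and power-of-two sizes; the function names (fft1d, fft2d, dft2d_naive) are hypothetical and are not taken from the presentation or from FFTW.

```python
import cmath

def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft2d(m):
    """2D FFT built only from 1D FFTs: first along rows, then along columns."""
    rows = [fft1d(r) for r in m]                  # 1D FFT of every row
    cols = [fft1d(list(c)) for c in zip(*rows)]   # 1D FFT of every column
    return [list(r) for r in zip(*cols)]          # transpose back to row-major

def dft2d_naive(m):
    """Direct 2D DFT from the definition, used only as a reference."""
    n0, n1 = len(m), len(m[0])
    return [[sum(m[a][b] * cmath.exp(-2j * cmath.pi * (u * a / n0 + v * b / n1))
                 for a in range(n0) for b in range(n1))
             for v in range(n1)]
            for u in range(n0)]
```

The same idea extends to three dimensions by applying 1D FFTs along X, Y and Z in turn, which is what makes the data layout (and therefore the transpositions) the central issue in the parallel versions.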
9 Parallel 3D FFTs
- 1D/Slab decomposition
- 2 transposition phases
- 2 calculation phases
- Scaling limit: at most N processes, where N is the size of the input matrix along one dimension
- The scaling limit is an issue on current supercomputers
[Figure: layout of the X, Y and Z axes across the phases of the slab decomposition]
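The scaling limit of the slab decomposition can be illustrated with a small partitioning helper. This is only a sketch under the assumption that planes are distributed in contiguous blocks, with the remainder given to the lowest ranks; slab_range and idle_ranks are hypothetical names.

```python
def slab_range(n, nprocs, rank):
    """Half-open range [start, stop) of planes owned by `rank`
    when n planes are split into contiguous slabs over nprocs processes."""
    base, extra = divmod(n, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

def idle_ranks(n, nprocs):
    """Ranks that receive no plane at all (possible only when nprocs > n)."""
    idle = []
    for rank in range(nprocs):
        start, stop = slab_range(n, nprocs, rank)
        if start == stop:
            idle.append(rank)
    return idle
```

With n = 8 planes, 4 processes each own 2 planes, while 12 processes leave ranks 8 to 11 without any work: adding processes beyond N brings no benefit with this decomposition.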
10 Parallel 3D FFTs
- 2D/Pencil decomposition
- 4 transposition phases
- 3 calculation phases
- Scaling limit: at most N² processes
- Better scaling limit, but more expensive in terms of network operations
[Figure: layout of the X, Y and Z axes across the phases of the pencil decomposition]
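In a pencil decomposition the processes form a 2D grid, and each transposition is usually a total exchange restricted to one row or one column of that grid. The sketch below assumes a row-major mapping of ranks onto the grid (an assumption, not something stated on the slide); pencil_groups and max_processes are hypothetical names.

```python
def pencil_groups(prow, pcol):
    """Groups of ranks that exchange data together in a prow x pcol process grid.

    Rank r is assumed to sit at grid position (r // pcol, r % pcol).
    Returns (row_groups, col_groups): one list of ranks per grid row / column.
    """
    row_groups = [[i * pcol + j for j in range(pcol)] for i in range(prow)]
    col_groups = [[i * pcol + j for i in range(prow)] for j in range(pcol)]
    return row_groups, col_groups

def max_processes(n, decomposition):
    """Scaling limit for an n x n x n matrix: n for slabs, n * n for pencils."""
    if decomposition == "slab":
        return n
    if decomposition == "pencil":
        return n * n
    raise ValueError("decomposition must be 'slab' or 'pencil'")
```

For a 1024 x 1024 x 1024 matrix, the limit moves from 1024 processes (slab) to 1,048,576 processes (pencil), at the price of more exchange phases.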
11 How to Optimize 3D FFTs
- Adapt the transposition algorithm (total exchange) (Bruck1994, Prisacari2013)
- Computation-communication overlapping (Kandalla2011)
  [Figure: execution time of computation and communication, without and with overlapping, showing the gain]
- Adapt the number of transpositions (Pekurovsky2012)
- Optimizations depend on hardware and software parameters
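The gain expected from computation-communication overlapping can be estimated with a toy cost model. The sketch below is an assumption made for illustration, not the model used in the presentation: the data is split into chunks, the communication of a chunk starts once that chunk is computed and the previous communication has finished, and computation of the next chunk proceeds in the meantime.

```python
def time_without_overlap(comp, comm):
    """Every computation and communication phase runs one after the other."""
    return sum(comp) + sum(comm)

def time_with_overlap(comp, comm):
    """Two-stage pipeline: chunk i is sent while chunk i + 1 is computed."""
    comp_done = 0.0   # time at which the current chunk has been computed
    comm_done = 0.0   # time at which the current chunk has been sent
    for c, m in zip(comp, comm):
        comp_done += c
        comm_done = max(comp_done, comm_done) + m
    return comm_done

def overlap_gain(comp, comm):
    """Time saved by overlapping, in the same unit as the inputs."""
    return time_without_overlap(comp, comm) - time_with_overlap(comp, comm)
```

With four chunks costing 2 time units of computation and 1 unit of communication each, the model gives 12 units without overlapping and 9 with it. The gain depends on the ratio between the two costs, which is one reason why the right choice depends on hardware and software parameters.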
12 Existing Implementations
- Sequential libraries
  - Examples: ESSL, MKL, FFTW, etc.
- Parallel libraries
  - Use sequential libraries
  - Conditional compilation
  - Examples: P3DFFT, 2DECOMP
  - Issue: interleaving of highly optimized small pieces of code
- Compilation of a high-level mathematical description
  - Examples: FFTW, SPIRAL
  - Issue: optimizations are integrated into the generators/compilers
- Adding new optimizations is difficult and can be a tricky process
14 Component Models
- Build an application by assembling component instances
- Component
  - Black box with interfaces
  - Dissociates the interface from its implementation
  - Interacts through its interfaces
- Assembly
  - Component instances
  - Interface connections
- Benefits
  - Separation of concerns
  - Reuse
[Figure: a Keyboard component connected to a Computer component through usb ports]
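The keyboard/computer example of this slide can be mimicked with plain objects. This is a generic sketch of the use/provide idea and does not reproduce the API of L²C or of any other component model; all class and function names are hypothetical.

```python
class UsbKeyboard:
    """Component providing a 'usb' interface (here: a read_key method)."""
    def __init__(self, key):
        self._key = key

    def read_key(self):
        return self._key

class Computer:
    """Component using a 'usb' interface; it never names a concrete keyboard."""
    def __init__(self):
        self.usb = None   # use port, filled in when the assembly is built

    def poll(self):
        if self.usb is None:
            raise RuntimeError("use port 'usb' is not connected")
        return self.usb.read_key()

def connect(user, port_name, provider):
    """Assembly step: plug a provider into a named use port of a component."""
    setattr(user, port_name, provider)
```

Because Computer depends only on the interface, swapping the keyboard for another provider is a change to the assembly, not to the component code. This is the property exploited later to replace transposition components.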
15 Distributed Component Models
- Examples of distributed component models
  - CCM (CORBA Component Model)
  - GCM (Grid Component Model)
  - Overhead not acceptable for HPC applications
- Examples of HPC-dedicated component models
  - CCA (Common Component Architecture): no MPI in the model, and granularity limited to the process
  - L²C (Low Level Component)
16 Low Level Component (L²C)
- Component model that does not hide system issues
- Developed inside the Avalon team
- C++ or FORTRAN components that have attributes and ports
- Intra-process interactions with use/provide ports (C++, FORTRAN)
- Inter-process interactions
  - Message passing via MPI
  - Remote procedure call via CORBA
- Assembly described by a file (processes, attributes, instances...)
- Efficient assembly, but exhaustive description
- Generation
17 Model features
- Hierarchy: a component instance can be an assembly (composite component)
  [Figure: a Laptop composite made of Keyboard, Mouse, Motherboard and Screen instances connected through in/out ports]
- Connectors: elements that interconnect multiple instance ports without specifying how
  [Figure: client Computers and a host Computer interconnected by a LAN connector]
- Genericity: components are abstracted as generic components, which are then specialized
  [Figure: generic Skeleton<C,X> specialized with C = Computer and X = 3]
- Features provided by HLCM
18 HLCM
- High Level Component Model
- Developed inside the Avalon team
- Features
  - Hierarchy
  - Genericity
  - Based on connectors
- Transforms a high-level assembly into a low-level assembly
- The targeted low-level model can be L²C, Gluon++, etc.
20 3D FFT Assemblies
- Sequential assembly
- Distributed assembly
- Design assembly variants using
  - Local transformations
  - Global transformations
21 Basic 3D FFT Assembly
- Sequential assembly
- 8 tasks identified and mapped to components: 2 control tasks and 6 computing tasks
- 3D FFT sequential algorithms can be implemented in L²C
[Figure: sequential assembly. Legend: provide port, use port, component instances. Instances, with task numbers: Allocator (1), LocalDataInitializer (2), Planifier1D_X (3), Planifier2D_XY (3), FFTW (4), LocalTranspose_XZ (5), LocalDataFinalizer (6), ProcessingUnit (7), LocalMaster (8). Ports: go, allocate, init, plan]
22 Distributed 3D FFT Assembly
- Assembly duplicated on each MPI process
- 4 components replaced (providing an MPI interface)
- 1D decomposition can be implemented in L²C
[Figure: the assembly on two MPI processes. Instances per process, with task numbers: Allocator (1), SlabDataInitializer (2), Planifier1D_X (3), Planifier2D_XY (3), FFTW (4), MpiTransposeSync_XZ (5), SlabDataFinalizer (6), ProcessingUnit (7), SlabMaster (8). Ports: go, allocate, init, plan, MPI]
23 Local Transformations
- Component replacement used to
  - Adapt transpositions (e.g. LocalTranspose_XZ (5) replaced by MpiTranspose_XZ (5))
  - Switch the sequential implementation
- Attribute configuration used to
  - Balance the load between instances
  - Allow overlapping
- LAD (XML)
  [...]
  <property id="sizex">
    <value type="uint64"> 256 </value>
  </property>
  [...]
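Attribute configuration amounts to editing property values in the assembly description. The sketch below works only on the property fragment shown on the slide; the rest of the LAD file format is not described in the presentation, so nothing beyond this fragment is assumed, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, taken in isolation.
FRAGMENT = '<property id="sizex"><value type="uint64"> 256 </value></property>'

def read_property(xml_text):
    """Return (id, type, integer value) of a single <property> element."""
    prop = ET.fromstring(xml_text)
    value = prop.find("value")
    return prop.get("id"), value.get("type"), int(value.text.strip())

def with_value(xml_text, new_value):
    """Return the same <property> element with its value replaced."""
    prop = ET.fromstring(xml_text)
    prop.find("value").text = str(new_value)
    return ET.tostring(prop, encoding="unicode")
```

For instance, with_value(FRAGMENT, 512) yields a description in which the same component instance is configured for a different size, without touching its code.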
24 Global Transformations
- 2D decomposition
- 2D decomposition can be implemented in L²C
[Figure: the 2D decomposition assembly on two MPI processes. Instances per process: Allocator, PencilDataInitializer, Planifier1D_X (three instances), FFTW, MpiTransposeSync_XY, MpiTransposeSync_XZ, PencilDataFinalizer, ProcessingUnit, PencilMaster. Ports: go, allocate, init, plan, MPI]
25 Global Transformations
- Adaptation of the number of transpositions
- Easy to adjust the number of transpositions
26 Variants
- 1D decomposition: L2C_1D_[...]; 2D decomposition: L2C_2D_[...]
- Without load balancing: L2C_XD_[...]; with load balancing: L2C_XDH_[...]
- Without overlapping: L2C_XD_[...]; with overlapping: L2C_XDR_[...]
- Variants by transposition count
  - 1 transposition, 1D decomposition: L2C_1D_1t_xz, L2C_1D_1t_yz
  - 2 transpositions, 1D decomposition: L2C_1D_2t_xz, L2C_1D_2t_yz
  - 2 transpositions, 2D decomposition: L2C_2D_2t_yzx, L2C_2D_2t_zxy
  - 3 transpositions, 2D decomposition: L2C_2D_3t_xzy, L2C_2D_3t_yxz, L2C_2D_3t_zyx
  - 4 transpositions, 2D decomposition: L2C_2D_4t_xyz
- All assemblies implemented (except those using both overlapping and 2D decomposition)
- Some optimizations implemented, many variants
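The naming scheme of the variants is regular enough to be decoded mechanically. The parser below is a sketch inferred from the names listed on the slides. It assumes that the H and R flags directly follow the D, and it treats any trailing suffix (such as the _blk suffix that appears in the reuse analysis) as an opaque string, since its meaning is not defined in the presentation.

```python
import re

# L2C_<1|2>D[H][R]_<count>t_<axis order>[_<suffix>]
VARIANT = re.compile(
    r"^L2C_(?P<dims>[12])D(?P<flags>[HR]*)_(?P<count>\d)t_"
    r"(?P<order>[xyz]+)(?:_(?P<suffix>\w+))?$"
)

def parse_variant(name):
    """Decode a variant name into its decomposition, flags and transpositions."""
    match = VARIANT.match(name)
    if match is None:
        raise ValueError("not a recognized variant name: " + name)
    flags = match.group("flags")
    return {
        "decomposition": int(match.group("dims")),   # 1 = slab, 2 = pencil
        "load_balancing": "H" in flags,
        "overlapping": "R" in flags,
        "transpositions": int(match.group("count")),
        "order": match.group("order"),
        "suffix": match.group("suffix"),             # None when absent
    }
```

For example, parse_variant("L2C_1DH_2t_yz_blk") reports a 1D decomposition with load balancing, 2 transpositions, axis order "yz" and the suffix "blk".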
28 Reuse Analysis
- Table columns: Version; Line count (C++ code); Proportion of reused code
- Versions compared: L2C_1D_2t_xz, L2C_1D_1t_yz, L2C_1D_2t_yz, L2C_1D_2t_yz_blk, L2C_1DH_1t_yz, L2C_1DH_2t_yz_blk, L2C_2D_4t_xyz, L2C_2D_2t_zxy, L2C_2DH_2t_zxy
- Total: 30 components, 3743 lines of C++ code
- Assembly components are reusable
29 Performance Measurement
- Experimental platforms: Grid'5000 and Curie
- Reference implementations: FFTW and 2DECOMP
- Input matrix size: 2^p x 2^p x 2^p with p ∈ ℕ
- Launch count: 100
- Result computing method: average
- Error bars: 1st and 3rd quartiles
- Compilers: GCC on Grid'5000 and ICC on Curie
- MPI implementation: OpenMPI
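The reported statistics (average over the launches, with the 1st and 3rd quartiles as error bars) can be computed as follows. The quartile interpolation method is not specified on the slide; the "inclusive" method of the Python standard library is used here as an assumption, and summarize is a hypothetical name.

```python
import statistics

def summarize(timings):
    """Return (mean, first quartile, third quartile) of a list of run times."""
    q1, _median, q3 = statistics.quantiles(timings, n=4, method="inclusive")
    return statistics.fmean(timings), q1, q3
```

A plot would then draw one point at the mean for each configuration, with an error bar spanning from the first to the third quartile of its timings.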
30 Performance Measurement and Scaling
- Large-scale experiments on the Curie supercomputer
- Matrix size: 1024 x 1024 x 1024
- L²C assemblies scale well up to [...] cores
[Figure: two plots, 1D decomposition and 2D decomposition]
31 Variants Evaluation and Analysis
- Heterogeneous experiments (on Grid'5000)
- Matrix block size experimentally adapted
- 1 slow cluster and 1 fast cluster used
- Matrix size: 256 x 256 x 256
- Assemblies implement load balancing
[Figure: two plots, 1D decomposition and 2D decomposition]
33 Conclusion and Perspectives
- Goal: examine whether component models can handle the variability of 3D FFT codes
- Contribution
  - Design and implementation of 3D FFT algorithms with L²C
  - Experiments on Grid'5000 and Curie
- Results
  - Variants are easy to handle (code is highly reused between assemblies)
  - High performance (competitive with reference implementations)
- Perspectives
  - Design higher-level assemblies using HLCM
  - Implement selection algorithms to automatically specialize assemblies
CHRONO::HPC DISTRIBUTED MEMORY FLUID-SOLID INTERACTION SIMULATIONS Felipe Gutierrez, Arman Pazouki, and Dan Negrut University of Wisconsin Madison Support: Rapid Innovation Fund, U.S. Army TARDEC ASME
More informationThe MPI Message-passing Standard Practical use and implementation (I) SPD Course 2/03/2010 Massimo Coppola
The MPI Message-passing Standard Practical use and implementation (I) SPD Course 2/03/2010 Massimo Coppola What is MPI MPI: Message Passing Interface a standard defining a communication library that allows
More informationNUMA-aware OpenMP Programming
NUMA-aware OpenMP Programming Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de Christian Terboven IT Center, RWTH Aachen University Deputy lead of the HPC
More informationLattice Simulations using OpenACC compilers. Pushan Majumdar (Indian Association for the Cultivation of Science, Kolkata)
Lattice Simulations using OpenACC compilers Pushan Majumdar (Indian Association for the Cultivation of Science, Kolkata) OpenACC is a programming standard for parallel computing developed by Cray, CAPS,
More informationCHAO YANG. Early Experience on Optimizations of Application Codes on the Sunway TaihuLight Supercomputer
CHAO YANG Dr. Chao Yang is a full professor at the Laboratory of Parallel Software and Computational Sciences, Institute of Software, Chinese Academy Sciences. His research interests include numerical
More informationThe Use of Cloud Computing Resources in an HPC Environment
The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes
More informationAdaptive Matrix Transpose Algorithms for Distributed Multicore Processors
Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationLecture 16: Recapitulations. Lecture 16: Recapitulations p. 1
Lecture 16: Recapitulations Lecture 16: Recapitulations p. 1 Parallel computing and programming in general Parallel computing a form of parallel processing by utilizing multiple computing units concurrently
More informationScientific Programming in C XIV. Parallel programming
Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence
More informationParallel Programming Environments. Presented By: Anand Saoji Yogesh Patel
Parallel Programming Environments Presented By: Anand Saoji Yogesh Patel Outline Introduction How? Parallel Architectures Parallel Programming Models Conclusion References Introduction Recent advancements
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationKerrighed: A SSI Cluster OS Running OpenMP
Kerrighed: A SSI Cluster OS Running OpenMP EWOMP 2003 David Margery, Geoffroy Vallée, Renaud Lottiaux, Christine Morin, Jean-Yves Berthou IRISA/INRIA PARIS project-team EDF R&D 1 Introduction OpenMP only
More informationA Low Level Component Model enabling Resource Specialization of HPC Applications
A Low Level Component Model enabling Resource Specialization of HPC Applications Julien Bigot, Zhengxiong Hou, Christian Pérez, Vincent Pichon To cite this version: Julien Bigot, Zhengxiong Hou, Christian
More informationCOMP528: Multi-core and Multi-Processor Computing
COMP528: Multi-core and Multi-Processor Computing Dr Michael K Bane, G14, Computer Science, University of Liverpool m.k.bane@liverpool.ac.uk https://cgi.csc.liv.ac.uk/~mkbane/comp528 2X So far Why and
More informationA Multi-layered Domain-specific Language for Stencil Computations
A Multi-layered Domain-specific Language for Stencil Computations Christian Schmitt Hardware/Software Co-Design, University of Erlangen-Nuremberg SPPEXA-Kolloquium, Erlangen, Germany; July 09, 2014 Challenges
More informationParallel Systems. Project topics
Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a
More informationTowards ParadisEO-MO-GPU: a Framework for GPU-based Local Search Metaheuristics
Towards ParadisEO-MO-GPU: a Framework for GPU-based Local Search Metaheuristics N. Melab, T-V. Luong, K. Boufaras and E-G. Talbi Dolphin Project INRIA Lille Nord Europe - LIFL/CNRS UMR 8022 - Université
More informationLecture 4: Principles of Parallel Algorithm Design (part 4)
Lecture 4: Principles of Parallel Algorithm Design (part 4) 1 Mapping Technique for Load Balancing Minimize execution time Reduce overheads of execution Sources of overheads: Inter-process interaction
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationProductive Performance on the Cray XK System Using OpenACC Compilers and Tools
Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More information