Mixed MPI-OpenMP EUROBEN kernels
Transcription
1 Mixed MPI-OpenMP EUROBEN kernels
Filippo Spiga (on behalf of CINECA)
PRACE Workshop New Languages & Future Technology Prototypes, March 1-2, LRZ, Germany
2 Outline
- Short kernel description
- MPI and OpenMP paradigms
- Objectives and Porting activities
- Performances and Results
- Conclusions, Remarks and Future Works
Probably nothing here is new, but it could be a good starting point for important and relevant considerations about the current HPC ecosystem!
3 OBJECTIVES AND PORTING ACTIVITIES
4 Objectives
1. Starting from a simple (serial) C kernel, build a mixed parallel version based on MPI and OpenMP (two de facto standards) --> performance
2. Starting from a simple (serial) C kernel, evaluate the effort of porting it to the mixed version --> productivity
The kernels were chosen because they are representative of complex computational kernels inside real applications.
5 Porting activity: the path followed
- From the simple version to a multi-threaded version
  - using OpenMP (explicit approach)
  - using a multi-threaded library* (implicit approach)
- From the simple version to a distributed parallel version
  - using the Message Passing Interface (MPI)
- and then: mixing the multi-threaded and distributed parallel versions
6 Porting activity: development platform
PRACE Prototype INTI (provided by CEA)
- Bull cluster composed of 128 nodes (1024 cores)
- dual-socket quad-core Intel Nehalem 2.53 GHz
- 24 GBytes of memory on each node
- IB interconnect
- INTEL compiler suite (v )
- Math Kernel Library (v )
- Open MPI
7 Porting activity: mod2am
- Explicit multi-threading using OpenMP
  - inner/middle/outer parallel loop & loop exchange with unrolling
  - refinements to allow automatic compiler SSE vectorization
- Implicit multi-threading using numerical libraries
  - CBLAS (open-source and MKL)
- MPI parallelization
  - 1D and 2D (Cannon) block decomposition
  - MPI communications based on MPI_Send/MPI_Recv, MPI_Bcast, MPI_Isend/MPI_Irecv, MPI_Sendrecv/MPI_Cart_*
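To make the "outer parallel loop & loop exchange" item concrete, here is a minimal sketch of a dense matrix product in C. It is not the EuroBen mod2am source: the function name `matmul_omp`, the row-major layout, and the choice of the i-k-j loop order are assumptions made for illustration. The outer loop is shared among threads, and the exchanged inner loops give unit-stride access, which is the kind of pattern a compiler can usually vectorize on its own.

```c
#include <stddef.h>

/* Hypothetical illustration, not the EuroBen kernel.
 * Computes C = A * B for row-major matrices:
 * A is m x l, B is l x n, C is m x n. */
void matmul_omp(size_t m, size_t l, size_t n,
                const double *a, const double *b, double *c)
{
    /* Outer loop parallelised: each thread owns whole rows of C,
     * so no two threads ever write the same element. */
    #pragma omp parallel for
    for (long i = 0; i < (long)m; ++i) {
        double *ci = c + (size_t)i * n;

        for (size_t j = 0; j < n; ++j)
            ci[j] = 0.0;

        /* Loop exchange: k runs before j, so the innermost loop walks
         * row i of C and row k of B with unit stride. */
        for (size_t k = 0; k < l; ++k) {
            const double aik = a[(size_t)i * l + k];
            const double *bk = b + k * n;
            for (size_t j = 0; j < n; ++j)
                ci[j] += aik * bk[j];
        }
    }
}
```

Built with the compiler's OpenMP option (for example `-fopenmp` with GCC) the rows are distributed over threads; without it the pragma is ignored and the same code runs serially, which is convenient for checking results against the original kernel.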
8 Porting activity: mod2am (details)
mod2am --> v0 [ORIGINAL KERNEL]
mod2am_omp-unrolled4 --> v0.1 [NOT COMMITTED]
mod2am_omp-1_loop --> v0.2
mod2am_omp-2_loop --> v0.3
mod2am_omp-3_loop --> v0.4
mod2am_omp-nested --> v0.5
mod2am_omp-cblas --> v0.6
mod2am_mpi-1d --> v1.0
mod2am_mpi-1d-bcast --> v1.1
mod2am_mpi-1d-sendrecv --> v1.2
mod2am_mpi-1d-sendrecv-nonblock --> v1.3
mod2am_mpi-2d-cannon --> v2.0
mod2am_mpi-2d-cannon --> v2.0.1 (-D CUBLAS)
mod2am_mpi-2d-cannon-nonblock --> v2.1
mod2am_mpi-2d-cannon-nonblock --> v (-D CUBLAS)
mod2am_mpi-2d-cannon-nonblock --> v (-D CUBLAS -D PREPOSTED_NONBLOCKING)
9 Porting activity: mod2as
- Explicit multi-threading using OpenMP
  - both for 0-index and 1-index CSR
- Implicit multi-threading using numerical libraries
  - Sparse BLAS (open-source* and MKL)
- MPI parallelization
  - trivial block-striped partitioning among all processors
* NIST, not multi-threaded
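The difference between 0-index and 1-index CSR can be sketched with a single sparse matrix-vector product that takes the index base as a parameter. This is an illustrative sketch only: the name `spmv_csr_omp`, the argument order, and the `base` parameter are assumptions, not the interface of the EuroBen kernel or of any Sparse BLAS library.

```c
/* Hypothetical illustration, not the EuroBen kernel.
 * y = A * x with A stored in CSR format.
 * base is 0 for 0-index CSR (C convention) or 1 for 1-index CSR
 * (Fortran convention); it applies to both row_ptr and col_idx. */
void spmv_csr_omp(int nrows, int base,
                  const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
{
    /* Rows are independent, so they can be shared among threads. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        /* Shift the stored offsets back to C array positions. */
        for (int p = row_ptr[i] - base; p < row_ptr[i + 1] - base; ++p)
            sum += val[p] * x[col_idx[p] - base];
        y[i] = sum;
    }
}
```

Calling it with `base = 0` or `base = 1` on the same matrix, stored with the matching convention, should give identical results. A static schedule is the simplest choice; rows with very uneven numbers of non-zeros might call for a different one.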
10 Porting activity: mod2as (details)
mod2as --> v0 [ORIGINAL KERNEL]
mod2as_omp --> v0.1
mod2as_omp-opt --> v0.2
mod2as_omp-opt-csr_0_index --> v0.3.0 [0-index CSR]
mod2as_omp-opt-csr_1_index --> v0.3.1 [1-index CSR]
mod2as_omp-sblas --> v0.4 [Sparse BLAS library (NIST interface)]*
mod2as_omp_sblas-mkl --> v0.5 [Sparse BLAS provided by Intel MKL]
mod2as_mpi-simple --> v1.0 [trivial block-striped partitioning among all processors]
mod2as_mpi-sblas-mkl --> v1.1 [local calculation using MKL and final MPI_Reduce]
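The "trivial block-striped partitioning" used by the MPI versions amounts to giving each process a contiguous band of rows. A minimal sketch of that index arithmetic follows. The helper name `block_range` is hypothetical, and the way leftover rows are handled (one extra row for each of the first ranks) is an assumption, since the slides do not say how the remainder was distributed.

```c
/* Hypothetical helper, not taken from the EuroBen sources.
 * Splits n rows into nprocs contiguous stripes and returns the stripe
 * owned by the given rank (0 <= rank < nprocs).
 * Assumption: when n is not a multiple of nprocs, the first n % nprocs
 * ranks receive one extra row each. */
void block_range(int n, int nprocs, int rank, int *first, int *count)
{
    int base = n / nprocs;   /* rows every rank gets */
    int rem  = n % nprocs;   /* leftover rows to spread */

    *count = base + (rank < rem ? 1 : 0);
    *first = rank * base + (rank < rem ? rank : rem);
}
```

In an MPI program, `rank` and `nprocs` would typically come from `MPI_Comm_rank` and `MPI_Comm_size`; each process would then compute only the rows in its stripe before the partial results are combined.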
11 Porting activity: mod2f
- Explicit multi-threading using OpenMP
  - not done
- Implicit multi-threading using numerical libraries
  - FFTW2 & FFTW3 (open-source)
  - MKL DFTI
  - MKL wrapper for FFTW2 and FFTW3
- MPI parallelization
  - MPI FFTW (not multi-threaded)
  - MKL 1D Cluster FFT (natively multi-threaded)
12 Porting activity: mod2f (details)
mod2f --> v0 [ORIGINAL KERNEL]
mod2f_fftw --> v0.1 [multi-thread FFT provided by FFTW library]
mod2f_fftw_mkl --> v0.1.1 [multi-thread FFT provided by MKL FFTW wrapper]
mod2f_fftw3 --> v0.2 [multi-thread FFT provided by FFTW3 library]
mod2f_fftw3_mkl --> v0.2.1 [multi-thread FFT provided by MKL FFTW3 wrapper]
mod2f_mkl --> v0.3 [the same as mkl/lrz/mod2f_mkl, with minor modifications]
mod2f_mpi --> v1.0 [the same as base/c-mpi... 2D transformation!]
mod2f_mpi_pfftw --> v1.1 [1D distributed FFT using FFTW. Not multi-threaded]
mod2f_mpi_mk --> v1.2 [1D distributed FFT using MKL Cluster FFT]
13 Porting activity: what is missing?
- mod2am
  - Parallel BLAS (PBLAS)
  - SUMMA: Scalable Universal Matrix Multiplication Algorithm
  - DIMMA: Distribution-Independent Matrix Multiplication Algorithm
- mod2as
- mod2f
  - Extension to multi-dimensional FFT
  - Explicit OpenMP parallelization (but would it really be useful?)
14 PERFORMANCES AND RESULTS
15 Productivity evaluation: mod2am
Columns: Time [hh:mm], Effort*, SLOC**, %
OpenMP: 0: ,4%
MPI: ~5: %
OpenMP + MPI: ~6:00
* 1 star = easy, 5 stars = really hard (qualitative scale)
** Number of source lines of code, excluding comments and blank lines
16 Productivity evaluation: mod2as
Columns: Time [hh:mm], Effort*, SLOC**, %
OpenMP: ~2: ,7%
MPI: ~1: ,7%
OpenMP + MPI: ~3:00
* 1 star = easy, 5 stars = really hard (qualitative scale)
** Number of source lines of code, excluding comments and blank lines
17 Productivity evaluation: mod2f
Columns: Time [hh:mm], Effort*, SLOC**, %
OpenMP: 2: %
MPI: ~2d:00:00*** %
OpenMP + MPI: ~2d:00:00
* 1 star = easy, 5 stars = really hard (qualitative scale)
** Number of source lines of code, excluding comments and blank lines
*** Two days were spent solving a problem, with the help of the Intel forum support
18 Performance: mod2am (1)
[Plot: performance [MFlops], logarithmic axis from 1.00E+01 to 1.00E+05, versus input dimension; series: SERIAL, 8OMP, 4MPI*2OMP]
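For readers who want to reproduce a point on a MFlops axis like this one, a timed run has to be converted into a rate. The sketch below assumes the usual operation count for a dense m x l by l x n product, one multiply and one add per inner iteration, that is 2*m*l*n floating-point operations. Both that count and the helper name `matmul_mflops` are assumptions; the slides do not state how the figure was computed.

```c
/* Hypothetical helper, not taken from the EuroBen sources.
 * Converts the wall-clock time of an (m x l) * (l x n) dense product
 * into MFlop/s, assuming 2*m*l*n floating-point operations. */
double matmul_mflops(double m, double l, double n, double seconds)
{
    return (2.0 * m * l * n) / (seconds * 1.0e6);
}
```

The dimensions are passed as `double` so that the operation count does not overflow an integer type for large inputs.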
19 Performance: mod2am (2)
[Plot: scalability versus number of threads; series: Explicit, Implicit]
20 Performance: mod2as
[Plot: performance [MFlops], axis from 1.00E+02 to 2.60E+03, versus input dimension; series: SERIAL, 8OMP, 4MPI*2OMP]
21 Performance: mod2f
Oops, we ran out of time. However, Intel has recently published on its developer blog a presentation* comparing the performance of MKL and FFTW.
- It covers the same strategies we followed during our porting activities
- 1D Cluster FFT implements distributed calculation using BLACS
- Performance comparisons for the parallel/distributed versions were made using input sets larger than ours (up to 2^23)
* URL:
22 CONCLUSIONS, REMARKS AND FUTURE WORKS
23 General conclusions
- The porting activities concerning MPI-OpenMP were easy and fast
- OpenMP is easy, but sometimes it is pointless to waste time trying to use this paradigm (see mod2f)
- For well-known kernels, vendor multi-threaded libraries are usually the winning choice
- If we want to look only at performance, we need to increase the input data set (especially when we use distributed versions of the kernels)
24 Remarks: integrating multi-threaded libraries
- Distributed functions can have different prototypes and different conventions --> this requires knowledge of the library
- Native distributed functions are efficient and fast but do not ensure easy portability
- Different versions of the same library can have different requirements in terms of linking and naming conventions
- Use the library safely (and the library must itself be safe) --> using multi-threaded libraries and OpenMP regions together requires care
25 Remarks: how to realize the mixing
There are 2 ways:
1. Serial --> Multi-threading (OpenMP) --> Parallel/Distributed Multi-threading (OpenMP+MPI)
2. Serial --> Parallel/Distributed (MPI) --> Multi-threading Parallel/Distributed (MPI+OpenMP)
Q: But are there differences?
A: Of course! Because different goals have to be achieved at different levels.
26 (Possible) Future Works
- Replicate the porting activities using Fortran instead of C
- Performance measurements with/without Simultaneous Multi-Threading (SMT)
- (Try to) evaluate quantitatively the impact of thread affinity (OpenMP) and process placement (MPI)
- Move to other architectures
- Evaluate the effort (time) to support other multi-threaded libraries (from MKL to ACML, ESSL/PESSL, NAG, ...)
- Evaluate whether other (open-source) multi-threaded libraries are more or less efficient than MKL in terms of performance
- (Try to) use OpenMP to manage the workload between multiple accelerators transparently and efficiently
27 Last but not least
Let's start to play with real applications!
- The MPI-OpenMP paradigm is mature enough to be used by production codes, and today there are both good compilers and good libraries.
- MPI-OpenMP is increasing in importance as a programming model because many pure MPI programs do not scale well with very large numbers (up to 1024) of MPI tasks.
See "Programming models: Hybrid programming with MPI & OpenMP" (Carlo Cavazzoni, CINECA, Italy), presented at the PRACE workshop on application porting and performance tuning at CSC, Finland (11-12 June, 2009).
28 THANK YOU FOR YOUR ATTENTION
More informationCP2K Performance Benchmark and Profiling. April 2011
CP2K Performance Benchmark and Profiling April 2011 Note The following research was performed under the HPC Advisory Council HPC works working group activities Participating vendors: HP, Intel, Mellanox
More informationIntel Math Kernel Library (Intel MKL) Overview. Hans Pabst Software and Services Group Intel Corporation
Intel Math Kernel Library (Intel MKL) Overview Hans Pabst Software and Services Group Intel Corporation Agenda Motivation Functionality Compilation Performance Summary 2 Motivation How and where to optimize?
More informationRapidMind & PGI Accelerator Compiler. Dr. Volker Weinberg Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften
RapidMind & PGI Accelerator Compiler Dr. Volker Weinberg Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften volker.weinberg@lrz.de PRACE Workshop New Languages & Future Technology Prototypes
More informationBenchmark runs of pcmalib on Nehalem and Shanghai nodes
MOSAIC group Institute of Theoretical Computer Science Department of Computer Science Benchmark runs of pcmalib on Nehalem and Shanghai nodes Christian Lorenz Müller, April 9 Addresses: Institute for Theoretical
More information[Potentially] Your first parallel application
[Potentially] Your first parallel application Compute the smallest element in an array as fast as possible small = array[0]; for( i = 0; i < N; i++) if( array[i] < small ) ) small = array[i] 64-bit Intel
More informationOpenMP 4.0/4.5. Mark Bull, EPCC
OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all
More informationExperiences with ENZO on the Intel Many Integrated Core Architecture
Experiences with ENZO on the Intel Many Integrated Core Architecture Dr. Robert Harkness National Institute for Computational Sciences April 10th, 2012 Overview ENZO applications at petascale ENZO and
More informationSimulation using MIC co-processor on Helios
Simulation using MIC co-processor on Helios Serhiy Mochalskyy, Roman Hatzky PRACE PATC Course: Intel MIC Programming Workshop High Level Support Team Max-Planck-Institut für Plasmaphysik Boltzmannstr.
More informationSPIRAL, FFTX, and the Path to SpectralPACK
SPIRAL, FFTX, and the Path to SpectralPACK Franz Franchetti Carnegie Mellon University www.spiral.net In collaboration with the SPIRAL and FFTX team @ CMU and LBL This work was supported by DOE ECP and
More informationCompute Node Linux: Overview, Progress to Date & Roadmap
Compute Node Linux: Overview, Progress to Date & Roadmap David Wallace Cray Inc ABSTRACT: : This presentation will provide an overview of Compute Node Linux(CNL) for the CRAY XT machine series. Compute
More informationLS-DYNA Performance on Intel Scalable Solutions
LS-DYNA Performance on Intel Scalable Solutions Nick Meng, Michael Strassmaier, James Erwin, Intel nick.meng@intel.com, michael.j.strassmaier@intel.com, james.erwin@intel.com Jason Wang, LSTC jason@lstc.com
More informationComparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster G. Jost*, H. Jin*, D. an Mey**,F. Hatay*** *NASA Ames Research Center **Center for Computing and Communication, University of
More information