4 Technical Report (publishable)

ARCHER eCSE Technical Report
Algorithmic Enhancements to the Solvaware Package for the Analysis of Hydration

Technical staff: Arno Proeme (EPCC, University of Edinburgh)
PI: David Huggins (TCM, Department of Physics, University of Cambridge)
4.1 Acknowledgement

This work was funded under the embedded CSE programme of the ARCHER UK National Supercomputing Service (http://www.archer.ac.uk).

4.2 Abstract

This report details the work undertaken as part of an ARCHER eCSE project to port, optimise, and enhance the Solvaware software package for the analysis of solvation (hydration) in biomolecular systems. Each planned work package is reported against, and results are presented together with a technical discussion.

4.3 Introduction

Solvaware consists of a suite of tools in the form of Perl scripts and C++ code that, taken together, largely automate a workflow enabling users to generate molecular dynamics (MD) trajectories and analyse them using a novel implementation of an algorithm based on the statistical mechanical method of inhomogeneous fluid solvation theory (IFST). This is a transformative technique in structure-based drug design, as it allows users to model and understand the thermodynamics of individual water molecules around biomolecules.

The first stage of the workflow is to prepare the input structure and generate MD trajectories using the popular third-party high-performance MD package NAMD [1]. The flow diagram in Figure 1 illustrates the workflow. Each of the steps in this stage is run in sequence and controlled by a Perl script. Each Perl script reads a Solvaware configuration file, which contains approximately 50 options. The last two steps run simulations using NAMD, whereas the first two steps run utilities written in C++ that are part of Solvaware, as well as one external dependency, the freely available Solvate utility [2]. Steps 1-2 are very quick, taking on the order of a minute to complete in serial on modern processor architectures. Depending on the system being studied and the computational resources allocated, step 3 (equilibration) takes on the order of 3 hours to complete, and step 4 (production dynamics run) on the order of 24 hours. In terms of storage, step 4 produces a NAMD output trajectory in the DCD binary file format that is 3 GB-10 GB in size, again depending on the size of the system.
Figure 1 - Flow diagram illustrating the workflow. The stages shown are:
1. Preparation: read all input parameters from the Solvaware configuration file; read the protein structure/connectivity from PDB and PSF files; prepare the protein structure and assign forcefield parameters.
2. Setup: set up periodic boundary conditions; solvate the system; set the water model; write PDB and PSF system structure files for NAMD.
3. Equilibration: write NAMD input files for the MD equilibration simulations; write job scripts for the MD equilibration and production simulations; run the MD equilibration simulation using NAMD.
4. Dynamics: write NAMD input files and job scripts for the MD production simulations; run the production MD simulation using NAMD.
Entropy calculations: analyse MD trajectories to calculate per-voxel entropies.
Energy calculations: analyse MD trajectories to calculate per-voxel energies.
Results: combine the results from the entropy and energy calculations to generate a PDB file for visualising the per-voxel free-energy density.

The latter stages of the Solvaware workflow involve the IFST analysis, which is implemented in C++. This amounts to calculating the mean enthalpy and performing a k-nearest neighbour (KNN) search in three dimensions on snapshots from the MD trajectory to calculate the mean entropy. This allows the solvation (hydration) free energy for the solute to be computed, which can give novel insight into intermolecular interactions.

4.4 Project Work Packages

WP1 Develop a Solvaware package on ARCHER

An improved Solvaware package containing the enhancements implemented across all work packages has been made centrally available on ARCHER. In this section we describe the enhancements that were made specifically to facilitate
the creation, adoption, and efficient usage of this package by ARCHER users, as well as its potential deployment and use on other, similar systems.

As part of the process of porting Solvaware to ARCHER, the existing Perl scripts were generalised away from close integration with the University of Cambridge's Darwin machine, on which they had thus far been used, by removing explicit executable paths and options specific to that machine. Scripts were modified to read machine-specific options specified at package installation time or at run time, as appropriate, as well as settings supplied by the environment module system that is commonly used on high-performance computing systems to load and modify user environments, including executable paths. Scripts were also modified to read some of the run-time options from a configuration file that users are already expected to modify to set up their problem for Solvaware to solve, thereby enhancing transparency and ease of use. Efficient machine-specific installation and runtime defaults for ARCHER and Darwin were included.

Single-node and multi-node NAMD benchmarking was performed in order to determine the optimal use of ARCHER for typical biomolecular systems simulated with NAMD for subsequent analysis by Solvaware; see Figures 2 and 3.

Figure 2 - Single-node NAMD benchmark of the cb7amo test case on ARCHER. The best time is 96 seconds using 48 tasks (hyperthreading enabled). The best time without hyperthreading enabled is 124 seconds for 24 tasks.

As shown in Figure 2, benchmarking revealed that for systems typically explored using Solvaware a performance gain of 20-30% during the time-consuming MD simulation stage can be achieved by enabling Intel hyperthreads. This setting was included in the default job script for NAMD generated by Solvaware, thus ensuring efficient use of ARCHER by users of Solvaware.
Figure 3 - Multi-node NAMD benchmark of the cb7amo test case on ARCHER. Parallel efficiency drops to ~50% from 6 nodes onwards.

Also included in the package repository are an installation script and an environment module file that serves as a template for deployment on other machines. Some scripts and utilities were rewritten to make the workflow more seamless, especially for non-expert users, requiring less knowledge and intervention and providing more informative feedback during each step. Identical code duplicated across Perl scripts was refactored into a shared library, thereby facilitating future development work. Finally, documentation in the form of a user manual, explaining how to use the package and detailing the meaning and relevance of the available options, was produced and made available on the ARCHER website.

WP2 Improve Efficiency of k-nearest Neighbours Algorithm

The molecular dynamics simulation performed in the first stage of the Solvaware workflow produces a trajectory for all the atoms in the system, i.e. both those that make up a biomolecule of interest, typically a protein, and those that make up the solvent molecules, typically water, surrounding the biomolecule. This trajectory is stored in a DCD format binary file as a series of snapshots, or frames, each of which contains the positions of all atoms at one instant of the simulation. A key part of the second stage of the Solvaware workflow is the determination, by a k-nearest neighbours (KNN) algorithm, of the k water molecules closest, under an appropriate distance metric and across all frames, to a given water molecule. Computing this for all water molecules allows the IFST algorithm to compute
entropies and hence, in combination with a computation of the energies, the solvation (hydration) free energy.

Grid-based restructuring and nearest-voxels approximation

Prior to this eCSE project the KNN algorithm in the existing serial C++ code made a costly all-to-all comparison: for F frames chosen from the trajectory and W water molecules present in the system, and defining n = WF, the algorithm performed O(n^2) distance metric evaluations. This naive algorithm was problematic because computing the hydration free energy for large systems, or with high accuracy (many frames), was too costly.

To improve the efficiency of the KNN algorithm, it was restructured by introducing a three-dimensional spatial grid consisting of voxels, with one voxel associated to each point on the grid. This grid was first implemented as a three-dimensional STL vector whose components are instances of a newly defined Voxel class:

    vector<vector<vector<Voxel> > > allvoxels;

Each Voxel in turn stores a vector of newly defined Water objects:

    vector<Water> waterposes_;

Finally, each Water contains 11 doubles that store the position of a water molecule and its orientation in a quaternion representation, as well as an integer that stores the frame in which this water molecule was encountered in the DCD file. Before the KNN algorithm executes, the desired trajectory data is read from the DCD file and all water molecules are stored in voxels according to their spatial location. The size of the grid is determined by the size of the cell used during the MD simulations, which is identical throughout the workflow and specified in a configuration file.

This restructuring allowed an extremely fruitful approximation to be made to the KNN algorithm, namely to restrict the KNN search to consider only water molecules located in voxels that surround the one containing a given water molecule of interest. The current implementation allows the search range to vary from arbitrarily distant voxels down to just the immediately neighbouring voxels. For the approximation to remain accurate, if one chooses to increase the number of neighbouring waters, k, to be considered, one will eventually also need to extend the voxel search range to include the voxels in which these more distant neighbouring waters can be found.

The benefits of avoiding O(n^2) scaling thanks to these enhancements are already apparent at very modest scale. An IFST calculation over a mere 100 frames of a DCD file (which can contain O(10^6) frames in total) takes roughly 900 seconds using the old all-to-all KNN code, compared to less than a second for the restructured grid-based version, where both codes use k=1 (considering the nearest water only) and the grid-based case restricts the search to immediately neighbouring voxels (±1 in the x, y, and z directions). The difference in the resulting entropy is less than 4%.
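As an illustration of this data layout and of restricting the search to neighbouring voxels, the following minimal C++ sketch bins waters into a cubic voxel grid and finds the nearest-neighbour distance by examining only voxels within a given range of the query voxel. It is a simplified sketch rather than the Solvaware implementation: the Grid class, the insert and nearestDistance functions, the cubic box with coordinates wrapped into [0, L), and the plain Euclidean distance on positions (ignoring orientations and periodic images) are illustrative assumptions; only the Voxel and Water class names, the waterposes_/allvoxels members, and the nested-vector layout follow the description above.

    // Minimal illustrative sketch (not the Solvaware source): a voxel grid of
    // water positions and a nearest-neighbour search restricted to voxels
    // within +/- range of the query voxel.
    #include <algorithm>
    #include <cmath>
    #include <vector>
    using std::vector;

    struct Water {
        double x, y, z;   // position (the real Water also stores a quaternion orientation)
        int frame;        // frame of the DCD file in which this water was encountered
    };

    struct Voxel {
        vector<Water> waterposes_;   // waters whose positions fall inside this voxel
    };

    struct Grid {
        double voxelSize;                            // edge length of a cubic voxel
        int nx, ny, nz;                              // number of voxels per dimension
        vector<vector<vector<Voxel> > > allvoxels;   // allvoxels[x][y][z]

        Grid(double boxLength, double size)
            : voxelSize(size),
              nx(int(boxLength / size)), ny(nx), nz(nx),
              allvoxels(nx, vector<vector<Voxel> >(ny, vector<Voxel>(nz))) {}

        // Bin a water into the voxel containing its position.
        void insert(const Water& w) {
            allvoxels[int(w.x / voxelSize)][int(w.y / voxelSize)][int(w.z / voxelSize)]
                .waterposes_.push_back(w);
        }

        // Distance to the nearest other water, searching only voxels within
        // +/- range of the voxel containing w (range = 1: immediate neighbours).
        double nearestDistance(const Water& w, int range) const {
            int ix = int(w.x / voxelSize), iy = int(w.y / voxelSize), iz = int(w.z / voxelSize);
            double best = 1.0e30;
            for (int x = std::max(0, ix - range); x <= std::min(nx - 1, ix + range); ++x)
              for (int y = std::max(0, iy - range); y <= std::min(ny - 1, iy + range); ++y)
                for (int z = std::max(0, iz - range); z <= std::min(nz - 1, iz + range); ++z)
                  for (std::size_t i = 0; i < allvoxels[x][y][z].waterposes_.size(); ++i) {
                      const Water& o = allvoxels[x][y][z].waterposes_[i];
                      double d2 = (w.x - o.x) * (w.x - o.x) + (w.y - o.y) * (w.y - o.y)
                                + (w.z - o.z) * (w.z - o.z);
                      if (d2 > 0.0 && d2 < best) best = d2;   // skip the query water itself
                  }
            return std::sqrt(best);
        }
    };

In the real code the k nearest distances for each water feed into the IFST entropy estimate; the sketch above returns only the single nearest distance (k = 1).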
Shared-memory Parallelisation

This restructured code was, however, still serial, and could therefore not make efficient use of ARCHER compute nodes. It is in principle possible to modify the code to run concurrent but completely independent instances of the executable in such a way that each instance computes entropy contributions from a subsection of the grid, following a division of labour through instructions given to it at run time, with the results recombined afterwards. Indeed, this approach was taken to run the old code on the Darwin cluster. However, this is not possible on ARCHER, as the Application Level Placement Scheduler's aprun command does not allow the execution of concurrent serial instances of an application on a single node. There was therefore a clear need to adopt a parallel programming model such as OpenMP or MPI in order to enable the application to run at all efficiently on ARCHER. Moreover, introducing a parallel programming model into the code allows for much more flexibility than a static decomposition, with potential for load balancing, tuning of parallelism, and extension of the code beyond what can be dealt with through a simple static decomposition.

Although the initial project proposal set out the intention to introduce a distributed-memory parallelisation using MPI, it was decided to proceed instead with a shared-memory parallelisation of the grid-structured code using OpenMP, for a number of reasons:

(a) OpenMP was considered easier and quicker to implement incrementally, as fewer modifications need to be made to the code and no thought needs to be given to how to decompose the problem and distribute the DCD data, as would be needed in a distributed-memory approach. It is arguably also easier to debug, and provides a number of ways to tune parallel performance, either through environment settings at runtime (e.g. loop scheduling) or by minor modifications to directives.

(b) For the foreseeable future, and certainly for the near future, the majority of use cases were thought to involve analysis of DCD files that fit into single-node memory (64 GB to 128 GB available on ARCHER) in their entirety.

(c) Memory profiling of a more sophisticated version of the IFST algorithm revealed memory usage to be a bottleneck. Whilst some memory-intensive bookkeeping variables could be eliminated by restructuring the code to use a three-dimensional vector of Voxels instead of the current one-dimensional vector of Voxels, it was deemed that one shared-memory instance with N threads would be less likely to hit the single-node memory bottleneck than N distributed-memory instances, which would all be likely to duplicate some data.
(d) It avoids the need to resolve contention issues on reading DCD files, which was originally anticipated to be necessary and was planned as Work Package 4.

OpenMP Implementation

The computationally intensive section of the restructured serial IFST code executes the KNN algorithm after the DCD file has been read in and its water molecules distributed over the grid's voxels. This section starts with three nested loops over the x, y and z dimensions of the grid, and is therefore naturally parallelised with an omp for construct (embedded in an omp parallel region):

    #pragma omp for collapse(3) schedule(runtime)
    for (unsigned int xvoxel = 0; xvoxel < xvoxels; xvoxel++) {
        for (unsigned int yvoxel = 0; yvoxel < yvoxels; yvoxel++) {
            for (unsigned int zvoxel = 0; zvoxel < zvoxels; zvoxel++) {
                ...

In order to obtain good performance it was found to be essential to use the collapse clause to collapse the iterations of all three loops into one larger iteration space of tasks, which are then assigned to the available threads in accordance with the specified scheduling policy [3]. This amounts to the computation of the k nearest neighbours and the entropy contributions for the waters located in each particular voxel being handled by a single thread. The runtime schedule was chosen to allow easy experimentation with different thread scheduling policies (static, dynamic, guided, or auto), without needing to rebuild the code, by setting the environment variable OMP_SCHEDULE. If OMP_SCHEDULE is not defined at runtime the default thread scheduling behaviour depends on the compiler; the GNU compiler, typically used to compile Solvaware, defaults to dynamic scheduling with a chunk size of 1 when OMP_SCHEDULE is undefined.

In order to avoid a performance penalty due to synchronisation on writes to shared variables, which would otherwise need to be protected, the contributions to the system's total entropy, as well as to the spatially dependent distributions of entropy and density, were written to thread-private variables inside the parallel loop and manually reduced into the shared variables in a critical region following the parallel loop. Results were verified to be correct.

[3] OpenMP API, version 4.0, July 2013, sections … and 4.1. Available at http://www.openmp.org/.
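The accumulation pattern described above can be sketched as follows. This is an illustrative fragment rather than the Solvaware source: the variable names (totalEntropy, localEntropy, entropyGrid) and the per-voxel placeholder work are assumptions, and the per-thread accumulators are realised here as variables declared inside the parallel region (and therefore private to each thread), which is one straightforward way to implement the thread-private accumulation and manual critical-region reduction described in the text.

    // Illustrative sketch of the parallel loop with per-thread accumulation and a
    // manual reduction into shared variables in a critical region (not the
    // Solvaware source; names and the per-voxel work are placeholders).
    #include <cstddef>
    #include <vector>

    int main() {
        const unsigned int xvoxels = 32, yvoxels = 32, zvoxels = 32;
        double totalEntropy = 0.0;                                   // shared result
        std::vector<double> entropyGrid(xvoxels * yvoxels * zvoxels, 0.0);

        #pragma omp parallel
        {
            double localEntropy = 0.0;                               // per-thread accumulator
            std::vector<double> localGrid(entropyGrid.size(), 0.0);

            #pragma omp for collapse(3) schedule(runtime)
            for (unsigned int x = 0; x < xvoxels; x++)
                for (unsigned int y = 0; y < yvoxels; y++)
                    for (unsigned int z = 0; z < zvoxels; z++) {
                        // Placeholder for the KNN search and entropy contribution
                        // of the waters in voxel (x, y, z).
                        double s = 1.0e-3 * (x + y + z);
                        localGrid[(x * yvoxels + y) * zvoxels + z] += s;
                        localEntropy += s;
                    }

            // Manual reduction of per-thread partial results into shared variables.
            #pragma omp critical
            {
                totalEntropy += localEntropy;
                for (std::size_t i = 0; i < entropyGrid.size(); ++i)
                    entropyGrid[i] += localGrid[i];
            }
        }
        return 0;
    }

With schedule(runtime), setting for example OMP_SCHEDULE="guided,4" in the environment then changes the scheduling policy without rebuilding the code.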
Figure 4 - Walltime spent in the parallelised IFST calculation for the cb7amo benchmark, analysing 1000 DCD frames, with OMP_SCHEDULE not set and hence corresponding to dynamic thread scheduling with chunk size 1 according to the GCC default.

Figure 5 - Speedup of the parallelised IFST calculation on ARCHER for the cb7amo benchmark (same data as Figure 4).
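For reference, the speedup plotted in Figure 5 and the parallel efficiency quoted below assume the standard definitions in terms of the walltime T_1 on one thread and the walltime T_p on p threads:

    S(p) = T_1 / T_p,        E(p) = S(p) / p

so that the 85% efficiency reported at 24 threads corresponds to a speedup of roughly 0.85 × 24 ≈ 20.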
To summarise, the KNN algorithm and the IFST calculation as a whole have been optimised serially and parallelised to make efficient use of ARCHER. Timing results for the parallelised IFST calculation show excellent scaling (see Figures 4 and 5), with a parallel efficiency of 85% at 24 threads.

WP3 Implement k-means clustering

A new utility called ClusterDensityByKMeans has been added to the repository. This utility uses the following inputs:
- The system PSF and PDB files generated in stage 2 of the workflow in Figure 1
- The DCD file generated in stage 4 of the workflow in Figure 1
- A user-defined number of frames
- A user-defined hydration site radius
- A user-defined region of interest, specified by a point in 3D and a radius

The utility generates a PDB file containing atoms at regions with a high number density of water. The clustering is achieved using a k-means approach: the water positions from each frame are overlaid to create a density profile, and these points are then clustered using the k-means algorithm, exiting when all water positions are within the user-defined hydration site radius of a cluster centre. The resulting PDB file can be visualised in VMD or an alternative molecular visualiser. A sketch of this clustering loop is given below.
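The following minimal sketch illustrates the clustering loop just described, assuming the overlaid water positions have already been collected into a single array. It is not the ClusterDensityByKMeans implementation: the function names, the crude seeding of the initial centres, the fixed iteration count, and the strategy of simply increasing k until the hydration-site-radius condition is met are all illustrative assumptions.

    // Illustrative sketch (not the ClusterDensityByKMeans source): cluster overlaid
    // water positions with k-means, increasing k until every water position lies
    // within the user-defined hydration site radius of a cluster centre.
    #include <algorithm>
    #include <vector>

    struct Point { double x, y, z; };

    static double dist2(const Point& a, const Point& b) {
        return (a.x-b.x)*(a.x-b.x) + (a.y-b.y)*(a.y-b.y) + (a.z-b.z)*(a.z-b.z);
    }

    // One run of Lloyd's algorithm for a fixed k; returns true if every point ends
    // up within 'radius' of its nearest centre.
    static bool kMeansWithinRadius(const std::vector<Point>& pts, int k, double radius,
                                   std::vector<Point>& centres, int maxIter = 100) {
        centres.resize(k);
        for (int c = 0; c < k; ++c) centres[c] = pts[(c * pts.size()) / k];  // crude seeding
        std::vector<int> assign(pts.size(), 0);

        for (int iter = 0; iter < maxIter; ++iter) {
            // Assignment step: attach each point to its nearest centre.
            for (std::size_t i = 0; i < pts.size(); ++i) {
                double best = dist2(pts[i], centres[0]); assign[i] = 0;
                for (int c = 1; c < k; ++c) {
                    double d = dist2(pts[i], centres[c]);
                    if (d < best) { best = d; assign[i] = c; }
                }
            }
            // Update step: move each centre to the mean of its assigned points.
            std::vector<Point> sum(k, Point{0, 0, 0});
            std::vector<int> count(k, 0);
            for (std::size_t i = 0; i < pts.size(); ++i) {
                sum[assign[i]].x += pts[i].x; sum[assign[i]].y += pts[i].y;
                sum[assign[i]].z += pts[i].z; ++count[assign[i]];
            }
            for (int c = 0; c < k; ++c)
                if (count[c] > 0) centres[c] = Point{sum[c].x / count[c],
                                                     sum[c].y / count[c],
                                                     sum[c].z / count[c]};
        }
        // Check the hydration-site-radius condition against the final centres.
        for (std::size_t i = 0; i < pts.size(); ++i) {
            double best = dist2(pts[i], centres[0]);
            for (int c = 1; c < k; ++c) best = std::min(best, dist2(pts[i], centres[c]));
            if (best > radius * radius) return false;
        }
        return true;
    }

    // Increase k until all waters fall within the hydration site radius of a centre.
    std::vector<Point> clusterHydrationSites(const std::vector<Point>& pts, double radius) {
        std::vector<Point> centres;
        for (int k = 1; k <= (int)pts.size(); ++k)
            if (kMeansWithinRadius(pts, k, radius, centres)) break;
        return centres;
    }

Each returned centre would then be written as an atom record in a PDB file so that the hydration sites can be visualised in VMD.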
WP4 Optimise I/O of Molecular Dynamics trajectory

The project proposal for this work package was based on a concern that each instance (e.g. MPI process) of the analysis software would read the same 3 GB-10 GB NAMD trajectory, and that this would cause I/O file contention issues if they all attempted to read at the same time. Options considered were to stagger the reads, split up the DCD files, or pre-process the trajectories. However, as discussed in the section on Work Package 2, the shared-memory approach adopted for the IFST analysis resolves the concern over potential file contention issues.

WP5 Extend functionality of Solvaware package

A number of further enhancements were made to the Solvaware package in accordance with the intended goals set out in the project proposal. These include:
- The addition of an additional water model (TIP3P)
- The addition of another forcefield (CHARMM36)
- The facility to perform energy minimisation on the solute in preparation for MD simulation
- Automated analysis of MD trajectory quality, fed back to the user as part of the workflow. Example output of the temperature and energy is shown in Figure 6 and Figure 7 respectively.

Figure 6 - Output of the temperature from the automated analysis of MD trajectory quality.

Figure 7 - Output of the energy from the automated analysis of MD trajectory quality.

4.5 Conclusion

Solvaware has been ported to ARCHER, giving the ARCHER community access to high-quality software for the analysis of biomolecular hydration. The software has been parallelised using OpenMP, its efficiency has been dramatically increased, and its functionality has been improved.