Generators at the LHC

1 High Performance Computing for Event Generators at the LHC
A Multi-Threaded Version of MCFM, J.M. Campbell, R.K. Ellis, W. Giele.
Higgs boson production in association with a jet at NNLO using jettiness subtractions, R. Boughezal, C. Focke, W. Giele, X. Liu, F. Petriello.
Z-boson production in association with a jet at next-to-next-to-leading order in perturbative QCD, R. Boughezal, J.M. Campbell, R.K. Ellis, C. Focke, W. Giele, X. Liu, F. Petriello.
Color singlet production at NNLO in MCFM, R. Boughezal, J.M. Campbell, R.K. Ellis, C. Focke, W. Giele, X. Liu, F. Petriello, C. Williams, 2016.

2 Introduction

3 LHC physics & Event Generators
Experiments require precise predictions of known physics in order to extract new physics, such as the Higgs boson. One of the goals at the LHC is to measure the properties of the newly discovered Higgs boson, which requires measuring its couplings to other particles at high precision. Given the precision of the LHC measurements, increasingly accurate theory predictions are needed (there are many background events compared to signal). Reaching the desired accuracy on the theory predictions requires high performance computing, which forces us to focus on it when developing tools for the experimenters.

4 The MCFM parton level event generator
The event generator we use is MCFM: An update on vector boson pair production at hadron colliders, J.M. Campbell, R.K. Ellis, Phys. Rev. D60: (1999). MCFM has been evolving since 1999 and can currently make predictions for hundreds of processes at the LHC. Current LHC phenomenology requires ever higher precision, which means higher order corrections have to be calculated in the event generator. This results in an exponential increase in the required computer resources.

5 Making predictions
MCFM makes predictions at the parton level, order-by-order in the strong coupling constant. This means it predicts the jet momenta, but not the content of the jets. One can match a shower Monte Carlo such as PYTHIA to MCFM to get the particle content of the jets. The event generator consists of two parts: the calculation of the parton scattering amplitudes, and the integration of the partons over phase space to get the observables.

6 Calculating scattering amplitudes
To calculate the partonic scattering amplitudes one uses Feynman rules. This allows the calculation of observables order-by-order in the strong coupling constant. The calculations are algorithmic in nature: they can be carried out analytically or through numerical algorithms. A lot of effort has gone into evaluating partonic scattering amplitudes at leading and next-to-leading order, where the calculation is a well defined task. The current effort is to calculate at next-to-next-to-leading order.

7 Phase space integration and Vegas
The integration of the partonic scattering amplitude over phase space is more of a black art. The integrals are high dimensional (10-15 dimensions) over very complex functions. The standard tool is VEGAS, an adaptive integration algorithm. While imperfect, it performs well at next-to-leading order; at next-to-next-to-leading order it struggles a bit. This is where almost all the computer time is spent, with typically ~10^9 events. The goal is to do this in a time of order hours on a medium sized cluster.
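To make the idea of an adaptive grid concrete, here is a minimal one-dimensional sketch in Python. It is not the VEGAS implementation used in MCFM. The function names (`refine_grid`, `vegas_1d`, `peak`), the damping exponent `alpha` and the test integrand are all assumptions made for illustration. Each iteration samples events from the current grid, estimates the integral, and then moves the bin edges so that every bin carries a similar share of the integrand.

```python
import math
import random


def refine_grid(edges, importance):
    """Move bin edges so every new bin holds an equal share of the importance."""
    n_bins = len(edges) - 1
    total = sum(importance)
    if total <= 0.0:
        return list(edges)
    share = total / n_bins
    new_edges = [edges[0]]
    accumulated = 0.0  # importance of all old bins to the left of old bin k
    k = 0
    for i in range(1, n_bins):
        goal = i * share
        while k < n_bins - 1 and accumulated + importance[k] < goal:
            accumulated += importance[k]
            k += 1
        inside = importance[k]
        fraction = (goal - accumulated) / inside if inside > 0.0 else 1.0
        fraction = min(max(fraction, 0.0), 1.0)
        new_edges.append(edges[k] + fraction * (edges[k + 1] - edges[k]))
    new_edges.append(edges[-1])
    return new_edges


def vegas_1d(f, n_bins=50, iterations=6, n_events=5000, alpha=0.5, seed=1):
    """Return one (estimate, error) pair per iteration for the integral of f on [0, 1]."""
    rng = random.Random(seed)
    edges = [i / n_bins for i in range(n_bins + 1)]
    results = []
    for _ in range(iterations):
        sum_w = 0.0
        sum_w2 = 0.0
        importance = [0.0] * n_bins
        for _ in range(n_events):
            k = rng.randrange(n_bins)            # pick a bin uniformly
            width = edges[k + 1] - edges[k]
            x = edges[k] + rng.random() * width  # uniform inside the bin
            w = f(x) * n_bins * width            # f(x) / sampling density
            sum_w += w
            sum_w2 += w * w
            importance[k] += abs(w)
        mean = sum_w / n_events
        variance = max(sum_w2 / n_events - mean * mean, 0.0) / (n_events - 1)
        results.append((mean, math.sqrt(variance)))
        # alpha < 1 damps the adaptation (an assumption of this sketch)
        edges = refine_grid(edges, [d ** alpha for d in importance])
    return results


def peak(x, sigma=0.02):
    """Toy integrand: a narrow normalized Gaussian, integral on [0, 1] is ~1."""
    return math.exp(-0.5 * ((x - 0.5) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    for i, (estimate, error) in enumerate(vegas_1d(peak), start=1):
        print(f"iteration {i}: {estimate:.4f} +/- {error:.4f}")
```

In this toy case the statistical error should shrink once the grid has concentrated its bins around the peak. A real phase space integral applies the same idea in 10-15 dimensions, one grid per dimension.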

8 High Performance Computing: MPI
MPI runs a copy of the program on each node and does simple communication between the nodes by sending messages (data), allowing simple parallelization. It is simple to program, but requires an MPI library and its compiler wrapper rather than the plain compiler. It is made to run a job on different nodes, each with its own CPU and memory, with limited exchange of data over a network. However, current CPUs include more and more computing cores (threads) for parallelization, which causes problems when using MPI.

9 High Performance Computing: OpenMP
OpenMP runs on a single motherboard with unified memory. It is made to make use of the multi-threading on modern CPU chips. It supports shared memory between threads, which is crucial to get good scaling. OpenMP is supported by the standard C/C++/Fortran compilers. It is straightforward to add OpenMP directives to an existing program, but to actually speed up the program some thought has to be given to the use of memory.

10 High Performance Computing: hybrid MPI/OpenMP
While OpenMP is perfect for multi-threaded parallel programming, you still need a way to distribute the work over a cluster. For this you can use MPI, making your program scale on large clusters such as the Cori cluster at NERSC as well as on your local cluster or your own desktop. We could change the existing MCFM event generator fairly easily by adding OpenMP compiler directives and a few MPI instruction lines. Making it work and validating it took a while.

11 Paradigm shift in programming philosophy
One important concept to understand in parallel programming is the difference between memory bound and compute bound limits. We are used to serial programming, where storing and reusing data was often preferable to recalculating things. In parallel programming, however, many threads making memory requests leaves all the threads sitting idle, and the program speed is dictated by memory access (your program does not scale). Using shared memory, or recalculating instead of storing data, overcomes this and makes the run time scale with the number of threads used.

12 Paradigm shift in programming philosophy
One's first instinct is to run independent jobs, each with different random numbers, on each thread and combine the results. This runs into massive memory bound issues and no acceleration is obtained. Even worse, execution often slows down significantly. Proper use of OpenMP is crucial for proper scaling, and involves giving some thought to memory usage. Optimizing the shared memory usage is critical to reach the compute limit.

13 Making a parallel event generator

14 Putting it all together
I gave an overview of all the components needed to construct an event generator that runs and scales on modern processors and clusters. We can now put everything together and use it on realistic physics predictions to see how it works. The first step is to use OpenMP and get the event generator to scale properly on a single node/motherboard (A Multi-Threaded Version of MCFM, J.M. Campbell, R.K. Ellis, W. Giele). The next step is to build in support for running on clusters using MPI (Color singlet production at NNLO in MCFM, R. Boughezal, J.M. Campbell, R.K. Ellis, C. Focke, W. Giele, X. Liu, F. Petriello, C. Williams, 2016).

15 How to parallelize
The Monte Carlo adaptive integration is done through VEGAS. In each iteration the grid is optimized using the nevents generated events. This means that in the next iteration the randomly generated events follow the scattering amplitude more closely. This allows a fairly simple parallelization of the event generation.
do i=1,iterations
   do j=1,nevents
      Evaluate a randomly generated event
   enddo
   Optimize grid
enddo

16 Including OpenMP
We use OpenMP so that the inner loop is spread over the available threads. For the optimization of the grid the results of all threads are used. To debug the parallelized event generation, we ensured that exactly the same events were generated independent of the number of threads used. With this in place the bugs (due to parallelization) were readily exposed.
do i=1,iterations
   do j=1,nevents
      Evaluate a randomly generated event
   enddo
   Optimize grid
enddo
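One way to get the same events regardless of the number of threads is to derive each event's random numbers from the event index rather than from the thread. The Python sketch below illustrates that idea only. The seeding scheme, `BASE_SEED`, `evaluate_event` and `run_events` are hypothetical, and the slides do not say how MCFM achieves this. Python threads are used here to show reproducibility, not to gain speed.

```python
import random
from concurrent.futures import ThreadPoolExecutor

BASE_SEED = 12345  # hypothetical run seed


def evaluate_event(event_index):
    """Toy event weight; the random numbers depend only on the event index."""
    rng = random.Random(BASE_SEED * 1_000_003 + event_index)
    x = rng.random()
    return 3.0 * x * x  # integrates to 1 on [0, 1]


def run_events(n_events, n_threads):
    """Spread the event loop over n_threads and return the mean weight."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() returns results in event order, so the sum below is
        # accumulated in the same order for any number of threads
        weights = list(pool.map(evaluate_event, range(n_events)))
    return sum(weights) / n_events


if __name__ == "__main__":
    print(run_events(2000, 1), run_events(2000, 4))
```

Because both the events and the order of summation are fixed, any difference between a 1-thread and an N-thread run points to a bug in the parallelization rather than to statistical noise.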

17 Hardware used
We use 4 different configurations to test the OpenMP version of MCFM:
Standard desktop with an Intel Core i7 (4 cores/8 threads, 3.4 GHz, 8 MB cache).
Dual Intel X5650 processor (2x6 cores, 2.66 GHz, 12 MB cache).
Quadruple AMD 6128 HE Opteron (4x8 cores, 2 GHz, 12 MB cache).
Xeon Phi co-processor (60 cores/240 threads, 1.1 GHz, 28.5 MB).
These are all single motherboards and could be in a workstation. The Xeon Phi slots into the PCI bus of a workstation.

18 First look at LO
We see the effect of hyper-threading on the Intel Core i7. The Intel Xeon scales very well and is fully compute bound. We see a memory bound issue for the AMD 6128 above 16 threads, and similarly for the Xeon Phi co-processor. Leading order is not particularly compute intensive; we need more compute intensive processes.


20 NLO performance
At next-to-leading order much more has to be calculated. As a result we see good scaling, without any memory bound issues. The Xeon Phi co-processor has 60 cores, each with 4 threads, and you can see some artifacts at the 60/120/180 thread boundaries. The overall performance of MCFM using OpenMP is very good, e.g. on the AMD motherboard performance increases by a factor of ~32!


22 Distributions (I)
The di-jet mass differential cross section for NLO pp → H(→ bb) + 2 jets, using 1 hour of run time on a single thread of the Intel Core i7 and on the quadruple AMD 6128 HE. We can do useful phenomenology studies at NLO with just an hour of run time using the multi-threaded version of MCFM. With the non-OpenMP version you would have to run for of order day(s) to get an equivalent result.

23 Distributions (II)
Using 4x1,500,000+10x15,000,000 events. At LO it takes 12 min on the 12-thread dual Intel Xeon X5650. At NLO it takes 22 hours on the 32-thread quad AMD Opteron. (It would take around a month for a single-thread evaluation on the Intel Core i7.)
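The "around a month" figure can be checked with a rough estimate, sketched below in Python. It assumes ideal scaling and equal speed per thread on both machines, which the slides do not claim. Reading "4x1,500,000+10x15,000,000" as 4 iterations of 1,500,000 events followed by 10 iterations of 15,000,000 events is also an assumption, as are the function names.

```python
def total_events(first_iters, first_events, second_iters, second_events):
    """Total number of events for an 'a x b + c x d' style run specification."""
    return first_iters * first_events + second_iters * second_events


def single_thread_days(wall_hours, n_threads):
    """Run time on one thread, assuming ideal scaling and equal thread speed."""
    return wall_hours * n_threads / 24.0


if __name__ == "__main__":
    print(total_events(4, 1_500_000, 10, 15_000_000))  # 156000000
    print(round(single_thread_days(22, 32), 1))        # 29.3
```

So 22 hours on 32 threads corresponds to about 704 thread-hours, or roughly 29 days on a single thread under these assumptions.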

24 Going to NNLO
The accuracy of the LHC measurements increasingly necessitates going to next-to-next-to-leading order. Going from LO to NLO took us from ~10 minutes to ~10 hours using OpenMP on a single motherboard. For NNLO we would need month(s), so we need to run on a cluster. This means we have to include MPI in the code.

25 Implementing MPI
The implementation is easy: it only takes a few added lines of code. The syntax is somewhat awkward, as is the compilation (requiring an MPI compiler wrapper). Because there is no shared memory, debugging is quite trivial compared to OpenMP. Because MPI implementations and cluster setups differ, some runtime tinkering is often needed (it depends a bit on the cluster hardware etc.).
The lines added around the event loop:
      call mpi_bcast(xi,ngrid*mxdim,mpi_double_precision,
     .     0,mpi_comm_world,ierr)
!$omp parallel do
!$omp& schedule(dynamic)
!$omp& default(private)
!$omp& shared(incall,xi,ncall,ndim,sfun,sfun2,d,cfun,cfun2,cd)
!$omp& shared(rank,size)
      do calls = 1, ncall/size
The loop structure, as before:
do i=1,iterations
   do j=1,nevents
      Evaluate a randomly generated event
   enddo
   Optimize grid
enddo
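The work split implied by `do calls = 1, ncall/size` can be illustrated without an MPI installation. The Python sketch below simulates the ranks with an ordinary loop and stands in for the final reduction with a plain sum. No real MPI calls are made, and `rank_partial_sum`, `integrate`, `BASE_SEED`, the per-call seeding and the toy integrand are hypothetical names and choices for illustration only.

```python
import math
import random

BASE_SEED = 2016  # hypothetical run seed


def toy_weight(x):
    """Toy integrand, integrates to 1 on [0, 1]."""
    return 3.0 * x * x


def rank_partial_sum(rank, size, ncall):
    """Work done by one (simulated) MPI rank: ncall/size calls of the integrand."""
    calls_per_rank = ncall // size
    first = rank * calls_per_rank
    total = 0.0
    for call in range(first, first + calls_per_rank):
        # seed by global call index so the events do not depend on 'size'
        rng = random.Random(BASE_SEED * 1_000_003 + call)
        total += toy_weight(rng.random())
    return total


def integrate(ncall, size):
    """Combine the partial sums of all ranks, as a reduction over ranks would."""
    partial_sums = [rank_partial_sum(rank, size, ncall) for rank in range(size)]
    return math.fsum(partial_sums) / (size * (ncall // size))


if __name__ == "__main__":
    print(integrate(4000, 1), integrate(4000, 4))
```

With `ncall` divisible by `size`, the 1-rank and 4-rank results agree up to floating point rounding, since only the order of summation differs. In the hybrid code each rank's loop would additionally be spread over OpenMP threads.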

26 Hardware used
We use 3 different configurations to test the hybrid OpenMP/MPI version of MCFM:
Dual Intel X5650 processor (2x6 cores, 2.66 GHz, 12 MB cache), part of a 24 node cluster.
Quadruple AMD 6128 HE Opteron (4x8 cores, 2 GHz, 12 MB cache), part of a 32 node cluster.
Xeon Phi co-processor (60 cores/240 threads, 1.1 GHz, 28.5 MB).
The NERSC Cori cluster uses a more recent version of the Xeon Phi on each node: there are 9,668 single-socket compute nodes in the system, each containing an Intel Xeon Phi processor at 1.40 GHz with 68 cores and support for 4 hardware threads per core (272 threads total per node).

27 Scaling on NERSC
The process is NLO pp → H + 2 jets, run on nodes with two 6-core Intel chips each, using 6 OpenMP threads per MPI task. It scales as expected up to ~5,000 threads (running on NERSC). Note that above 5,000 threads we get low on events per thread and become memory bound.

28 A first look at NNLO
Run time of pp → W+ at LO/NLO/NNLO from 1 up to 288 cores. The cluster consists of 24 nodes, each containing 2 processors of 6 cores. There are two running modes:
1 MPI job per node: 1x12 (divided cache).
2 MPI jobs per node, i.e. 1 MPI job per processor: 2x6.
Used 4x100,000+10x1,000,000 Vegas events. LO/NLO stopped scaling above 50/100 cores (memory dominated regime). 1x12 runs slower than 2x6 because OpenMP does not have to sync cache between the 2 processors in the 2x6 case.
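The numbers on this slide fit together as in the short Python sketch below. The helper names are hypothetical, and reading "4x100,000+10x1,000,000" as 4 iterations of 100,000 events followed by 10 iterations of 1,000,000 events is an assumption.

```python
def cluster_layout(nodes, processors_per_node, cores_per_processor, mpi_jobs_per_node):
    """Return (total cores, number of MPI jobs, OpenMP threads per MPI job)."""
    total_cores = nodes * processors_per_node * cores_per_processor
    mpi_jobs = nodes * mpi_jobs_per_node
    return total_cores, mpi_jobs, total_cores // mpi_jobs


def total_events(first_iters, first_events, second_iters, second_events):
    """Total number of events for an 'a x b + c x d' style run specification."""
    return first_iters * first_events + second_iters * second_events


if __name__ == "__main__":
    print(cluster_layout(24, 2, 6, 1))   # 1x12 mode: (288, 24, 12)
    print(cluster_layout(24, 2, 6, 2))   # 2x6 mode:  (288, 48, 6)
    events = total_events(4, 100_000, 10, 1_000_000)
    print(events, events // 288)         # 10400000 36111
```

At 288 cores each core handles only about 36,000 events in total, which fits the observation that the cheaper LO and NLO runs stop scaling well before the full cluster is used.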

29 NNLO performance
It is better to run 1 MPI job per processor than 1 MPI job per node. LO is memory bound; NNLO is compute bound. Going one order higher in perturbative QCD costs about an order of magnitude in run time. We see that we can run NNLO W production in just over 5 minutes on 288 cores.

30 Scaling behavior
The NNLO scaling for all color singlet processes included in MCFM 8.0 as a function of the number of MPI jobs. Used 4x100,000+10x1,000,000 Vegas events. Each MPI job is one processor with 6 cores. Only pp → H shows the onset of non-scaling at 48 MPI jobs. All other processes can be sped up efficiently using a larger cluster.

31 Scaling behavior
Run times for all processes in the first release of NNLO MCFM. Other decay modes are also included. We see good scaling. For the simpler processes we see the transition to the memory bound limit starting. It will be no problem to run with times more events: still less than 24 hr.

32 Results for LHC

33 NNLO phenomenology
With the hybrid OpenMP/MPI version of MCFM we can make NNLO predictions for the LHC. The uncertainties in the NNLO predictions should be sufficiently small compared to the experimental uncertainties. We can make accurate predictions on moderate clusters on a time scale of a day. As a consequence, we can now expand to more complicated final states such as pp → V + jets.

34 NNLO phenomenology
Here are some results for pp → Z + jet at NNLO. These are complicated processes and require a large cluster (like NERSC) to run. This process is not yet in the public version of MCFM, but it will be included in the next version (together with processes like pp → W + jet, pp → H + jet, pp → photon + jet). We hope that improved methods of phase space integration will reduce the required run time.

35 Alternatives

36 Scaling on GPUs
Thread-Scalable Evaluation of Multi-Jet Observables, W. Giele, G. Stavenga, J. Winter, 2010.
Use a desktop with a multi-core processor and an Nvidia GPU. The most time consuming part at NNLO is the double bremsstrahlung tree level evaluation. One can program a GPU to do tree level recursion relations! The speedups in the table were obtained on a GPU that is several generations old. Expect an order of magnitude more gain on a modern GPU (from 0.66 teraflops to 5.5 teraflops in double precision).

37 Conclusions

38 Conclusions
We succeeded in making a multi-threaded version of MCFM that runs and scales well on workstations and on clusters of all sizes. Our competitors have so far not succeeded in making their codes parallel. With this publicly available threaded version of MCFM we can do efficient NNLO phenomenology at the LHC for color singlet (i.e. no jets) processes. We are working on many fronts to include new processes and to advance numerical techniques such as phase space integration, in order to include more complicated processes at NNLO. The next version(s) of MCFM will also include pp → V + jet, pp → H + jet, pp → photon + jet, pp → VV,


More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins Scientific Computing and Imaging Institute & University of Utah I. Uintah Overview

More information

High Performance Computing with Accelerators

High Performance Computing with Accelerators High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing

More information

Turbostream: A CFD solver for manycore

Turbostream: A CFD solver for manycore Turbostream: A CFD solver for manycore processors Tobias Brandvik Whittle Laboratory University of Cambridge Aim To produce an order of magnitude reduction in the run-time of CFD solvers for the same hardware

More information

Maximizing Memory Performance for ANSYS Simulations

Maximizing Memory Performance for ANSYS Simulations Maximizing Memory Performance for ANSYS Simulations By Alex Pickard, 2018-11-19 Memory or RAM is an important aspect of configuring computers for high performance computing (HPC) simulation work. The performance

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Running the FIM and NIM Weather Models on GPUs

Running the FIM and NIM Weather Models on GPUs Running the FIM and NIM Weather Models on GPUs Mark Govett Tom Henderson, Jacques Middlecoff, Jim Rosinski, Paul Madden NOAA Earth System Research Laboratory Global Models 0 to 14 days 10 to 30 KM resolution

More information

PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters

PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters IEEE CLUSTER 2015 Chicago, IL, USA Luis Sant Ana 1, Daniel Cordeiro 2, Raphael Camargo 1 1 Federal University of ABC,

More information

Overview of Parallel Computing. Timothy H. Kaiser, PH.D.

Overview of Parallel Computing. Timothy H. Kaiser, PH.D. Overview of Parallel Computing Timothy H. Kaiser, PH.D. Introduction What is parallel computing? Why go parallel? The best example of parallel computing Some Terminology Slides and examples

More information

arxiv: v1 [physics.ins-det] 11 Jul 2015

arxiv: v1 [physics.ins-det] 11 Jul 2015 GPGPU for track finding in High Energy Physics arxiv:7.374v [physics.ins-det] Jul 5 L Rinaldi, M Belgiovine, R Di Sipio, A Gabrielli, M Negrini, F Semeria, A Sidoti, S A Tupputi 3, M Villa Bologna University

More information

Parallelism and Concurrency. COS 326 David Walker Princeton University

Parallelism and Concurrency. COS 326 David Walker Princeton University Parallelism and Concurrency COS 326 David Walker Princeton University Parallelism What is it? Today's technology trends. How can we take advantage of it? Why is it so much harder to program? Some preliminary

More information



More information
