High Performance Computing for Event Generators at the LHC
1 High Performance Computing for Event Generators at the LHC
References:
- A Multi-Threaded Version of MCFM, J.M. Campbell, R.K. Ellis, W. Giele.
- Higgs boson production in association with a jet at NNLO using jettiness subtractions, R. Boughezal, C. Focke, W. Giele, X. Liu, F. Petriello.
- Z-boson production in association with a jet at next-to-next-to-leading order in perturbative QCD, R. Boughezal, J.M. Campbell, R.K. Ellis, C. Focke, W. Giele, X. Liu, F. Petriello.
- Color singlet production at NNLO in MCFM, R. Boughezal, J.M. Campbell, R.K. Ellis, C. Focke, W. Giele, X. Liu, F. Petriello, C. Williams, 2016.
2 Introduction
3 LHC physics & Event Generators Experiments require precise predictions of known physics to extract new physics such as the Higgs boson. One of the goals at the LHC is to measure the properties of the newly discovered Higgs boson. This requires measuring the Higgs couplings to other particles at high precision. Given the precision of the LHC experiments' measurements, more and more accurate predictions are needed (there are many background events compared to signal). Reaching the desired accuracy on the theory predictions requires high performance computing, and it forces us to focus on high performance computing in the development of tools for the experimenters.
4 The MCFM parton level event generator The event generator we use is MCFM: An update on vector boson pair production at hadron colliders, J.M. Campbell, R.K. Ellis, Phys. Rev. D60 (1999). MCFM has been evolving since 1999 and can currently make predictions for hundreds of processes at the LHC. Current LHC phenomenology requires higher and higher precision, which means higher-order corrections have to be calculated in the event generator. This results in an exponential increase in the required computer resources.
5 Making predictions MCFM makes predictions at the parton level, order-by-order in the strong coupling constant. This means it predicts the jet momenta, but not the jet content. One can match a shower Monte Carlo such as PYTHIA to MCFM to get the particle content of the jets. The event generator consists of two parts: the calculation of the parton scattering amplitudes, and the integration of the partons over phase space to get the observables.
6 Calculating scattering amplitudes To calculate the partonic scattering amplitudes one uses Feynman rules. This allows the calculation of observables order-by-order in the strong coupling constant. The calculations are algorithmic in nature: they can be done analytically or through numerical algorithms. A lot of effort has gone into evaluating partonic scattering amplitudes at leading and next-to-leading order; it is a well defined task. The current effort is to calculate next-to-next-to-leading order.
7 Phase space integration and Vegas The integration of the partonic scattering amplitude over phase space is more of a black art. The integrations are high dimensional (10-15 dimensions) over very complex functions. The standard tool is VEGAS, an adaptive integration algorithm. While imperfect, it performs well at next-to-leading order; at next-to-next-to-leading order it struggles a bit. This is where almost all the computer time is spent, typically ~10^9 events. The goal is to do this on the order of hours on a medium sized cluster.
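The adaptive idea behind VEGAS can be illustrated with a minimal one-dimensional sketch (in Python for readability; MCFM itself is Fortran, and the real integrals are 10-15 dimensional). All names here are illustrative, not MCFM code: sample from an adaptive bin grid, then move the bin edges toward the regions where the integrand is large.

```python
import random

def vegas_1d(f, nbins=50, iterations=5, nevents=2000, seed=1):
    """Toy VEGAS: importance sampling on an adaptive 1-D grid over [0, 1]."""
    rng = random.Random(seed)
    edges = [i / nbins for i in range(nbins + 1)]   # uniform starting grid
    estimate = 0.0
    for _ in range(iterations):
        total = 0.0
        binweight = [0.0] * nbins
        for _ in range(nevents):
            b = rng.randrange(nbins)                # pick a bin uniformly
            lo, hi = edges[b], edges[b + 1]
            x = lo + (hi - lo) * rng.random()       # uniform within the bin
            w = f(x) * nbins * (hi - lo)            # f(x)/p(x), p = 1/(nbins*(hi-lo))
            total += w
            binweight[b] += abs(w)
        estimate = total / nevents
        # Grid optimization: move the edges so each bin holds roughly equal
        # accumulated |weight|; the next iteration then samples the peak
        # region more densely, which is exactly the "follow the scattering
        # amplitude" behaviour described on the slide.
        cum = [0.0]
        for wb in binweight:
            cum.append(cum[-1] + wb)
        new_edges = [0.0]
        for k in range(1, nbins):
            goal = k * cum[-1] / nbins
            j = next(i for i in range(nbins) if cum[i + 1] >= goal)
            frac = (goal - cum[j]) / (cum[j + 1] - cum[j])
            new_edges.append(edges[j] + frac * (edges[j + 1] - edges[j]))
        new_edges.append(1.0)
        edges = new_edges
    return estimate
```

For a smooth test function such as f(x) = 3x^2 (whose integral over [0, 1] is 1) a few thousand events per iteration already give a percent-level estimate.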
8 High Performance Computing: MPI MPI runs a copy of the program on each node and does simple communication between nodes by sending messages (data), allowing simple parallelization. It is simple to program, but requires a nonstandard compiler wrapper. It is made to run a job on different nodes, each with its own CPU and memory, with limited exchange of data over a network. However, current CPUs include more and more computing cores (threads) for parallelization, which causes problems when using MPI alone.
9 High Performance Computing: openmp openmp runs on a single motherboard with unified memory. It is made to exploit the multi-threading of modern CPU chips. It supports shared memory between threads, which is crucial to get good scaling. openmp is a standard supported by C/C++/Fortran compilers. It is straightforward to add openmp directives to an existing program, but to speed the program up some thought has to be given to the use of memory.
10 High Performance Computing: hybrid MPI/openmp While openmp is perfect for multi-threaded parallel programming, you still need a way to distribute the work over a cluster. For this you can use MPI, making your program scale on large clusters such as the Cori cluster at NERSC, as well as on your local cluster or your own desktop. We could change the existing MCFM event generator fairly easily by adding openmp compiler directives and a few MPI instruction lines; making it work and validating it took a while.
11 Paradigm shift in programming philosophy One important concept to understand in parallel programming is the distinction between memory bound and compute bound limits. We are used to serial programming, where storing and reusing data was often preferable to recalculating it. In parallel programming, however, having many threads making memory requests leaves the threads sitting idle, and the program speed is dictated by memory access (the program does not scale). Using shared memory, or recalculating instead of storing data, overcomes this and makes the run time scale with the number of threads used.
12 Paradigm shift in programming philosophy One's first instinct is to run independent jobs, each with different random numbers on each thread, and combine the different results. This runs into massive memory bound issues and no acceleration is obtained; even worse, execution often slows down significantly. Proper use of openmp, which involves giving some thought to memory usage, is crucial for proper scaling. Optimizing the shared memory usage is critical to reach the compute limit.
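The standard remedy sketched above is to give every thread a private accumulator and do a single reduction at the end, instead of letting all threads update shared data event by event. A schematic Python illustration (Python threads will not show OpenMP-style CPU scaling, so only the reduction pattern is the point; all function names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def integrate_chunk(f, indices, nevents):
    """Each worker keeps a PRIVATE accumulator: no shared-memory traffic
    while events are evaluated, one combine at the very end."""
    s = 0.0
    for j in indices:
        x = (j + 0.5) / nevents   # deterministic midpoint "events"
        s += f(x)
    return s

def integrate(f, nevents=1000, nworkers=4):
    # Leapfrog partition of the event indices over the workers.
    chunks = [range(w, nevents, nworkers) for w in range(nworkers)]
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        partials = pool.map(lambda c: integrate_chunk(f, c, nevents), chunks)
    return sum(partials) / nevents   # single reduction step
```

Because each worker touches only its own accumulator, the combine step is one sum per worker rather than one shared update per event.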
13 Making a parallel event generator
14 Putting it all together I gave an overview of all components needed to construct an event generator that runs and scales on modern processors and clusters. We can now put everything together and use it on realistic physical predictions to see how it works. The first step is to use openmp and get the event generator to scale properly on a single node/motherboard (A Multi-Threaded Version of MCFM, J.M. Campbell, R.K. Ellis, W. Giele). The next step is to build in support for running on clusters using MPI (Color singlet production at NNLO in MCFM, R. Boughezal, J.M. Campbell, R.K. Ellis, C. Focke, W. Giele, X. Liu, F. Petriello, C. Williams, 2016).
15 How to parallelize The Monte Carlo adaptive integration is done through VEGAS. In each iteration the grid is optimized using the nevents generated events. This means that in the next iteration the randomly generated events follow the scattering amplitude more closely. This allows a fairly simple parallelization of the event generation:

  do i=1,iterations
    do j=1,nevents
      Evaluate a randomly generated event
    enddo
    Optimize grid
  enddo
16 Including openmp We use openmp so the inner loop is spread over the available threads. For the optimization of the grid the results of all threads are used. To debug the parallelized event generation, we ensured that exactly the same events were generated independent of the number of threads used; this readily exposed the bugs due to parallelization.

  do i=1,iterations
    do j=1,nevents        ! spread over the available threads by openmp
      Evaluate a randomly generated event
    enddo
    Optimize grid
  enddo
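The debugging trick above, identical event sequences regardless of thread count, can be mimicked by deriving each event's random numbers from the event index rather than from a shared generator. The seeding scheme below is an illustration, not MCFM's actual random number handling:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def event_weight(j):
    # A private generator seeded by the event index: event j draws the
    # same random point no matter which thread happens to evaluate it.
    rng = random.Random(1234 + j)   # base seed 1234 is arbitrary
    x = rng.random()
    return 3.0 * x * x              # toy stand-in for the matrix element

def run(nevents, nthreads):
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        weights = list(pool.map(event_weight, range(nevents)))
    return sum(weights) / nevents   # same order, bit-identical result
```

Since the weights come back in event order, the accumulated result is bit-for-bit identical for any number of threads, so any discrepancy points directly at a parallelization bug.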
17 Hardware used We use 4 different configurations to test the openmp version of MCFM:
- Standard desktop with an Intel Core i7 (4 cores/8 threads, 3.4 GHz, 8 MB cache).
- Double Intel X5650 processor (2x6 cores, 2.66 GHz, 12 MB cache).
- Quadruple AMD 6128 HE Opteron (4x8 cores, 2 GHz, 12 MB cache).
- Xeon Phi co-processor (60 cores/240 threads, 1.1 GHz, 28.5 MB).
These are all single motherboards and could be in a workstation; the Xeon Phi slots into the PCI bus of a workstation.
18 First look at LO We see the effect of hyper-threading on the Intel Core i7. The Intel Xeon scales very well and is fully compute bound. We see a memory bound issue for the AMD 6128 above 16 used threads, and similarly for the Xeon Phi co-processor. Leading order is not particularly compute intensive; we need more compute intensive processes.
19
20 NLO performance At next-to-leading order much more has to be calculated. As a result we see good scaling, without any memory bound issues. The Xeon Phi co-processor has 60 cores, each supporting 4 threads; you can see some artifacts at the 60/120/180 boundaries. The overall performance of MCFM using openmp is very good: e.g. on the AMD motherboard performance increases by a factor of ~32!
21
22 Distributions (I) The di-jet mass differential cross section for pp -> H(-> bb) + 2 jets at NLO. Uses 1 hour of runtime on a single thread of the Intel Core i7 and on the quadruple AMD 6128 HE. We can do useful phenomenology studies at NLO with just an hour of run time using the multi-threaded version of MCFM; with the non-openmp version you would have to run on the order of day(s) to get an equivalent result.
23 Distributions (II) Using 4x1,500,000+10x15,000,000 events. At LO it takes 12 min on the 12-threaded dual Intel Xeon X5650. At NLO it takes 22 hours on the 32-threaded quad AMD Opteron. (It would take around a month for a single-thread evaluation on the Intel Core i7.)
24 Going to NNLO The LHC accuracy more and more necessitates going to next-to-next-to-leading order. Going from LO to NLO we went from ~10 minutes to ~10 hours using openmp on a single motherboard. For NNLO we would need month(s), so we need to run on a cluster. This means we have to include MPI in the code.
25 Implementing MPI The implementation is easy: just add a few code lines. The syntax is somewhat awkward, as is the compilation (requiring a modified compiler). Because there is no shared memory, debugging is quite trivial compared to openmp. Because MPI implementations are not fully standardized, some runtime tinkering is often needed (depending a bit on the cluster hardware etc.).

  do i=1,iterations
    call mpi_bcast(xi,ngrid*mxdim,mpi_double_precision,
 .                 0,mpi_comm_world,ierr)
!$omp parallel do
!$omp& schedule(dynamic)
!$omp& default(private)
!$omp& shared(incall,xi,ncall,ndim,sfun,sfun2,d,cfun,cfun2,cd)
!$omp& shared(rank,size)
    do calls = 1, ncall/size
      Evaluate a randomly generated event
    enddo
    Optimize grid
  enddo
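The division of work in the do calls = 1, ncall/size loop is plain index bookkeeping, which can be sketched without an actual MPI installation (with mpi4py one would replace the combine step by an MPI reduce; all names below are illustrative, not MCFM code):

```python
def rank_events(ncall, size, rank):
    # The slide's "do calls = 1, ncall/size" loop as an explicit index
    # set: each MPI rank evaluates its own disjoint block of events.
    per_rank = ncall // size
    return range(rank * per_rank, (rank + 1) * per_rank)

def combine(partials):
    # After an iteration the per-rank grid statistics are summed
    # (an MPI reduce in the real code); combining is a plain
    # element-wise sum over the ranks' accumulators.
    out = [0.0] * len(partials[0])
    for p in partials:
        for i, v in enumerate(p):
            out[i] += v
    return out
```

The key property is that the blocks are disjoint and together cover all ncall events (when ncall is divisible by size), so the combined grid statistics are independent of the number of ranks.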
26 Hardware used We use 3 different configurations to test the hybrid openmp/MPI version of MCFM:
- Double Intel X5650 processor (2x6 cores, 2.66 GHz, 12 MB cache), part of a 24 node cluster.
- Quadruple AMD 6128 HE Opteron (4x8 cores, 2 GHz, 12 MB cache), part of a 32 node cluster.
- Xeon Phi co-processor (60 cores/240 threads, 1.1 GHz, 28.5 MB).
The NERSC Cori cluster uses a more recent version of the Xeon Phi on each node: 9,668 single-socket compute nodes in the system, each containing an Intel Xeon Phi processor at 1.40 GHz with 68 cores and support for 4 hardware threads each (272 threads total).
27 Scaling on NERSC The process is pp -> H + 2 jets at NLO. Two 6-core Intel chips per node; 6 openmp threads per MPI task. It scales as expected up to ~5,000 threads (running on NERSC). Note that above 5,000 threads we get low on events per thread and we become memory bound.
28 A first look at NNLO Runtime of pp -> W+ at LO/NLO/NNLO from 1 up to 288 cores. The cluster consists of 24 nodes, each containing 2 processors of 6 cores. Two running modes: 1 MPI job per node, i.e. 1x12 (divided cache); and 2 MPI jobs per node, i.e. 1 MPI job per processor, 2x6. Used 4x100,000+10x1,000,000 Vegas events. LO/NLO stopped scaling above 50/100 cores: the memory dominated regime. 1x12 runs slower than 2x6 because in the 2x6 case openmp does not have to sync cache between the 2 processors.
29 NNLO performance It is better to run 1 MPI job per processor than 1 MPI job per node. LO is memory bound; NNLO is compute bound. Going an order higher in PQCD takes about an order of magnitude more time. We can run NNLO W production in just over 5 minutes on 288 cores.
30 Scaling behavior The NNLO scaling for all color singlet processes included in MCFM 8.0 as a function of the number of MPI jobs. Used 4x100,000+10x1,000,000 Vegas events; each MPI job is one processor with 6 cores. Only pp -> H shows the onset of non-scaling, at 48 MPI jobs. All other processes can be sped up efficiently using a larger cluster.
31 Scaling behavior Run times for all processes in the first release of NNLO MCFM; other decay modes are also included. We see good scaling. For the simpler processes we see the memory bound transition starting. It would be no problem to run with many times more events: still less than 24 hr.
32 Results for LHC
33 NNLO phenomenology With the hybrid openmp/MPI version of MCFM we can make NNLO predictions for the LHC. The uncertainties in the NNLO predictions should be sufficiently small compared to the experimental uncertainties. We can make accurate predictions on moderate clusters on a time scale of a day. As a consequence, we can now expand to more complicated final states such as pp -> V+jets.
34 NNLO phenomenology Here are some results for pp -> Z + jet at NNLO. These are complicated processes and require a large cluster (like NERSC) to run. This process is not yet in the public version of MCFM, but it will be included in the next version (together with processes like pp -> W+jet, pp -> H+jet, pp -> photon+jet). We hope that improved methods of phase space integration will reduce the required run time.
35 Alternatives
36 Scaling on GPUs Thread-Scalable Evaluation of Multi-Jet Observables, W. Giele, G. Stavenga, J. Winter, 2010. Use a desktop with a multi-core processor and an Nvidia GPU. The most time consuming part at NNLO is the double bremsstrahlung tree level evaluation, and one can program a GPU to do tree level recursion relations! The speedup times in the table were obtained on a GPU several generations old; expect an order of magnitude more gain on a modern GPU (from 0.66 to 5.5 teraflops for double precision).
37 Conclusions
38 Conclusions We were able to make a multi-threaded version of MCFM that runs and scales well on workstations and on clusters of all sizes. Our competitors have so far not succeeded in making their code parallel. With this publicly available threaded version of MCFM we can do efficient NNLO phenomenology at the LHC for color singlet (i.e. no jets) processes. We are working on many fronts to include new processes and advance the numerical techniques, such as phase space integration, to be able to include more complicated processes at NNLO. The next version(s) of MCFM will also include pp -> V+jet, pp -> H+jet, pp -> photon+jet, pp -> VV, ...
Improved Event Generation at NLO and NNLO, or Extending MCFM to include NNLO processes. W. Giele, RadCor 2015. NNLO in MCFM, jettiness approach: using the already well tested NLO MCFM as the double-real and virtual-real ...
More informationMaking extreme computations possible with virtual machines
Journal of Physics: Conference Series PAPER OPEN ACCESS Making extreme computations possible with virtual machines To cite this article: J Reuter et al 2016 J. Phys.: Conf. Ser. 762 012071 View the article
More informationAccelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX
Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX David Pfander*, Gregor Daiß*, Dominic Marcello**, Hartmut Kaiser**, Dirk Pflüger* * University of Stuttgart ** Louisiana State
More informationReal-Time Ray Tracing Using Nvidia Optix Holger Ludvigsen & Anne C. Elster 2010
1 Real-Time Ray Tracing Using Nvidia Optix Holger Ludvigsen & Anne C. Elster 2010 Presentation by Henrik H. Knutsen for TDT24, fall 2012 Om du ønsker, kan du sette inn navn, tittel på foredraget, o.l.
More informationAccelerating koblinger's method of compton scattering on GPU
Available online at www.sciencedirect.com Procedia Engineering 24 (211) 242 246 211 International Conference on Advances in Engineering Accelerating koblingers method of compton scattering on GPU Jing
More informationHigh performance computing and numerical modeling
High performance computing and numerical modeling Volker Springel Plan for my lectures Lecture 1: Collisional and collisionless N-body dynamics Lecture 2: Gravitational force calculation Lecture 3: Basic
More informationOPTIMIZATION OF THE CODE OF THE NUMERICAL MAGNETOSHEATH-MAGNETOSPHERE MODEL
Journal of Theoretical and Applied Mechanics, Sofia, 2013, vol. 43, No. 2, pp. 77 82 OPTIMIZATION OF THE CODE OF THE NUMERICAL MAGNETOSHEATH-MAGNETOSPHERE MODEL P. Dobreva Institute of Mechanics, Bulgarian
More informationParallel Architectures
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationThe Stampede is Coming: A New Petascale Resource for the Open Science Community
The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation
More informationParallel Programming Libraries and implementations
Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.
More informationGPU Debugging Made Easy. David Lecomber CTO, Allinea Software
GPU Debugging Made Easy David Lecomber CTO, Allinea Software david@allinea.com Allinea Software HPC development tools company Leading in HPC software tools market Wide customer base Blue-chip engineering,
More informationAllinea Unified Environment
Allinea Unified Environment Allinea s unified tools for debugging and profiling HPC Codes Beau Paisley Allinea Software bpaisley@allinea.com 720.583.0380 Today s Challenge Q: What is the impact of current
More informationACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS
ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation
More informationDebugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc.
Debugging CUDA Applications with Allinea DDT Ian Lumb Sr. Systems Engineer, Allinea Software Inc. ilumb@allinea.com GTC 2013, San Jose, March 20, 2013 Embracing GPUs GPUs a rival to traditional processors
More informationPerformance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture. Alexander Berreth. Markus Bühler, Benedikt Anlauf
PADC Anual Workshop 20 Performance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture Alexander Berreth RECOM Services GmbH, Stuttgart Markus Bühler, Benedikt Anlauf IBM Deutschland
More informationMap3D V58 - Multi-Processor Version
Map3D V58 - Multi-Processor Version Announcing the multi-processor version of Map3D. How fast would you like to go? 2x, 4x, 6x? - it's now up to you. In order to achieve these performance gains it is necessary
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationA Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids
A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids Patrice Castonguay and Antony Jameson Aerospace Computing Lab, Stanford University GTC Asia, Beijing, China December 15 th, 2011
More informationOverview of High Performance Computing
Overview of High Performance Computing Timothy H. Kaiser, PH.D. tkaiser@mines.edu http://inside.mines.edu/~tkaiser/csci580fall13/ 1 Near Term Overview HPC computing in a nutshell? Basic MPI - run an example
More informationIntel Xeon Phi архитектура, модели программирования, оптимизация.
Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture
More informationExperts in Application Acceleration Synective Labs AB
Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg
More informationarxiv: v1 [hep-lat] 12 Nov 2013
Lattice Simulations using OpenACC compilers arxiv:13112719v1 [hep-lat] 12 Nov 2013 Indian Association for the Cultivation of Science, Kolkata E-mail: tppm@iacsresin OpenACC compilers allow one to use Graphics
More informationPORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune
PORTING CP2K TO THE INTEL XEON PHI ARCHER Technical Forum, Wed 30 th July Iain Bethune (ibethune@epcc.ed.ac.uk) Outline Xeon Phi Overview Porting CP2K to Xeon Phi Performance Results Lessons Learned Further
More informationOut-of-Order Parallel Simulation of SystemC Models. G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.)
Out-of-Order Simulation of s using Intel MIC Architecture G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.) Speaker: Rainer Dömer doemer@uci.edu Center for Embedded Computer
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationMachine Learning for (fast) simulation
Machine Learning for (fast) simulation Sofia Vallecorsa for the GeantV team CERN, April 2017 1 Monte Carlo Simulation: Why Detailed simulation of subatomic particles is essential for data analysis, detector
More informationPresenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs
Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance
More informationCURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS
CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS Roberto Gomperts (NVIDIA, Corp.) Michael Frisch (Gaussian, Inc.) Giovanni Scalmani (Gaussian, Inc.) Brent Leback (PGI) TOPICS Gaussian Design
More informationIntroduction to parallel Computing
Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts
More informationPARALLELIZATION OF POTENTIAL FLOW SOLVER USING PC CLUSTERS
Proceedings of FEDSM 2000: ASME Fluids Engineering Division Summer Meeting June 11-15,2000, Boston, MA FEDSM2000-11223 PARALLELIZATION OF POTENTIAL FLOW SOLVER USING PC CLUSTERS Prof. Blair.J.Perot Manjunatha.N.
More informationLecture 1: Why Parallelism? Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 1: Why Parallelism? Parallel Computer Architecture and Programming Hi! Hongyi Alex Kayvon Manish Parag One common definition A parallel computer is a collection of processing elements that cooperate
More informationEE 7722 GPU Microarchitecture. Offered by: Prerequisites By Topic: Text EE 7722 GPU Microarchitecture. URL:
00 1 EE 7722 GPU Microarchitecture 00 1 EE 7722 GPU Microarchitecture URL: http://www.ece.lsu.edu/gp/. Offered by: David M. Koppelman 345 ERAD, 578-5482, koppel@ece.lsu.edu, http://www.ece.lsu.edu/koppel
More informationThe Optimal CPU and Interconnect for an HPC Cluster
5. LS-DYNA Anwenderforum, Ulm 2006 Cluster / High Performance Computing I The Optimal CPU and Interconnect for an HPC Cluster Andreas Koch Transtec AG, Tübingen, Deutschland F - I - 15 Cluster / High Performance
More information