Improved Event Generation at NLO and NNLO, or Extending MCFM to include NNLO processes

Improved Event Generation at NLO and NNLO, or Extending MCFM to include NNLO processes. W. Giele, RadCor 2015.

NNLO in MCFM. Jettiness approach: using the already well-tested NLO MCFM as the double-real and virtual-real generator for NNLO. W+jet: Boughezal, Focke, Liu, Petriello. H+jet: Boughezal, Focke, Giele, Liu, Petriello. Better phase space: generating double-real and virtual-real events from a fixed Born phase space; extending the Matrix Element Method to Next-to-Leading Order (Campbell, Giele, Williams). Fast performance on modern processors: using multi-threading/multi-core (this talk); A Multi-Threaded Version of MCFM (Campbell, Ellis, Giele).

Single core vs multiple cores. The power used by a processor is given by P = C V^2 F, where C is the capacitance switched per cycle, V is the voltage (which determines the rate at which the capacitance charges and discharges), and F is the clock frequency of the processor. By putting in two cores and halving the frequency we keep the same performance: C -> 2C, F -> F/2, V -> V/2, so that P -> P/4. Thus, by reducing the frequency one can increase the performance of the chip by adding many cores.
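As a quick check of the arithmetic behind this scaling argument:

\[
P' \;=\; (2C)\left(\frac{V}{2}\right)^{2}\left(\frac{F}{2}\right)
   \;=\; \frac{1}{4}\,C V^{2} F \;=\; \frac{P}{4},
\]

while the total throughput of the two half-frequency cores matches that of the original single core.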

GPUs vs multi-core CPUs. GPUs have many cores (up to 2880), giving a potentially huge acceleration. For example, at LO a speed-up of ~500 can be obtained on a desktop: Thread-Scalable Evaluation of Multi-Jet Observables (Giele, Stavenga, Winter). However, existing programs need to be carefully reprogrammed using CUDA with careful memory layout, special hardware is required, and this is not really usable for MCFM.

GPUs vs multi-core CPUs. In contrast, programs can easily be made to use multiple cores in parallel using OpenMP and MPI, with minimal changes to the code. The program then accelerates on laptops, desktops and clusters, depending on the number of cores available. Processors are developing quickly and are optimized to run OpenMP and other parallel methods, e.g. the Xeon Phi co-processor card (240 threads, 1 GHz, 28 MB cache) and the Xeon E5-2699 processor (18 cores, 2.3-3.6 GHz, 45 MB cache).

OpenMP vs MPI. OpenMP has a strict standard and is included in the GNU and Intel compilers (just compile with the OpenMP flag). A program can be made to run in parallel by adding simple pragmas, but it can be challenging to debug. OpenMP requires a unified memory address space for all threads (i.e. all threads access the same memory). Advantage: cache and memory are shared between all threads. Disadvantage: it does not run across clusters. It is perfect for running on a single multi-core processor (laptop, desktop, single node of a cluster), and it is hidden from the user: the program automatically uses all available threads when compiled with the OpenMP flag. A minimal sketch of such a pragma-parallelized loop is given below.
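To illustrate the mechanism (a minimal free-form Fortran sketch, not MCFM code; the program and variable names are made up), a loop over events can be parallelized by a single directive, with the per-thread partial sums combined by a reduction:

      program omp_sketch
      use omp_lib
      implicit none
      integer :: i
      integer, parameter :: ncall = 1000000
      double precision :: wgt, total
      total = 0d0
      print *, 'threads available:', omp_get_max_threads()
!$omp parallel do private(wgt) reduction(+:total)
      do i = 1, ncall
         wgt = 1d0/dble(i)       ! placeholder for an event weight
         total = total + wgt
      enddo
!$omp end parallel do
      print *, 'sum of weights =', total
      end program omp_sketch

Compiled with the OpenMP flag (e.g. -fopenmp for gfortran) the loop iterations are spread over all available threads; without the flag the directives are ignored and the program runs serially.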

OpenMP vs MPI. To extend the parallelism to multiple nodes one can use MPI. It runs independent programs, each with its own memory space; data is transferred between the MPI processes by passing messages. MPI has several flavors and is less standardized: it needs to be installed separately, and its libraries and compiler wrappers must be available. We developed a hybrid OpenMP/MPI version of MCFM: on each processor we use its threads through OpenMP, while MPI links the different OpenMP processes. A sketch of this structure is given below.
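A minimal sketch of the hybrid structure (free-form Fortran, again with made-up names and not the actual MCFM driver): each MPI rank evaluates its share of the events, the loop inside each rank is shared over the OpenMP threads, and the per-rank sums are combined with mpi_reduce:

      program hybrid_sketch
      use mpi
      implicit none
      integer :: ierr, rank, nproc, i
      integer, parameter :: ncall = 1000000
      double precision :: wgt, lsum, total
      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      call mpi_comm_size(mpi_comm_world, nproc, ierr)
      lsum = 0d0
      ! each rank takes ncall/nproc events; the OpenMP threads
      ! of that rank share the loop iterations
!$omp parallel do private(wgt) reduction(+:lsum)
      do i = 1, ncall/nproc
         wgt = 1d0/dble(i + rank)    ! placeholder for an event weight
         lsum = lsum + wgt
      enddo
!$omp end parallel do
      ! combine the per-rank partial sums on rank 0
      call mpi_reduce(lsum, total, 1, mpi_double_precision, &
                      mpi_sum, 0, mpi_comm_world, ierr)
      if (rank == 0) print *, 'combined sum =', total
      call mpi_finalize(ierr)
      end program hybrid_sketch

Built with an MPI compiler wrapper plus the OpenMP flag (e.g. mpif90 -fopenmp) and launched with mpirun, this gives (number of ranks) x (threads per rank) workers, mirroring the 100 x 5 = 500 threads quoted below.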

Parallelism in event MCs. The easiest way to use multiple threads is to make Vegas parallel: the events to be evaluated are spread over all threads, while there is only a single Vegas grid. For example, 100 processors linked with MPI, each running OpenMP with 5 threads, give us 500 threads. If Vegas uses 10x1,000,000 calls, each thread will evaluate 10x2,000 events, and between the 10 sweeps the Vegas grid is updated using all 500x2,000 events. We ensure we get the same result using one thread or 500 threads (this is indispensable for validation and debugging).

Parallelism in event MCs. To get the same (and correct) result, independent of how many threads the run uses, one has to use Kahan summation to ensure no rounding errors occur. Rounding errors in summation become an issue when using a large number of events. A sketch of the algorithm is given below.
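For reference, a minimal sketch of Kahan (compensated) summation applied to a running sum of weights (illustrative only; this is not the MCFM implementation):

      program kahan_sketch
      implicit none
      double precision :: s, c, y, t, wgt
      integer :: i
      s = 0d0
      c = 0d0                  ! running compensation (lost low-order bits)
      do i = 1, 1000000
         wgt = 1d0/dble(i)     ! placeholder for an event weight
         y = wgt - c           ! correct the new term by the stored error
         t = s + y             ! add; low-order bits of y may be lost here
         c = (t - s) - y       ! what was actually added minus what we meant to add
         s = t
      enddo
      print *, 'compensated sum =', s
      end program kahan_sketch

Note that compensated summation relies on the compiler not re-associating the floating-point operations, so it should not be built with aggressive options such as -ffast-math.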

Parallelism in event MCs. Vegas code:

      call mpi_bcast(xi,ngrid*mxdim,mpi_double_precision,
     .               0,mpi_comm_world,ierr)
!$omp parallel do
!$omp& schedule(dynamic)
!$omp& default(private)
!$omp& shared(incall,xi,ncall,ndim,sfun,sfun2,d,cfun,cfun2,cd)
!$omp& shared(rank,size)
      do calls = 1, ncall/size

This is all as far as MPI is concerned (except for combining the histograms when the run terminates). For OpenMP one has to specify whether variables are local to the thread or global. Note: MPI has separate Vegas files/histograms per task, while OpenMP has a single Vegas file and histograms.

Performance of OpenMP MCFM. We use H+2 jets at NLO as an example (Campbell, Ellis, Giele). This process is used in H+1 jet at NNLO in the jettiness approach, so speedy evaluation is crucial. We use 4 different configurations to test OpenMP MCFM: a standard desktop with an Intel Core i7-4770 (4 cores, 3.4 GHz, 8 MB cache); a dual Intel Xeon X5650 (2 x (6 cores, 2.66 GHz, 12 MB cache)); a quad AMD Opteron 6128 HE (4 x (8 cores, 2 GHz, 12 MB cache)); and a Xeon Phi co-processor (60 cores/240 threads, 1.1 GHz, 28.5 MB cache).

4x1,000+10x10,000 events, 10-dimensional integration (the LO run). The i7 has 4 cores, each with 2 hyper-threads. The AMD and Xeon Phi become bandwidth bound (the run time becomes dominated by memory-fetch time).

4x1,000+10x10,000 events, 13-dimensional integration. NLO, on the other hand, is compute bound and scales as one would expect. The Xeon Phi exhibits a bit of slowdown above 60 threads (it has 60 cores).

The dijet invariant mass distribution (5 GeV bins) at NLO: 1 hour of runtime using 1 thread on the Intel i7 vs 32 threads on the quad AMD Opteron.

The dijet invariant mass distribution (5 GeV bins) at NLO, 4x1,500,000+10x15,000,000 events. LO: 12 min on the 12-thread dual Intel Xeon X5650. NLO: 22 hours on the 32-thread quad AMD Opteron.

Performance of hybrid MCFM (Boughezal, Focke, Giele, Liu, Petriello). The hybrid mode is for running on clusters: while NLO can easily be run on desktops, for NNLO clusters are still needed to get reasonable run times. Care needs to be taken with integer counters (see the note below). How best to use Vegas is still under study (one does not want to freeze up the adaptive grid).
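The slide does not spell out the issue with the counters; a plausible reading (an assumption on our part) is that default 32-bit integers overflow once the total event count summed over all ranks and threads exceeds 2^31-1 ~ 2.1x10^9, in which case the global counters need a 64-bit kind, e.g.:

      program counter_sketch
      implicit none
      ! assumption: use a 64-bit integer kind for global event counters
      ! so that the total over all ranks and threads cannot overflow
      integer, parameter :: i8 = selected_int_kind(18)
      integer(kind=i8) :: nevt
      nevt = 3000000000_i8      ! ~3e9 events: would overflow a default integer
      print *, 'events so far:', nevt
      end program counter_sketch

selected_int_kind(18) guarantees at least 18 decimal digits, i.e. a 64-bit integer on common platforms.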

Quad 6-core Intel chips per node on NERSC, 6 OpenMP threads per MPI task. Scales as expected up to ~5,000 threads, with 17 events/thread/iteration.

H+jet: Boughezal, Focke, Giele, Liu, Petriello. We used 15x10^7 events on 1,000 threads, which results in ~30 min of runtime for the results in the published paper. With better jettiness phase-space generators this time will improve further.

Conclusions. We have successfully made a hybrid OpenMP/MPI version of MCFM. It gives acceleration from laptops to large clusters while using a single Vegas grid: OpenMP keeps the cache unified on the chip, and MPI links up all the processors, still using a single Vegas grid. MCFM can fully take advantage of the new chips with more and more cores, and of the clusters based on these chips (multi-petaflop clusters).