Geant4 MT Performance. Soon Yung Jun (Fermilab) 21 st Geant4 Collaboration Meeting, Ferrara, Italy Sept , 2016

Similar documents
Geant4 Computing Performance Benchmarking and Monitoring

Multi-threaded ATLAS Simulation on Intel Knights Landing Processors

VLPL-S Optimization on Knights Landing

Intel Architecture for HPC

Double Rewards of Porting Scientific Applications to the Intel MIC Architecture

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor

Intel Knights Landing Hardware

Does the Intel Xeon Phi processor fit HEP workloads?

The GeantV prototype on KNL. Federico Carminati, Andrei Gheata and Sofia Vallecorsa for the GeantV team

Machine Learning for (fast) simulation

Performance Optimization of Smoothed Particle Hydrodynamics for Multi/Many-Core Architectures

Intel Xeon Phi архитектура, модели программирования, оптимизация.

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian

Scientific Computing with Intel Xeon Phi Coprocessors

A Cost Model for Data Stream Processing on Modern Hardware Constantin Pohl, Philipp Götze, Kai-Uwe Sattler

1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core

Performance and Energy Usage of Workloads on KNL and Haswell Architectures

Architecture without explicit locks for logic simulation on SIMD machines

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,

Improving Packet Processing Performance of a Memory- Bounded Application

HPC Architectures evolution: the case of Marconi, the new CINECA flagship system. Piero Lanucara

Bei Wang, Dmitry Prohorov and Carlos Rosales

Beyond core count: a look at new mainstream computing platforms for HEP workloads

Introduction to tuning on KNL platforms

INTEL HPC DEVELOPER CONFERENCE FUEL YOUR INSIGHT

CME 213 S PRING Eric Darve

EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA

Welcome. Virtual tutorial starts at BST

HPC Architectures. Types of resource currently in use

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends

Deep Learning with Intel DAAL

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

Simulating Stencil-based Application on Future Xeon-Phi Processor

Performance optimization of the Smoothed Particle Hydrodynamics code Gadget3 on 2nd generation Intel Xeon Phi

Towards modernisation of the Gadget code on many-core architectures Fabio Baruffa, Luigi Iapichino (LRZ)

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune

Improved Event Generation at NLO and NNLO. or Extending MCFM to include NNLO processes

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

ISTITUTO NAZIONALE DI FISICA NUCLEARE

Knights Corner: Your Path to Knights Landing

Introduction to tuning on KNL platforms

Umeå University

Umeå University

Efficient Parallel Programming on Xeon Phi for Exascale

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber

Reducing CPU Consumption of Geant4 Simulation in ATLAS

Maximizing Six-Core AMD Opteron Processor Performance with RHEL

arxiv: v2 [hep-lat] 3 Nov 2016

Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel Xeon Phi Processor

Directions in Workload Management

Master Informatics Eng.

Amrita Mathuriya HPC Application Engineer DCG/Intel Corporation November 2016

Parallel Algorithm Engineering

Munara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

Performance analysis tools: Intel VTuneTM Amplifier and Advisor. Dr. Luigi Iapichino

HPCG on Intel Xeon Phi 2 nd Generation, Knights Landing. Alexander Kleymenov and Jongsoo Park Intel Corporation SC16, HPCG BoF

IXPUG 16. Dmitry Durnov, Intel MPI team

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing

Department of Informatics V. Tsunami-Lab. Session 4: Optimization and OMP Michael Bader, Alex Breuer. Alex Breuer

On the Use of Large Intel Xeon Phi Clusters for GEANT4-Based Simulations

Porting and Optimisation of UM on ARCHER. Karthee Sivalingam, NCAS-CMS. HPC Workshop ECMWF JWCRP

Kevin O Leary, Intel Technical Consulting Engineer

Maximizing Memory Performance for ANSYS Simulations

Online Course Evaluation. What we will do in the last week?

Reusing this material

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

Non-uniform memory access (NUMA)

Performance Optimization of Smoothed Particle Hydrodynamics for Multi/Many-Core Architectures

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Performance Tools for Technical Computing

System Requirements. PREEvision. System requirements and deployment scenarios Version 7.0 English

Summary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts. Abdeslem DJAOUI PPD Rutherford Appleton Laboratory

Advances of parallel computing. Kirill Bogachev May 2016

Goverlan Reach Server Hardware & Operating System Guidelines

Challenges of Scaling Algebraic Multigrid Across Modern Multicore Architectures. Allison H. Baker, Todd Gamblin, Martin Schulz, and Ulrike Meier Yang

Embree Ray Tracing Kernels: Overview and New Features

Introduction to KNL and Parallel Computing

AMD Opteron 4200 Series Processor

Unit 2: Technology Systems. Computer and technology systems

The Intel Xeon Phi Coprocessor. Dr-Ing. Michael Klemm Software and Services Group Intel Corporation

Intel tools for High Performance Python 데이터분석및기타기능을위한고성능 Python

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

Intel HPC Technologies Outlook

Towards Exascale Computing with the Atmospheric Model NUMA

CMS Simulation Software

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor

Optimizing Weather Model Radiative Transfer Physics for the Many Integrated Core and GPGPU Architectures

System Requirements. PREEvision. System requirements and deployment scenarios Version 7.0 English

LCDG4 at NIU Status and Plans

Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access

Computer Architecture and Structured Parallel Programming James Reinders, Intel

A Parallelizing Compiler for Multicore Systems

Newest generation of HP ProLiant DL380 takes #1 position overall on Oracle E-Business Suite Small Model Benchmark

MULTI-CORE PROGRAMMING. Dongrui She December 9, 2010 ASSIGNMENT

Native Computing and Optimization. Hang Liu December 4 th, 2013

Transcription:

Geant4 MT Performance Soon Yung Jun (Fermilab) 21 st Geant4 Collaboration Meeting, Ferrara, Italy Sept. 12-16, 2016

Introduction Geant4 multi-threading (Geant4 MT) capabilities Event-level parallelism Available since 10.0 Status (see Andrea s talk in Plenary 7) Readiness for large-scale computing? Validation (not a scope of this talk) Performance Basic performance metrics Event throughput (weak scaling) Memory reduction Scopes of this talk MT performance on different hardware platforms Profiling results 2

Performance Profiling Experiments Application : a standalone CMS detector simulation the CMS geometry (gdml) a volume based magnetic field map excerpted from CMSSW single particle samples (50 GeV pi-,e-) and PYTHIA H à ZZ cmsexp (sequential) and cmsexpmt (multi-threading) Platform tested for this talk Intel Xeon X5650: dual-socket 6-core (total 12 cores), 12GB AMD Opertron 6128: quad-socket 8-core (total 32 cores), 64GB Intel Xeon Phi 5110P (MIC, Knight s Corner): 60 cores, 8GB Intel Xeon Phi (Knight s Landing), 64 cores, 96GB+16MCDRAM Profiling tools Open Speedshop (OSS) v2.2 Intel VTune Amplifier XE (VTune) 2016 3

MT Performance on General Purpose CPUs: Intel vs. AMD Event throughput = the number of event processed/time Speedup efficiency: ε Nthreads =,-./01-203(56706839:;),-./01-203 (=3-.6:>? ) Nthreads Intel Xeon X5650 AMD Opteron 6128 What to understand (Geant4 10.2.r06) MT is (sometimes) faster than sequential Degradation as the number of threads increases in AMD 4

Profiling Comparison: Intel Xeon OSS compare: Sequential vs. MT with1 thread (% of time) Reported time: 1951 (s1) vs. 1878 (t1) seconds for 1028 events of 50 GeV pions (10.2.r06) t1 s1 A hint of difference in SteppingManager, but not conclusive Need to cross-check the number of steps/tracks (by the particle type) 5

Profiling Comparison: AMD Opertron OSS compare: 32 threads vs. 1 thread (% of time) Experiment with 1028 events of 50 GeV pions (10.2.r06) t32 t1 Clear signs of difference in G4Navigator and ParticleChangesForTransport::UpdateStepForAlongsStep 6

H à ZZ : Intel vs. AMD Speedup efficiency (ε) as the number of threads (10.2.r06) The number of events processed = 50 x Nthreads Intel Xeon X5650 AMD Opteron 6128 7

Profiling Comparison: Intel Xeon OSS compare: Intel sequential vs. MT 1 thread (% of time) Experiments with 50 events of H à ZZ (10.2.r06) 8

Profiling Comparison: AMD Opertron OSS compare: MT 2 threads vs. MT 32 thread (% of time) Experiments with 50xNthreads events of H->ZZ (10.2.r06) Again hints of difference: adding counters for the number of steps tracks by the particle type for MT 9

MT performance: Xeon Phi 5110 (MIC, Knight s Corner) cmsexp on MIC: 5 GeV pi- (Events = 1028 x N-threads) 60 cores (4 way hyper-threading), 1.03 GHz, 7.8 GB memory Significant scalability loss from N threads = 2 to N threads = 4 Hit memory limit (~7.3 GB available) @ 120 threads Need to re-measure throughput with physics samples (threshold for the memory limit and the maximal cores to utilize) 10

MT Performance on KNL Performance on Intel Xeon Phi Processor (Knight's Landing ) Developer Edition: Single Socket 1.30 GHz, 64 core MEMORY: 96GB, 2133MHz DDR4, 16GB MCDRAM memory Geant4 10.2.p02 with -xmic-avx512 Experiment with N-threads x1028 Events of 5 GeV pi- (10.2.r06) 11

Profiling Results: KNL Hotspots with N-threads = 256 _L_unlock/lock 12

Profiling Results: KNL (N threads = 256) G4LogicalVolume::initialiseWorker _L_lock: also called by G4LogicalVolume::initialiseWorker 13

Profiling Results: KNL (N threads = 256) G4NavigationLevel::operator= Also seen with N threads = 198 and 128 14

Geant4MT: Profiling Result ( N threads =128) 15

Summary Reviewed Geant4 MT performance standalone CMS detector simulation (single particle, HàZZ) Performance on different systems and profiling results No major issues on Intel Xeon Degradation seen on AMD as the number of threads is partially understood Xeon Phi (KNC) shows problems at N-threads > 120 Xeon Phi (KNL) shows stable performance More tests to understand results of AMD/KNL profiling data Examine stepping information on AMD and KNL (sequential vs. 1-threads and 1-thread vs. N-threads) Test H à ZZ on KNL (scalability and memory) 16

Backup Slides

Intel Xeon vs. AMD Opteron NUMA memory nodes, sockets, shared caches cores Xeon X5650 Opteron 6128HE 18

Exclusive time: Intel Xeon OSS compare: 1 threads vs. 12 thread (% of time) Experiment with1028 events of 50 GeV pions (10.2.r06) t12 t1 No changes in call paths No significant timing perturbation (a good sanity check!) 19

AMD OSS compare: AMD 32 threads vs. 1 thread (% of time) Persistency in difference (by version, by different samples)? Experiments with 1028 events of 50 GeV e- and pi- (10.2.r07) 20

Degradation on AMD: Persistency OSS compare: AMD 32 threads vs. 1 thread (% of time) Experiment with 1028 events of 50 GeV pi- (10.2.r07) G4Navigator::LocateGloalPointAndSetup is perturbative (consistent with 10.2.r06) 21

Degradation on AMD: Persistency OSS compare: AMD 32 threads vs. 1 thread (% of time) Experiment with 1028 events of 50 GeV e- (10.2.r07) G4TouchableHistory::GetVolume 22

Geant4 MT Performance: Xeon Phi (Andrea Dotti) CMS geometry, uniform (4T) B-filed (10.2.r06) Total number of events processed = 10*(number of threads) Intel Xeon Phi 3210A (57 cores), 6GB Xron Phi (MIC) 23

Geant4MT: Profiling Result (N threads = 256) 24

Geant4MT: Profiling Result (Sequential) 25

Geant4MT: Profiling Result ( N thread = 1) 26

Geant4MT: Profiling Result (N threads = 32) 27

Geant4MT: Profiling Result (N threads = 64) 28

Geant4MT: Profiling Result (N threads = 192) 29

Geant4MT Performance on KNL: icc (16.0.3) vs. gcc (4.9.1) Performance on Intel Xeon Phi Processor (Knight's Landing ) KNL triples both scalar and vector performance compared with KNC and offers, up to 3.0 TFlop/sec (double) per processor. 30