The GeantV prototype on KNL. Federico Carminati, Andrei Gheata and Sofia Vallecorsa for the GeantV team

1 The GeantV prototype on KNL Federico Carminati, Andrei Gheata and Sofia Vallecorsa for the GeantV team

2 Outline. Introduction (digression on vectorization approach); geometry benchmarks: vectorization and scalability; profiling + issues; particle transport improvement; sub-node clustering + NUMA; task-based approach; fast simulation using ML; improved GeantV scheduling; preliminary features. IXPUG Annual Spring Conference 2017

3 The problem. Detailed simulation of subatomic particles in detectors is essential for data analysis and detector design. Complex physics and geometry modeling. Heavy computation requirements, massively CPU-bound. 200 computing centers in 20 countries: > 600k (20% WLCG): 65k processor cores; 30 PB disk + > 35 PB tape storage. More than 50% of WLCG power is used for simulations.

4 GeantV: adapting simulation to modern hardware. Classical simulation: hard to approach the full machine potential. GeantV simulation: needs to profit at best from all processing pipelines. Single-event scalar transport: embarrassing parallelism, low cache coherence, low vectorization (scalar auto-vectorization). Multi-event vector transport: fine-grain parallelism, high cache coherence, high vectorization (explicit multi-particle interfaces).

5 Tested hardware
Processor | Code name | #cores | Instruction set
Xeon E GHz | Ivy Bridge | 2 x 12 | AVX
Xeon E GHz | Haswell | 2 x 8 | AVX2
Intel Core i7 3.4 GHz | Skylake | 4 | AVX2
Xeon Phi 1.3GHz | KNL | 64 | AVX 512
Compilers used: different versions of icc, gcc and clang.
KNL memory configurations: quadrant + flat mode (tested); MCDRAM in cache mode (ongoing tests); hot data on MCDRAM (memkind/hbwmalloc) (ongoing tests).

6 GeantV approach: boosting vectors. Transport particles in vectors ("baskets"). Filter by geometry volume or physics process. Redesign library and workflow to target fine-grain parallelism. Use an abstraction for vector types and their operations to achieve portable vectorization. Aim for 3x-5x faster code, understand hard limits for 10x.
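The basket idea above can be illustrated with a minimal sketch. It is written in Python for brevity and is not GeantV code: the `(track_id, volume_id)` track layout, the `basketize` function and the fixed `basket_size` are all assumptions made for the example. Tracks are filtered by geometry volume, and a basket is released for vector transport once it is full.

```python
from collections import defaultdict


def basketize(tracks, basket_size):
    """Group tracks by geometry volume and emit baskets of a fixed size.

    tracks: iterable of (track_id, volume_id) pairs (hypothetical layout).
    Returns (full_baskets, leftovers): full_baskets is a list of
    (volume_id, [track_id, ...]) in the order the baskets filled up,
    leftovers maps volume_id to the tracks still waiting for a full basket.
    """
    pending = defaultdict(list)
    full_baskets = []
    for track_id, volume_id in tracks:
        bucket = pending[volume_id]
        bucket.append(track_id)
        if len(bucket) == basket_size:
            # The basket is full: hand it over and start a fresh one.
            full_baskets.append((volume_id, bucket))
            pending[volume_id] = []
    leftovers = {vol: trk for vol, trk in pending.items() if trk}
    return full_baskets, leftovers
```

Tracks left in partially filled baskets are the source of the scheduling questions discussed later in the talk: they either wait for more tracks in the same volume or have to be flushed and processed below full vector width.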

7 Why a vectorization abstraction? Performance with auto-vectorization varies wildly for different compilers and versions: the Intel C/C++ compiler is significantly ahead of GCC and Clang. Compiler intrinsics are not an ideal interface: portability is an issue, exposure to users as well. Vectorization libraries do not always work well across all architectures, e.g. UME::SIMD uses scalar emulation for AVX2 (not as good as KNL). Still need a portable solution when no library is available. A brief digression on the subject follows.

8 VecCore library API

9 VecCore backends

10 Code Example: Quadratic Solver

11 Code Example: Quadratic Solver

12 Code Example: Quadratic Solver

13 Code Example: Quadratic Solver
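The code shown on the quadratic solver slides is not preserved in this transcript. As a stand-in, here is a hedged sketch of the general technique such examples usually demonstrate: replacing branches with masks and a select (blend) operation so that every SIMD lane executes the same instructions. It is plain Python operating lane by lane, not the VecCore API; `select` and `solve_quadratic_lanes` are hypothetical names, and the sketch assumes all `a` coefficients are nonzero.

```python
import math


def select(mask, if_true, if_false):
    """Lane-wise blend: both inputs are already computed, the mask picks one."""
    return if_true if mask else if_false


def solve_quadratic_lanes(a, b, c):
    """Solve a*x^2 + b*x + c = 0 for every lane (assumes each a != 0).

    Returns (x1, x2, valid). Lanes without real roots hold NaN and have
    valid == False. No lane takes an early exit, mirroring SIMD execution.
    """
    x1, x2, valid = [], [], []
    nan = float("nan")
    for ai, bi, ci in zip(a, b, c):
        disc = bi * bi - 4.0 * ai * ci
        mask = disc >= 0.0
        # Sanitise the input so sqrt is defined on masked-out lanes too.
        root = math.sqrt(select(mask, disc, 0.0))
        # Numerically stable form: avoids cancellation between -b and root.
        q = -0.5 * (bi + math.copysign(root, bi))
        q_ok = q != 0.0
        r1 = q / ai
        r2 = select(q_ok, ci / select(q_ok, q, 1.0), r1)
        x1.append(select(mask, r1, nan))
        x2.append(select(mask, r2, nan))
        valid.append(mask)
    return x1, x2, valid
```

In a real SIMD backend the loop body would run once on whole vector registers and `select` would map to a hardware blend; the Python loop only makes the per-lane semantics explicit.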

14 Performance Comparison on Skylake. Make sure code vectorizes with ANY compiler!

15 Performance Comparison on Intel Xeon Phi. AVX512 was not well supported by vector libraries with non-Intel compilers in the initial version; this is no longer the case.

16 A real example: Electromagnetic Physics Models

17 Vectors and the challenges. Gather/reshuffle data into SOA, then into SIMD registers. No free lunch: need to keep data gathering overheads < vector gains. [Chart: run-time fraction spent in different parts of GeantV, including geometry; 24-core dual-socket E GHz (IVB).]
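The "no free lunch" condition can be made concrete with a small sketch. The dictionary-based track layout and both function names are assumptions for illustration, and the cost model is a deliberately simple one that is not taken from the talk: scalar run time is normalised to 1, vectorised compute takes `1 / vector_gain`, and gathering adds `gather_overhead` as a fraction of the scalar time.

```python
def gather_to_soa(tracks, fields):
    """Gather an array of structures (list of dicts) into a structure of arrays."""
    return {name: [track[name] for track in tracks] for name in fields}


def net_speedup(vector_gain, gather_overhead):
    """Net speedup under the simple cost model described above.

    vector_gain: speedup of the vectorised kernel alone (e.g. 4.0).
    gather_overhead: time spent gathering, as a fraction of the scalar time.
    """
    return 1.0 / (1.0 / vector_gain + gather_overhead)
```

Under this model vectorisation only pays off when `gather_overhead < 1 - 1 / vector_gain`. For example, a kernel that is 4x faster in vector form breaks even when gathering costs 75% of the original scalar time.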

18 Geometry navigation on KNL. X-ray scan of detector volumes: trace a grid of virtual rays through the geometry. Simplified geometry emulating a tracker detector. Intel Xeon Phi CPU 1.30GHz. Compare the GeantV basket approach to: an ideal vector case (no basketizing overheads); classical scalar navigation (ROOT). AVX512 vectorization enforced by API (UME::SIMD backend). ~100x speedup for the ideal and basket versions. [Plot: speedup vs. the same configuration on 1 thread, T(single thread)/t(multi threads), as a function of NTHREADS, for the ideal vector, basket and classical cases.]

19 Performance. GeantV gives excellent benefits with respect to ROOT in terms of speedup. High vectorization intensity achieved for both ideal and basketized cases. AVX-512 brings an extra factor of ~2 to our benchmark. [Plots: speedup with respect to the multithreaded classical approach, T(GeantV)/T(Classical), for ideal vector vs. classical and basket vs. classical, as a function of Nthreads; absolute times (s) for AVX2 and AVX512 and the ratio T(AVX2)/T(AVX512) for the ideal vector case, as a function of Nthreads.]

20 Improving the performance. Sub-node clustering with multiple propagators: improve data/processing locality and reduce contention. Improved memory management in the basketizing procedure (NUMA awareness). TBB-based, task-based version. Fast simulation using ML/DL.

21 Sub-node clustering. Known scalability issues of full GeantV due to synchronization in re-basketizing. New approach deploying several propagators, clustering resources at sub-node level. Objectives: improved scalability at the scale of KNL and beyond; address both many-node and multi-socket (HPC) modes + non-homogeneous resources. Implemented recently, being tested on KNL.

22 NUMA awareness. Replicate schedulers on NUMA clusters. One basketizer per NUMA node. libhwloc to detect topology. Use pinning/NUMA allocators to increase locality. Multi-propagator mode running one or more clusters per quadrant. Loose communication between NUMA nodes at the basketizing step. Implemented, currently being tested.

23 Multi-propagators prototype. Full track transport and basketization procedure. Simplified calorimeter. Tabulated physics (EM processes + various materials). Scalability gets better by increasing the number of propagators. Not final results, still fixing/optimizing. Xeon: good scalability up to the number of physical cores. [Plots: speed-up vs. #cores and #threads (up to 256) for KNL (old version), KNL with 8 propagators, KNL with 4 propagators and Haswell (E5-2630).]

24 Task-based GeantV. [Diagram components] Task-based external framework: injects 0..n events into the EventServer. Initial task: top-level task spawning a branch in the TBB tree of tasks. Transport task: transports one basket for one step, reusing tracks and keeping locality; may be further split into subtasks. Prioritizer: flush baskets/prioritize events task (command: "dump all your baskets"). Basketizer(s): concurrent service to regroup tracks; enqueues baskets. Basket queue: concurrent service (workload balancing). Flow control task: inspects the queue (event finished? queue empty?). Task-based external framework: user scoring, I/O task, user digitizer tasks.

25 TBB tasks: preliminary results. A first implementation of the TBB task-based approach on the full track transport prototype. TBB preferred over OpenMP tasks due to requirements for integration with user code and other frameworks. Some overheads on Haswell/AVX2, not so obvious on KNL/AVX512. Re-entrance of tasks compared to the static approach. [Plots: AVX2 on Intel(R) Xeon(R) CPU E GHz, 2 sockets x 8 physical cores; Intel Xeon Phi CPU 1.30GHz.]

26 The full prototype. Exercise at the scale of LHC experiments (CMS). Full geometry + uniform magnetic field. Tabulated physics, fixed 1 MeV energy threshold. Full track transport and basketization procedure. First results on speed-up (comparison to the classical approach, single-thread). [Plots: T(1 thread)/t(n threads) vs. Nthreads; single thread: T_GV/T_classical.]
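The scalability metric used on these plots, T(1 thread)/t(n threads), is straightforward to compute from wall-clock times. The sketch below also derives parallel efficiency, which is a standard companion metric but is not quoted in the slides; the function names are assumptions.

```python
def speedup(t_single, t_multi):
    """Speedup as plotted in the talk: T(1 thread) / t(n threads)."""
    return t_single / t_multi


def parallel_efficiency(t_single, t_multi, n_threads):
    """Speedup divided by the thread count; 1.0 means perfect scaling."""
    return speedup(t_single, t_multi) / n_threads
```

With purely illustrative timings of 120 s on one thread and 2 s on 64 threads, the speedup is 60 and the efficiency is 60/64, or about 94%.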

27 Full prototype performance on KNL. Overall we fill the VPUs reasonably well (for function calls that are supposed to vectorize). Memory access analysis shows we are not bandwidth bound: most of the code runs at low utilisation (<12 GB/sec). [Plot: total, read and write bandwidth in GB/sec.] A lot of copying to regroup SIMD vectors -> cache misses + contention, high memory usage.

28 GeantV version 3: a generic vector flow approach. Simulation stages formalize the different steps in the track propagation algorithm. [Diagram components] Executor thread: passes GeantTrack * pointers through a sequence of stage buffers and SimulationStage objects. SimulationStage: Process() loops over tracks and calls the virtual Select(track) to choose among scalar or basketized handlers for all possible actions for the stage. Handlers (Handler 1 with Basketizer 1, Handler 2 (active) with Basketizer 2): AddTrack(track, ) fills an empty basket through the basketizer; empty baskets are taken from the thread pool (GeantTaskData). Scalar path: virtual DoIt(track); vector path: virtual DoIt(, ), e.g. ComptonFilter::DoIt; default behavior to override. Follow-up: loop over SimulationStage::fFollowUps[i]. Stack-like buffer: copy tracks.
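The stage and handler flow can be sketched roughly as follows. The names `SimulationStage`, `Select`, `AddTrack` and `DoIt` come from the slide, but everything else is an assumption: this is a Python illustration of the dispatch pattern, not the GeantV C++ interface, and the track tuples, the `basket_size` parameter and the `calls` log are hypothetical.

```python
class Handler:
    """One action of a stage; scalar if basket_size is None, else basketized."""

    def __init__(self, name, basket_size=None):
        self.name = name
        self.basket_size = basket_size
        self.basket = []   # partially filled basket (basketized mode only)
        self.calls = []    # record of (mode, tracks) for inspection

    def do_it(self, track):
        """Scalar DoIt(track): process a single track."""
        self.calls.append(("scalar", [track]))
        return [track]

    def do_it_vector(self, basket):
        """Vector DoIt(basket): process a full basket in one call."""
        self.calls.append(("vector", list(basket)))
        return list(basket)

    def add_track(self, track):
        """AddTrack: return a full basket when one is ready, else None."""
        self.basket.append(track)
        if len(self.basket) == self.basket_size:
            full, self.basket = self.basket, []
            return full
        return None


class SimulationStage:
    """A stage selects a handler per track and forwards the processed tracks."""

    def __init__(self, select):
        self.select = select  # callable: track -> Handler

    def process(self, tracks):
        output = []
        for track in tracks:
            handler = self.select(track)
            if handler.basket_size is None:
                output.extend(handler.do_it(track))
            else:
                full = handler.add_track(track)
                if full is not None:
                    output.extend(handler.do_it_vector(full))
        return output
```

The point of the pattern is that a stage can mix scalar and vector handlers: rarely used actions run per track, while popular ones accumulate tracks until a vector call is worthwhile.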

29 Version 3 preliminary (runapp benchmark). runapp, 300 MeV electrons, 0.1 T field: version 3 vs. version 2 on Intel(R) Xeon(R) CPU E5-2630 v3. [Plot: time [sec] vs. #threads for vecprot_v2 and vecprot_v3, with the NUMA limit and HT limit marked; annotations: runapp: 9766 Mbytes RSS; runapp: 320 Mbytes RSS, flat.]

30 Version 3 preliminary: VTune analysis. Hotspots, concurrency. Features of the new model.

31 Version 3 preliminary: VTune analysis. Cache: L2 hit rate is high and L2 hit/miss bound is low for all hotspots. Memory bandwidth seems OK: it peaks at only 10 GB/sec, and MCDRAM is not really used in this mode. Analysis continues: vectorization; investigating NUMA effects; hot buffers allocation on MCDRAM; cache vs. flat memory model.

32 Going beyond x10: fast simulation. In the best-case scenario GeantV will give an O(10) speedup; O(100) is rather needed to cope with the expected HL-LHC needs. Improved, efficient and accurate fast simulation: currently available solutions are detector dependent; looking for a generic approach + user API. A general fast simulation tool based on Machine Learning techniques: ML techniques are more and more performant in different HEP fields; optimizing training time becomes crucial.

33 ML/DL engine for fast simulation. Train on full simulation; test training on real data. Test different techniques/models: multi-objective regression, feature extraction; Predictive Clustering Trees & standard perceptron (TMVA); generative adversarial networks (GANs). Later: embedded algorithm for hyperparameter tuning and meta-optimization. First 3D images of single-particle showers in the LCD ECAL obtained by training a GAN (TensorFlow + Keras). [Image: GAN generated, 100 GeV e.] Fast training by parallelizing (many-core, clusters), lower communication overhead.

34 Conclusion. GeantV already delivers a part of the expected performance. Many optimization requirements; now understanding how to handle most of them. GeantV dispatcher version 3 is now ready; tests on KNL are ongoing. Additional levels of locality (NUMA) available: topology detection already in GeantV, currently being integrated. Exploring the task-based approach: a TBB-enabled version is ready. Next step: integration with physics and optimization.

35 Thank you!

36 Backup slides. Vectorization examples using the VecCore abstraction library.

37 Future R&D: GeantV plans for HPC environments. Standard mode (1 independent process per node): always possible, no-brainer; possible issues with work balancing (events take different time); possible issues with output granularity (merging may be required). Multi-tier mode (event servers): useful to work with events from file, to handle merging and workload balancing; communication with event servers via MPI to get event ids in common files. [Diagram: event feeders and an event server connected via MPI to transport nodes (Node 1, Node 2, ..., Node mod[n]), each running transport on NUMA 0 and NUMA 1, plus a merging service.]
