Performance of a multi-physics code on Cavium ThunderX2
|
|
- Hector Eaton
- 5 years ago
- Views:
Transcription
1 Performance of a multi-physics code on Cavium ThunderX2 User Productivity Enhancement, Technology Transfer, and Training (PETTT) Presented by John G. Wohlbier (PETTT/Engility), Keith Obenschain, Gopal Patnaik (NRL-DC) September 26, 2018 PETTT
2 This material is based upon work supported by, or in part by, the Department of Defense (DoD) High Performance Computing Modernization Program (HPCMP) under the User Productivity, Technology Transfer and Training (PETTT) Program, contract number GS04T09DBC0017. PETTT 2
3 Outline Aster Comparison methodology Initial performance numbers Extract serial kernel Rev 1 performance numbers Extract more kernels Characterize kernels Summary and future work PETTT 3
4 Aster 1 Direct drive inertial confinement fusion code Spherical, structured grid Two temperature explicit CFD Tabular equation of state Implicit species heat conduction Laser ray tracing Fusion reactions Multi-group radiation diffusion Operator split time stepping 1 I.V. Igumenshchev, et al. Three-Dimensional Modeling of Direct-Drive Cryogenic Implosions on OMEGA. Physics of Plasmas 23, (2016). 2 Image: PETTT 4
5 Comparison methodology Fixed problem size on one node 21.3M cells/node 1 MPI rank/core Node characteristics Two sockets 32 ranks on SKL, 2 x 125W TDP 56/64 ranks on TX2, 2 x 165W TDP Compilers Intel/Intelmpi on SKL Gcc/OpenMPI on TX2 Manufacturer Architecture Part SIMD width Peak GF/s Peak DRAM GB/s Cores Peak DRAM GB/s/core Intel Skylake Cavium ThunderX2 B Memory channels ThunderX2 B PETTT 5
6 Compare STREAM Triad and HPCG Triad peak: Peak GB/s / 24 B/triad x 2 FLOP/triad / Peak GF/s STREAM HPCG Comparison methodology Manufacturer Architecture Part Peak GF/s Peak DRAM GB/s CPU N rank Stack Read (GB/s) Write (GB/s) Triad peak efficiency (%) Total (GB/s) STREAM Triad (GB/s) Threads Intel Skylake x x Cavium ThunderX2 B0 2 x x Cavium ThunderX2 B1 2 x x FLOPS (GF/s) Peak (GF/s) Skylake 32 Intel 19.0, Intel MPI GB/s % peak ThunderX2 64 gcc 7.2, OpenMPI ThunderX2 56 gcc 7.2, OpenMPI PETTT 6
7 Initial performance numbers Ten cycles of Aster 21.3M cells Architecture Ranks Time (s) Ratio SKL TX Profile with Arm MAP mpi_recv + gauss_seidel_in_plane are largest cost, but have very similar absolute run times pow + exp have disparate run times SKL 15s TX2 60s Difference (60-15=45) is nearly ½ of the total difference ( =107) PETTT 7
8 Initial performance numbers Profile with Arm MAP TX2 Self SKL % seconds Function % seconds 13.70% mpi_recv_ 14.40% % gauss_seidel_in_plane 18.70% % pow_finite 4.60% % mpi_waitall_ 3.60% % mpi_bcast_ 1.60% % _gfortran_string_index 7.20% % mpi_send_ 3.00% % exp1 <0.1% 4.70% ppi 4.00% % get_diff_coefs 6.10% PETTT 8
9 Extract serial kernel Identified pow heavy Aster function Extract kernel using KGen Instruments application and generates verification data for kernel aster_tubr ( temperature update by radiation ) Kernel used internally at Arm to work on precision issues with armflang Arm recommends using Arm Performance Libraries (ArmPL) Found Arm Optimized-Routines (AOR) Upstream for ArmPL PETTT 9
10 Extract serial kernel aster_tubr results Run on same input data Weak scaling implies TX2 data set would be ½ size as SKL data set Best effective TX2 time ~ 1.36s compared to 0.88s on SKL Architecture Time (s) Compiler, library SKL 0.88 Intel TX gcc 7.2, default TX armflang , default TX armflang , -L${ARMPL_LIBRARIES} -lamath TX gcc 7.2, Arm Optimized-Routines -lmathlib PETTT 10
11 Rev 1 performance numbers Ten cycles of Aster 21.3M cells Architecture Ranks Time (s) Ratio SKL TX TX Profile with Arm MAP mpi_recv + mpi_waitall + mpi_send SKL 69s TX2 95s Difference (95-69=16) is nearly as large as total difference ( =19) PETTT 11
12 Rev 1 performance numbers Profile with Arm MAP TX2 Self SKL % seconds Function % seconds 13.60% gauss_seidel_in_plane 18.70% % mpi_recv_ 14.40% % mpi_waitall_ 3.60% % _gfortran_string_index 7.20% % ppi 4.00% % mpi_send_ 3.00% % get_diff_coefs 6.10% % _int_free <0.1% 3.20% log_inline [inlined] 0.50% % get_residual PETTT 12
13 Extract more kernels gauss_seidel_in_plane Most expensive function Called many times during l-cycles and V-cycles with variable sized input data for fine and coarse grids Tridiagonal solver in radial direction introduces MPI sweep like dependency Characterization useful, but not as important as multigrid Many calls to gauss_seidel_in_plane Accounts for 48% inclusive time PETTT 13
14 Extract more kernels gauss_seidel_in_plane Architecture Ranks Time (s) Ratio SKL TX multigrid Architecture Ranks Time (s) Ratio SKL TX TX MPI in multigrid Function SKL (s) TX2 64 (s) TX2 56 mpi_recv mpi_waitall mpi_send Total PETTT 14
15 Run multigrid kernel through Intel VTune on SKL to determine performance characterization Intel performance analysis tools provide extensive detail multigrid kernel is memory bound on SKL 65% of pipeline slots stalled due to load/store ~10% clock ticks stalled on cache 35% clock ticks stalled on DRAM Characterize kernels 41% clock ticks stalled for DRAM bandwidth boundedness 16% clock ticks stalled for DRAM latency PETTT 15
16 Characterize kernels Multigrid Time vs Arithmetic Intensity Low arithmetic intensity implies memory bandwidth will be limiting factor PETTT 16
17 Characterize kernels Multigrid roofline on SKL Heavy vertical lines show bounds of measured arithmetic intensity PETTT 17
18 Characterize kernels Based on DRAM bandwidth boundedness, expect higher aggregate bandwidth to run code faster Would like to measure effective bandwidth on Arm Histogram shows MPI imbalance due to sweep dependency of tridiagonalsolver Larger number of ranks on TX2 as SKL exacerbates sweep dependency Number of ranks in angular dimensions stays same, only sweep direction increases in ranks PETTT 18
19 Multigrid on four CPU architectures Single node performance for multigrid kernel Available memory bandwidth has large impact on performance Intel VTune measured 41% clock ticks limited by DRAM bandwidth boundedness More work needed to understand discrepancy between TX2 and EPYC CPU Bandwidth (GB/s) Measured kernel time (s) Broadwell Aster time (s) Skylake ThunderX EPYC PETTT 19
20 Summary and future work Node level results for Aster code are encouraging Initially disparate results were reconciled through profiling and finding correct math libraries Codes that are clearly bandwidth bound might be expected to perform similarly on TX2 and SKL Shared memory byte transport layers show similar bandwidths and latencies when measured with micro-benchmarks Additional latencies appear to be present in Aster and the extracted kernels, which requires further study Preparing Aster to run on Astra Will perform multi-node scaling studies next Sweep algorithm needs to be studied for improvement Will benefit both SKL and TX2 PETTT 20
21 Additional Material PETTT 21
22 Xeon Gold GHz 16 cores, 32 threads Max turbo frequency: 3.7 GHz 22 MB L3 cache TDP 125 W Max memory speed: 2666 MHz Number of AVX-512 FMA Units: 2 Max number of memory channels: 6 Single: GiB/s [= (64/8/ ) GiB x 2666 MHz], [= 21.3 GB/s] Intel Skylake Double: GiB/s [= 42.7 GB/s] Quad: GiB/s [= 85.4 GB/s] Hexa: GiB/s [= GB/s] PETTT 22
23 Floating point capacity 2 x 512 bit VPU/core Fused Multiply Add (FMA): 2 FLOP/VPU/cycle Double precision 2 FLOP/VPU/cycle x 2 VPU/core x 8 reals = 32 FLOP/cycle/core 32 FLOP/cycle/core x [ ] GHz = [ ] GF/s/core [ ] GF/s/core x 16 cores = [ ] GF/s Single thread measurement 1 : 110 GF/s/core Single precision 1 Intel Advisor 19 Intel Skylake 2 FLOP/VPU/cycle x 2 VPU/core x 16 reals = 64 FLOP/cycle/core 64 FLOP/cycle/core x [ ] GHz = [ ] GF/s/core [ ] GF/s/core x 16 cores = [ ] GF/s Single thread measurement 1 : 220 GF/s/core PETTT 23
24 Cavium Thunder X2 2.2 GHz B0 stepping Some specs are best guess based on public information and A2 stepping 32 cores, 64 threads (up to 128 threads) Max turbo frequency? GHz 32 MB L3 cache TDP 165 W Max memory speed: 2666 MHz Max number of DDR4 memory channels: 8 Single: GiB/s [= (64/8/ ) GiB x 2666 MHz], [= 21.3 GB/s] Cavium ThunderX2 Double: GiB/s [= 42.7 GB/s] Quad: GiB/s [= 85.4 GB/s] Hexa: GiB/s [= GB/s] Octo: GiB/s [= GB/s] PETTT 24
25 Floating point capacity 1 2 x 128 bit VPU/core Fused Multiply Add (FMA): 2 FLOP/VPU/cycle Double precision 2 FLOP/VPU/cycle x 1 VPU/core x 4 reals = 8 FLOP/cycle/core 8 FLOP/cycle/core x 2.2 GHz = 17.6 GF/s/core 17.6 GF/s/core x 32 cores = 563 GF/s Single thread measurement:? GF/s/core Single precision Cavium ThunderX2 (32 core) 2 FLOP/VPU/cycle x 1 VPU/core x 8 reals = 16 FLOP/cycle/core 16 FLOP/cycle/core x 2.2 GHz = 35.2 GF/s/core 35.2 GF/s/core x 32 cores = 1126 GF/s Single thread measurement:? GF/s/core 1 Based on Broadcom Vulcan PETTT 25
26 Floating point capacity 1 2 x 128 bit VPU/core Fused Multiply Add (FMA): 2 FLOP/VPU/cycle Double precision 2 FLOP/VPU/cycle x 1 VPU/core x 4 reals = 8 FLOP/cycle/core 8 FLOP/cycle/core x 2.2 GHz = 17.6 GF/s/core 17.6 GF/s/core x 28 cores = 493 GF/s Single thread measurement:? GF/s/core Single precision Cavium ThunderX2 (28 core) 2 FLOP/VPU/cycle x 1 VPU/core x 8 reals = 16 FLOP/cycle/core 16 FLOP/cycle/core x 2.2 GHz = 35.2 GF/s/core 35.2 GF/s/core x 28 cores = 986 GF/s Single thread measurement:? GF/s/core 1 Based on Broadcom Vulcan PETTT 26
SNAP Performance Benchmark and Profiling. April 2014
SNAP Performance Benchmark and Profiling April 2014 Note The following research was performed under the HPC Advisory Council activities Participating vendors: HP, Mellanox For more information on the supporting
More informationKNL tools. Dr. Fabio Baruffa
KNL tools Dr. Fabio Baruffa fabio.baruffa@lrz.de 2 Which tool do I use? A roadmap to optimization We will focus on tools developed by Intel, available to users of the LRZ systems. Again, we will skip the
More informationIntel profiling tools and roofline model. Dr. Luigi Iapichino
Intel profiling tools and roofline model Dr. Luigi Iapichino luigi.iapichino@lrz.de Which tool do I use in my project? A roadmap to optimization (and to the next hour) We will focus on tools developed
More informationTrends in systems and how to get efficient performance
Trends in systems and how to get efficient performance Martin Hilgeman HPC Consultant martin.hilgeman@dell.com The landscape is changing We are no longer in the general purpose era the argument of tuning
More informationOpenFOAM Performance Testing and Profiling. October 2017
OpenFOAM Performance Testing and Profiling October 2017 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Huawei, Mellanox Compute resource - HPC
More informationOutline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends
Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory
More informationDesigning for Performance. Patrick Happ Raul Feitosa
Designing for Performance Patrick Happ Raul Feitosa Objective In this section we examine the most common approach to assessing processor and computer system performance W. Stallings Designing for Performance
More informationEARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA
EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA SUDHEER CHUNDURI, SCOTT PARKER, KEVIN HARMS, VITALI MOROZOV, CHRIS KNIGHT, KALYAN KUMARAN Performance Engineering Group Argonne Leadership Computing Facility
More informationPerformance and Energy Efficiency of the 14 th Generation Dell PowerEdge Servers
Performance and Energy Efficiency of the 14 th Generation Dell PowerEdge Servers This white paper details the performance improvements of Dell PowerEdge servers with the Intel Xeon Processor Scalable CPU
More informationPerformance analysis tools: Intel VTuneTM Amplifier and Advisor. Dr. Luigi Iapichino
Performance analysis tools: Intel VTuneTM Amplifier and Advisor Dr. Luigi Iapichino luigi.iapichino@lrz.de Which tool do I use in my project? A roadmap to optimisation After having considered the MPI layer,
More informationComparative Benchmarking of the First Generation of HPC-Optimised Arm Processors on Isambard
Prof Simon McIntosh-Smith Isambard PI University of Bristol / GW4 Alliance Comparative Benchmarking of the First Generation of HPC-Optimised Arm Processors on Isambard Isambard system specification 10,000+
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationVLPL-S Optimization on Knights Landing
VLPL-S Optimization on Knights Landing 英特尔软件与服务事业部 周姗 2016.5 Agenda VLPL-S 性能分析 VLPL-S 性能优化 总结 2 VLPL-S Workload Descriptions VLPL-S is the in-house code from SJTU, paralleled with MPI and written in C++.
More informationPreliminary Performance Evaluation of Application Kernels using ARM SVE with Multiple Vector Lengths
Preliminary Performance Evaluation of Application Kernels using ARM SVE with Multiple Vector Lengths Y. Kodama, T. Odajima, M. Matsuda, M. Tsuji, J. Lee and M. Sato RIKEN AICS (Advanced Institute for Computational
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationBei Wang, Dmitry Prohorov and Carlos Rosales
Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512
More informationCOMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory
COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,
More informationAQD-SD4U4GN21-HG Test Report
AQD-SD4U4GN21-HG Test Report Initiated by Brandon Lin Approved by Adonis Shih Page 1 of 13 Revision History: Revision Date Revision Description Creator 2016-3-1 1.0 First version released Brandon Lin Page
More informationVisualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017
Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference London, 2017 Agenda Vectorization is becoming more and more important What is
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past,
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationPerformance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA
Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November,
More informationJackson Marusarz Intel Corporation
Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationEvaluation of Intel Memory Drive Technology Performance for Scientific Applications
Evaluation of Intel Memory Drive Technology Performance for Scientific Applications Vladimir Mironov, Andrey Kudryavtsev, Yuri Alexeev, Alexander Moskovsky, Igor Kulikov, and Igor Chernykh Introducing
More informationAQD-D4U8GN21-HG Test Report
AQD-D4U8GN21-HG Test Report Initiated by Mila Chen Approved by Adonis Shih Page 1 of 13 Revision History: Revision Date Revision Description Creator 2015-12-28 1.0 First version released Mila Chen Page
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Multi-core today: Intel Xeon 600v4 (016) Xeon E5-600v4 Broadwell
More informationOptimizing Fusion PIC Code XGC1 Performance on Cori Phase 2
Optimizing Fusion PIC Code XGC1 Performance on Cori Phase 2 T. Koskela, J. Deslippe NERSC / LBNL tkoskela@lbl.gov June 23, 2017-1 - Thank you to all collaborators! LBNL Brian Friesen, Ankit Bhagatwala,
More informationIFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor
IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization
More informationFusion PIC Code Performance Analysis on the Cori KNL System. T. Koskela*, J. Deslippe*,! K. Raman**, B. Friesen*! *NERSC! ** Intel!
Fusion PIC Code Performance Analysis on the Cori KNL System T. Koskela*, J. Deslippe*,! K. Raman**, B. Friesen*! *NERSC! ** Intel! tkoskela@lbl.gov May 18, 2017-1- Outline Introduc3on to magne3c fusion
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationTurbostream: A CFD solver for manycore
Turbostream: A CFD solver for manycore processors Tobias Brandvik Whittle Laboratory University of Cambridge Aim To produce an order of magnitude reduction in the run-time of CFD solvers for the same hardware
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationApplication Performance on Dual Processor Cluster Nodes
Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys
More informationAnalysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth
Analysis Report v3 Duration 932.612 µs Grid Size [ 1024,1,1 ] Block Size [ 1024,1,1 ] Registers/Thread 32 Shared Memory/Block 28 KiB Shared Memory Requested 64 KiB Shared Memory Executed 64 KiB Shared
More informationBasics of Performance Engineering
ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson adrianj@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past, computers
More informationContour Detection on Mobile Platforms
Contour Detection on Mobile Platforms Bor-Yiing Su, subrian@eecs.berkeley.edu Prof. Kurt Keutzer, keutzer@eecs.berkeley.edu Parallel Computing Lab, University of California, Berkeley 1/26 Diagnosing Power/Performance
More informationLarge scale Imaging on Current Many- Core Platforms
Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,
More informationAccelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing
Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationWhat s P. Thierry
What s new@intel P. Thierry Principal Engineer, Intel Corp philippe.thierry@intel.com CPU trend Memory update Software Characterization in 30 mn 10 000 feet view CPU : Range of few TF/s and
More informationPower Measurement Using Performance Counters
Power Measurement Using Performance Counters October 2016 1 Introduction CPU s are based on complementary metal oxide semiconductor technology (CMOS). CMOS technology theoretically only dissipates power
More informationToward Automated Application Profiling on Cray Systems
Toward Automated Application Profiling on Cray Systems Charlene Yang, Brian Friesen, Thorsten Kurth, Brandon Cook NERSC at LBNL Samuel Williams CRD at LBNL I have a dream.. M.L.K. Collect performance data:
More informationVisualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature
Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Intel Advisor for vectorization
More information1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core
1. Many Core vs Multi Core 2. Performance Optimization Concepts for Many Core 3. Performance Optimization Strategy for Many Core 4. Example Case Studies NERSC s Cori will begin to transition the workload
More informationMassively Parallel Phase Field Simulations using HPC Framework walberla
Massively Parallel Phase Field Simulations using HPC Framework walberla SIAM CSE 2015, March 15 th 2015 Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationLS-DYNA Performance Benchmark and Profiling. October 2017
LS-DYNA Performance Benchmark and Profiling October 2017 2 Note The following research was performed under the HPC Advisory Council activities Participating vendors: LSTC, Huawei, Mellanox Compute resource
More informationBenchmarking CPU Performance. Benchmarking CPU Performance
Cluster Computing Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance,
More informationTurbo Boost Up, AVX Clock Down: Complications for Scaling Tests
Turbo Boost Up, AVX Clock Down: Complications for Scaling Tests Steve Lantz 12/8/2017 1 What Is CPU Turbo? (Sandy Bridge) = nominal frequency http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/hc23.19.9-desktop-cpus/hc23.19.921.sandybridge_power_10-rotem-intel.pdf
More informationArm Processor Technology Update and Roadmap
Arm Processor Technology Update and Roadmap ARM Processor Technology Update and Roadmap Cavium: Giri Chukkapalli is a Distinguished Engineer in the Data Center Group (DCG) Introduction to ARM Architecture
More informationarxiv: v1 [physics.comp-ph] 4 Nov 2013
arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department
More informationLS-DYNA Performance on Intel Scalable Solutions
LS-DYNA Performance on Intel Scalable Solutions Nick Meng, Michael Strassmaier, James Erwin, Intel nick.meng@intel.com, michael.j.strassmaier@intel.com, james.erwin@intel.com Jason Wang, LSTC jason@lstc.com
More informationIntel Architecture for HPC
Intel Architecture for HPC Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Salomon Architectures Intel R Xeon R processors v3 (Haswell) Intel R Xeon Phi TM coprocessor (KNC) Ohter
More informationTechnologies and application performance. Marc Mendez-Bermond HPC Solutions Expert - Dell Technologies September 2017
Technologies and application performance Marc Mendez-Bermond HPC Solutions Expert - Dell Technologies September 2017 The landscape is changing We are no longer in the general purpose era the argument of
More informationWhite paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation
White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation Next Generation Technical Computing Unit Fujitsu Limited Contents FUJITSU Supercomputer PRIMEHPC FX100 System Overview
More informationLinear Algebra for Modern Computers. Jack Dongarra
Linear Algebra for Modern Computers Jack Dongarra Tuning for Caches 1. Preserve locality. 2. Reduce cache thrashing. 3. Loop blocking when out of cache. 4. Software pipelining. 2 Indirect Addressing d
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More information28x 29x 30x [ 24x] 3.20GHz ( 133x24) CPU Clock Ratio CPU Frequency. CPU Host Clock Control [ Enable] CPU Host Frequency ( MHz ) 133
Intel Core i7 is a brand new architecture featuring the QPI bus which replaces the FSB bus. So, how does this affect overclocking? The Core i7 processor s frequency is Bclk * CPU multiplier. For ex. Intel
More informationIMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM
IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information
More informationBenchmark results on Knight Landing (KNL) architecture
Benchmark results on Knight Landing (KNL) architecture Domenico Guida, CINECA SCAI (Bologna) Giorgio Amati, CINECA SCAI (Roma) Roma 23/10/2017 KNL, BDW, SKL A1 BDW A2 KNL A3 SKL cores per node 2 x 18 @2.3
More informationGRID Testing and Profiling. November 2017
GRID Testing and Profiling November 2017 2 GRID C++ library for Lattice Quantum Chromodynamics (Lattice QCD) calculations Developed by Peter Boyle (U. of Edinburgh) et al. Hybrid MPI+OpenMP plus NUMA aware
More informationEPYC VIDEO CUG 2018 MAY 2018
AMD UPDATE CUG 2018 EPYC VIDEO CRAY AND AMD PAST SUCCESS IN HPC AMD IN TOP500 LIST 2002 TO 2011 2011 - AMD IN FASTEST MACHINES IN 11 COUNTRIES ZEN A FRESH APPROACH Designed from the Ground up for Optimal
More informationMunara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.
Munara Tolubaeva Technical Consulting Engineer 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries. notices and disclaimers Intel technologies features and benefits depend
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationPerformance of the AMD Opteron LS21 for IBM BladeCenter
August 26 Performance Analysis Performance of the AMD Opteron LS21 for IBM BladeCenter Douglas M. Pase and Matthew A. Eckl IBM Systems and Technology Group Page 2 Abstract In this paper we examine the
More informationToward Building up Arm HPC Ecosystem --Fujitsu s Activities--
Toward Building up Arm HPC Ecosystem --Fujitsu s Activities-- Shinji Sumimoto, Ph.D. Next Generation Technical Computing Unit FUJITSU LIMITED Jun. 28 th, 2018 0 Copyright 2018 FUJITSU LIMITED Outline of
More informationAtos ARM solutions for HPC
Atos ARM solutions for HPC Eric Eppe Head of Solution Marketing & Portfolio HPC & Quantum Global Business Line Tuesday, March 7th, HPC User Forum, TERATEC Atos HPC and ARM A long time engagement 2012 2013
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationLS-DYNA Performance Benchmark and Profiling. October 2017
LS-DYNA Performance Benchmark and Profiling October 2017 2 Note The following research was performed under the HPC Advisory Council activities Participating vendors: LSTC, Huawei, Mellanox Compute resource
More informationCluster Computing Paul A. Farrell 9/15/2011. Dept of Computer Science Kent State University 1. Benchmarking CPU Performance
Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance, since it is designed to defeat any effort to
More informationFABRICATION TECHNOLOGIES
FABRICATION TECHNOLOGIES DSP Processor Design Approaches Full custom Standard cell** higher performance lower energy (power) lower per-part cost Gate array* FPGA* Programmable DSP Programmable general
More informationThe Role of Performance
Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture The Role of Performance What is performance? A set of metrics that allow us to compare two different hardware
More informationBlue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft
Blue Gene/Q Hardware Overview 02.02.2015 Michael Stephan Blue Gene/Q: Design goals System-on-Chip (SoC) design Processor comprises both processing cores and network Optimal performance / watt ratio Small
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationPerformance comparison between a massive SMP machine and clusters
Performance comparison between a massive SMP machine and clusters Martin Scarcia, Stefano Alberto Russo Sissa/eLab joint Democritos/Sissa Laboratory for e-science Via Beirut 2/4 34151 Trieste, Italy Stefano
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationHow to Optimize Geometric Multigrid Methods on GPUs
How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient
More informationIntroduction to Performance Tuning & Optimization Tools
Introduction to Performance Tuning & Optimization Tools a[i] a[i+1] + a[i+2] a[i+3] b[i] b[i+1] b[i+2] b[i+3] = a[i]+b[i] a[i+1]+b[i+1] a[i+2]+b[i+2] a[i+3]+b[i+3] Ian A. Cosden, Ph.D. Manager, HPC Software
More informationAnalyzing the Performance of IWAVE on a Cluster using HPCToolkit
Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,
More informationSCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING
2/20/13 CS 594: SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING Heike McCraw mccraw@icl.utk.edu 1. Basic Essentials OUTLINE Abstract architecture model Communication, Computation, and Locality
More informationAcceleration of HPC applications on hybrid CPU-GPU systems: When can Multi-Process Service (MPS) help?
Acceleration of HPC applications on hybrid CPU- systems: When can Multi-Process Service (MPS) help? GTC 2018 March 28, 2018 Olga Pearce (Lawrence Livermore National Laboratory) http://people.llnl.gov/olga
More informationA Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures
A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationImproving Packet Processing Performance of a Memory- Bounded Application
Improving Packet Processing Performance of a Memory- Bounded Application Jörn Schumacher CERN / University of Paderborn, Germany jorn.schumacher@cern.ch On behalf of the ATLAS FELIX Developer Team LHCb
More informationMain Memory Supporting Caches
Main Memory Supporting Caches Use DRAMs for main memory Fixed width (e.g., 1 word) Connected by fixed-width clocked bus Bus clock is typically slower than CPU clock Cache Issues 1 Example cache block read
More informationDynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle
Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationVector Engine Processor of SX-Aurora TSUBASA
Vector Engine Processor of SX-Aurora TSUBASA Shintaro Momose, Ph.D., NEC Deutschland GmbH 9 th October, 2018 WSSP 1 NEC Corporation 2018 Contents 1) Introduction 2) VE Processor Architecture 3) Performance
More informationDouble Rewards of Porting Scientific Applications to the Intel MIC Architecture
Double Rewards of Porting Scientific Applications to the Intel MIC Architecture Troy A. Porter Hansen Experimental Physics Laboratory and Kavli Institute for Particle Astrophysics and Cosmology Stanford
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationCOSC 6374 Parallel Computation. Performance Modeling and 2 nd Homework. Edgar Gabriel. Spring Motivation
COSC 6374 Parallel Computation Performance Modeling and 2 nd Homework Spring 2007 Motivation Can we estimate the costs for a parallel code in order to Evaluate quantitative and qualitative differences
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationGROMACS Performance Benchmark and Profiling. August 2011
GROMACS Performance Benchmark and Profiling August 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox Compute resource
More information2
1 2 3 4 5 6 For more information, see http://www.intel.com/content/www/us/en/processors/core/core-processorfamily.html 7 8 The logic for identifying issues on Intel Microarchitecture Codename Ivy Bridge
More informationMulticore Performance and Tools. Part 1: Topology, affinity, clock speed
Multicore Performance and Tools Part 1: Topology, affinity, clock speed Tools for Node-level Performance Engineering Gather Node Information hwloc, likwid-topology, likwid-powermeter Affinity control and
More informationComposite Metrics for System Throughput in HPC
Composite Metrics for System Throughput in HPC John D. McCalpin, Ph.D. IBM Corporation Austin, TX SuperComputing 2003 Phoenix, AZ November 18, 2003 Overview The HPC Challenge Benchmark was announced last
More information