Getting the best performance from massively parallel computers


Getting the best performance from massively parallel computers
June 6th, 2013
Takashi Aoki
Next Generation Technical Computing Unit, Fujitsu Limited

Agenda
- Second-generation petascale supercomputer PRIMEHPC FX10
- Tuning techniques for PRIMEHPC FX10

Software covered in this talk:
- Programming languages: Fortran 2003, C, C++, OpenMP 3.0 (intra-node); XPFortran *1, MPI 2.1 (inter-node)
- Compiler optimization: instruction-level (instruction scheduling, SIMDization), loop-level, automatic parallelization
- Programming tools: IDE, debugger, profiler (intra-node); RMATT *2 (inter-node)
- Math libraries: BLAS, LAPACK, SSL II (intra-node); ScaLAPACK (inter-node)

*1: extended Parallel Fortran (distributed parallel Fortran)
*2: Rank Map Automatic Tuning Tool

PRIMEHPC FX10 System Configuration
- Compute node: one SPARC64 IXfx CPU, DDR3 memory, and an ICC (Interconnect Control Chip)
- Compute nodes and I/O nodes are connected by the Tofu interconnect; I/O nodes carry local disks holding the local file system
- Management servers, portal servers, and the login server sit on an InfiniBand (IB) or Gigabit Ethernet (GbE) network
- File servers provide the global disk holding the global file system

FX10 System H/W Specifications
- CPU: SPARC64 IXfx, 16 cores/socket, 236.5 GFlops @ 1.848 GHz
- Node: 1 CPU/node; memory capacity 32 or 64 GB
- System board: 4 nodes (4 CPUs)
- Rack: 96 compute nodes + 6 I/O nodes; 22.7 TFlops/rack; optional water-cooling exhaust unit
- System (4 to 1,024 racks): 384 to 98,304 compute nodes; 90.8 TFlops to 23.2 PFlops; 12 TB to 6 PB of memory (max. 1,024 racks, max. 98,304 CPUs)

The K computer and FX10: Comparison of System H/W Specifications

                        K computer                 FX10
  CPU name              SPARC64 VIIIfx             SPARC64 IXfx
  Performance           128 GFlops @ 2 GHz         236.5 GFlops @ 1.848 GHz
  Architecture          SPARC V9 + HPC-ACE extension (both)
  L1 cache              32 KB (I) + 32 KB (D) per core (both)
  L2 cache (shared)     6 MB                       12 MB
  Cores/socket          8                          16
  Memory bandwidth      64 GB/s                    85 GB/s
  Node                  1 CPU/node (both)
  Memory capacity       16 GB                      32 or 64 GB
  Nodes/system board    4 (both)
  System boards/rack    24 (both)
  Performance/rack      12.3 TFlops                22.7 TFlops

The K computer and FX10: Comparison of System H/W Specifications (cont.)

                                 K computer              FX10
  Interconnect topology          6D mesh/torus (both)
  Link performance               5 GB/s x2 (bi-directional) (both)
  Links per node                 10 (both)
  Additional features            H/W barrier and reduction; no external switch box (both)
  Cooling: CPU, ICC, DDCON       Direct water cooling    Air cooling
  Cooling: other parts           Air cooling             Air cooling + exhaust-air water-cooling unit (optional)

Programming Environment
(Diagram.) The environment spans three tiers: the user client runs the IDE interface, the GUI debugger interface, and visualization of sampled data; the login node runs the IDE command interface, job control, the interactive debugger, a data converter, and the profiler; the FX10 compute nodes run the applications together with a data sampler that collects sampling data for the tools.

Tools for High-Performance Computing
- Hardware: a massively parallel supercomputer built on the SPARC64 IXfx CPU and the Tofu interconnect
- Software: parallel compiler, PA (Performance Analysis) information, low-jitter operating system, distributed file system

Tuning Techniques for FX10
- Parallel programming style: hybrid parallel
- Scalar tuning
- Parallel tuning: false sharing, load imbalance

Parallel Programming for Busy Researchers
- Large-scale systems require a large degree of parallelism, but a large number of processes costs memory and overhead
- Hybrid thread-process programming reduces the number of processes, yet is troublesome for programmers
- Even for multi-threading, the coarser the grain the better: procedure-level or outer-loop parallelism is desired, but applications offer little opportunity for such coarse-grain parallelism
- System support for fine-grain parallelism is therefore required
- VISIMPACT solves these problems

Hybrid Parallel vs. Flat MPI
- Hybrid parallel: MPI parallelism between CPUs, thread parallelism inside the CPU (between cores); e.g., 4 processes x 4 threads on a 16-core node
- Flat MPI: MPI parallelism between cores; 16 processes x 1 thread on a 16-core node

VISIMPACT (Virtual Single Processor by Integrated Multi-core Parallel Architecture) is the mechanism that treats multiple cores as one CPU through automatic parallelization:
- hardware mechanisms that support hybrid parallelism (hardware barrier, shared L2 cache)
- software tools that realize hybrid parallelism automatically (automatic threading)

Merits and Drawbacks of Flat MPI and Hybrid Parallel

Flat MPI
- Merit: program portability (no second level of parallelization needed)
- Drawbacks: needs more memory (per-process communication buffers, large-page fragmentation); MPI message-passing time increases

Hybrid parallel
- Merits: reduced memory usage; better performance (fewer processes means less MPI message-transfer time, and thread performance can be improved by VISIMPACT)
- Drawback: needs two-level parallelization

On a node, flat MPI keeps one data area and one set of communication buffers per process, while hybrid parallel keeps one process whose data area is divided among threads, with a single set of communication buffers.
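To make the two-level structure concrete, here is a minimal hybrid MPI+OpenMP sketch (not from the talk; names such as n and local_sum are illustrative): one MPI process per node performs the inter-node reduction, while OpenMP threads share the intra-node loop.

  program hybrid_sketch
    use mpi
    implicit none
    integer, parameter :: n = 1000000
    integer :: ierr, rank, provided, i
    real*8 :: a(n), local_sum, global_sum

    ! One MPI process per CPU (node); threads stay inside the process.
    call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    a = 1.0d0
    local_sum = 0.0d0

    ! Intra-node: thread parallelism between cores.
  !$omp parallel do reduction(+:local_sum)
    do i = 1, n
       local_sum = local_sum + a(i)
    end do
  !$omp end parallel do

    ! Inter-node: process parallelism over MPI.
    call MPI_Reduce(local_sum, global_sum, 1, MPI_REAL8, MPI_SUM, 0, &
                    MPI_COMM_WORLD, ierr)
    if (rank == 0) print *, 'sum =', global_sum
    call MPI_Finalize(ierr)
  end program hybrid_sketch

Launched with one process per node and OMP_NUM_THREADS=16 this gives the 1 process x 16 threads combination; four processes per node with OMP_NUM_THREADS=4 gives 4 x 4.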

Characteristics of Hybrid Parallel
The optimum numbers of processes and threads depend on application characteristics: thread-parallel scalability, MPI communication ratio, and process load imbalance are all involved. Conceptually, sweeping the combinations from 16 processes x 1 thread to 1 process x 16 threads:
- Application A, which is mostly processed in parallel, has a lot of communication, a big load imbalance, and a high L2 cache-reuse rate, gets faster with more threads
- Application B, which is mostly processed serially, gets faster with more processes
- Application C lies between A and B

Impact of Hybrid Parallel: Examples 1 & 2
(Charts: time ratio, normalized to the 16-thread case, and memory per node, for the process/thread combinations 1536p1t through 96p16t.)
- Application A (MPI + automatic parallelization; meteorology): flat MPI (1536 processes x 1 thread) does not run because of the memory limit; the granularity of a thread is small
- Application B (MPI + automatic parallelization + OpenMP; meteorology): flat MPI likewise does not run because of the memory limit; communication cost is 15%, and 72% of the processing is thread-parallel

Impact of Hybrid Parallel: Examples 3 & 4
(Charts: time ratio and memory per node, for 1536p1t-96p16t and 4096p1t-256p16t.)
- Application C (MPI + OpenMP; meteorology): load imbalance is eased by hybrid parallel; communication cost is 15%, and 87% of the processing is thread-parallel
- Application D (MPI + automatic parallelization; NPB CG benchmark): load imbalance is eased by hybrid parallel; communication cost is 20-30%, and 92% of the processing is thread-parallel

Application Tuning Cycle and Tools
Tuning cycles between execution and three phases: MPI tuning, CPU tuning, and overall tuning, each fed by job information.
- FX10-specific tools: profiler, profiler snapshot, RMATT, Tofu-PA
- Open-source tools: Vampir-trace, PAPI

PA (Performance Analysis) Reports
PA reports cover elapsed time, calculation speed (Flops), cache/memory access statistics, instruction counts, load balance, and cycle accounting. The reported indicators are: elapsed time, calculation speed/efficiency, memory/cache throughput, SIMDization ratio, cache miss count/rate, TLB miss count/rate, instruction count/rate, load balance, and cycle accounting. Cycle-accounting data pinpoints the performance bottleneck and supports systematic performance tuning.

Cycle Accounting
Cycle accounting is a technique for analyzing performance bottlenecks. The SPARC64 IXfx can measure many performance-analysis events; the measured execution time is broken down by instruction-commit count per cycle:
- 1-4 instruction(s) commit: time spent executing that many instructions in one machine cycle
- 0 instructions commit: stall time, attributed to a cause such as integer register write wait, memory access wait, cache access wait, floating-point execution wait, store wait, or instruction fetch wait
The stacked-bar overview thus separates useful commit cycles (4, 2-3, or 1 instruction(s) per cycle) from the restriction factors that hold commits back.

PA Information Usage
Understanding the bottleneck: the bottleneck of an interval of interest (excluding input, output, and communication) can be estimated from the PA information of the overall interval. (Chart: the overall interval's bar decomposes into commit classes plus FP operation wait and FP cache load wait, labeled bottleneck 1 and bottleneck 2.)
Using it for tuning: by breaking the program down to loop level and capturing PA information per loop, you can find out what to do about each bottleneck and how much performance stands to be gained.

Hot Spot Breakdown
Breaking the whole measured interval into hot-spot intervals exposes the bottleneck of each. (PA graphs, before tuning: the overall interval is dominated by FP operation wait and FP cache load wait; hot spot 1 by FP operation wait, hot spot 2 by FP memory load wait, hot spot 3 by instruction commits, and hot spot 4 by 1-instruction commits and FP operation wait.)

Analysis and Diagnosis (Hot spot 1: if-statement in a loop)
Phenomenon: long FP operation wait (PA graph: 4 commits, 1 commit, FP operation wait).
Source list:

  151  1          !$omp do
                  <<< Loop-information Start >>>
                  <<< [OPTIMIZATION]
                  <<<   PREFETCH : 24
                  <<<     a: 12, b: 12
                  <<< Loop-information End >>>
  152  2  p  6s    do i=1,n1
  153  3  p  6m      if (p(i) > 0.0) then
  154  3  p  6s        b(i) = c0 + a(i)*(c1 + a(i)*(c2 + a(i)*(c3 + a(i)*
  155  3        &        (c4 + a(i)*(c5 + a(i)*(c6 + a(i)*(c7 + a(i)*
  156  3        &        (c8 + a(i)*c9))))))))
  157  3  p  6v      endif
  158  2  p  6v    enddo
  159  1          !$omp enddo

Analysis: the if-statement inside the inner loop prevents software pipelining and SIMDization.
Diagnosis: tuning required; eliminate the if-statement from the inner loop.

Analysis and Diagnosis (Hot spot 2: stride access)
Phenomenon: long memory access wait (PA graph: FP memory load wait).
Source list:

  176  1          !$omp do
  177  2  p        do j=1,n2
                  <<< Loop-information Start >>>
                  <<< [OPTIMIZATION]
                  <<<   SIMD
                  <<<   SOFTWARE PIPELINING
                  <<< Loop-information End >>>
  178  3  p  6v      do i=1,n1
  179  3  p  6v        b(i,j) = c0 + a(j,i)*(c1 + a(j,i)*(c2 + a(j,i)*(c3 + a(j,i)*
  180  3        &        (c4 + a(j,i)*(c5 + a(j,i)*(c6 + a(j,i)*(c7 + a(j,i)*
  181  3        &        (c8 + a(j,i)*c9))))))))
  182  3  p  6v      enddo
  183  2  p        enddo
  184  1          !$omp enddo

Analysis: array b is accessed sequentially but array a is accessed with a stride, which lowers the cache utilization rate (L1D miss rate 53.01%, L2 miss rate 53.04%).
Diagnosis: tuning required; improve the cache utilization rate of array a.

Analysis and Diagnosis (Hot spot 3: ideal operation)
Phenomenon: most of the time is instruction commit, largely cycles that commit multiple instructions (PA graph: 4 commits, 2/3 commits, 1 commit).
Source list:

  201  1          !$omp do
                  <<< Loop-information Start >>>
                  <<< [OPTIMIZATION]
                  <<<   SIMD
                  <<<   SOFTWARE PIPELINING
                  <<< Loop-information End >>>
  202  2  p  6v    do i=1,n1
  203  2  p  6v      b(i) = c0 + a(i)*(c1 + a(i)*(c2 + a(i)*(c3 + a(i)*
  204  2        &      (c4 + a(i)*(c5 + a(i)*(c6 + a(i)*(c7 + a(i)*
  205  2        &      (c8 + a(i)*c9))))))))
  206  2  p  6v    enddo
  207  1          !$omp enddo

Analysis: both the SIMDization ratio (97.25%) and the SIMD multiply-add instruction ratio (79.53%) are high.
Diagnosis: tuning NOT required; instruction-level parallelization is already thorough.

Analysis and Diagnosis (Hot spot 4: data dependency)
Phenomenon: long FP operation wait (PA graph: 1 commit, FP operation wait).
Source list:

  224  1          !$omp do
  225  2  p  6s    do i=2,n1
  226  2  p  6s      a(i) = c0 + a(i-1)*(c1 + a(i-1)*(c2 + a(i-1)*(c3 + a(i-1)*
  227  2        &      (c4 + a(i-1)*(c5 + a(i-1)*(c6 + a(i-1)*(c7 + a(i-1)*
  228  2        &      (c8 + a(i-1)*c9))))))))
  229  2  p  6s    enddo
  230  1          !$omp enddo

Analysis: the statement a(i) = f(a(i-1)) creates a data dependency between iterations, which prevents software pipelining and SIMDization.
Diagnosis: no way to tune as written; the algorithm itself must be changed.

Tuning Result (Hot spot 1: if-statement in a loop)
Tuning: use mask instructions instead of the if-statement. Execution time falls to 18% of the original.

                                          Before tuning   After tuning
  Execution time (sec)                    3.467           0.631
  FP op. peak ratio                       9.90%           60.11%
  SIMD inst. ratio (/all inst.)           0.00%           87.79%
  SIMD inst. ratio (/SIMDizable inst.)    0.00%           99.98%
  Number of inst.                         9.46E+10        3.79E+10
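The slide does not show the tuned source. One source-level form that lets the compiler generate masked, branch-free code, sketched here under that assumption with illustrative coefficient values, is to evaluate the polynomial unconditionally and select the result with MERGE:

  subroutine poly_masked(b, a, p, n1)
    implicit none
    integer, intent(in) :: n1
    real, intent(in)    :: a(n1), p(n1)
    real, intent(inout) :: b(n1)
    real, parameter :: c0=1.0, c1=1.0, c2=0.5, c3=0.1, c4=0.05, &
                       c5=0.01, c6=0.005, c7=0.001, c8=0.0005, c9=0.0001
    real    :: t
    integer :: i
  !$omp parallel do private(t)
    do i = 1, n1
       ! Evaluate unconditionally; the branch is gone from the loop body,
       ! so software pipelining and SIMDization become possible.
       t = c0 + a(i)*(c1 + a(i)*(c2 + a(i)*(c3 + a(i)* &
           (c4 + a(i)*(c5 + a(i)*(c6 + a(i)*(c7 + a(i)* &
           (c8 + a(i)*c9))))))))
       ! MERGE keeps the old b(i) where the condition is false,
       ! matching the semantics of the original if-statement.
       b(i) = merge(t, b(i), p(i) > 0.0)
    end do
  !$omp end parallel do
  end subroutine poly_masked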

Tuning Result (Hot spot 2: stride access)
Tuning: apply loop blocking, dividing the loops into blocks that fit the cache. Execution time falls to 23% of the original.

  Before:                 After:
  do j=1, n2              do jj=1, n2, 16
    do i=1, n1              do ii=1, n1, 96
      ...                     do j=jj, min(jj+16-1, n2)
                                do i=ii, min(ii+96-1, n1)
                                  ...

                         Before tuning   After tuning
  Execution time (sec)   2.874           0.658
  FP op. peak ratio      3.07%           13.30%
  L1D miss ratio         53.01%          6.69%
  L2 miss ratio          53.04%          4.29%
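Filled out into a self-contained sketch (block sizes 16 and 96 as on the slide; the polynomial body is abbreviated to one multiply-add and the array shapes are assumed), the blocked loop nest looks like this:

  subroutine poly_blocked(b, a, n1, n2)
    implicit none
    integer, intent(in) :: n1, n2
    real, intent(in)    :: a(n2, n1)   ! accessed transposed as a(j,i)
    real, intent(out)   :: b(n1, n2)
    integer :: i, j, ii, jj
    do jj = 1, n2, 16
       do ii = 1, n1, 96
          do j = jj, min(jj+16-1, n2)
             do i = ii, min(ii+96-1, n1)
                ! Within one block only ~96 cache lines of a are live, and
                ! each is reused for all 16 values of j, so the strided
                ! a(j,i) accesses now hit in L1.
                b(i,j) = 1.0 + 0.5 * a(j,i)
             end do
          end do
       end do
    end do
  end subroutine poly_blocked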

Tuning Result (Overall)
With the hot-spot tunings applied, the execution time of the overall interval falls to 39% of the original. (PA graphs of the overall interval before and after tuning: 4 commits, 2/3 commits, 1 commit, FP operation wait, FP cache load wait.)

Scalar Tuning Techniques
Choose the tuning method according to the cycle-accounting breakdown of the measured execution time:
- Execution instruction reduction: SIMDization; common subexpression elimination
- Execution wait: instruction scheduling/software pipelining; masked execution of if-statements; loop unrolling; loop fission
- Cache wait, efficient L1 cache use: padding, loop fission, array merge
- Cache wait, L2 cache latency hiding: L1 cache prefetch (stride/list access)
- Cache wait, efficient L2 cache use: loop blocking; outer-loop fusion; outer-loop unrolling
- Memory wait, memory latency hiding: L2 cache prefetch (stride/list access)

Scalar Tuning Techniques: Measured Speed-up Examples

  mask instruction                x1.49
  loop peeling                    x2.01
  explicit data dependency hint   x1.46, x2.63
  loop interchange                x3.74
  loop fusion                     x1.56
  loop fission                    x2.69
  array merge                     x2.98
  array index interchange         x2.99
  array data padding              x2.84
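As one illustration from the table, array data padding widens an array's leading dimension so that columns no longer map onto the same cache sets. The sketch below is not from the talk; the sizes and the stencil body are hypothetical:

  subroutine smooth(c, a, n, lda)
    implicit none
    ! Declaring lda = n + 8 at the call site pads each column; with a
    ! power-of-two n, unpadded columns would collide in the cache and
    ! evict each other while three of them are in flight.
    integer, intent(in) :: n, lda
    real*8, intent(in)  :: a(lda, n)
    real*8, intent(out) :: c(lda, n)
    integer :: i, j
    do j = 2, n-1
       do i = 1, n
          c(i,j) = (a(i,j-1) + a(i,j) + a(i,j+1)) / 3.0d0
       end do
    end do
  end subroutine smooth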

False Sharing
Source program (nj=4, ni=2000; run with 4 threads; the cache holds data by line size):

   1           subroutine sub(s,a,b,ni,nj)
   2             real*8 a(ni,nj),b(ni,nj)
   3             real*8 s(nj)
   4
   5  1 pp       do j = 1, nj
   6  1 p          s(j)=0.0
   7  2 p 8v       do i = 1, ni
   8  2 p 8v         s(j)=s(j)+a(i,j)*b(i,j)
   9  2 p 8v       end do
  10  1 p        end do
  11
  12           end

Initial state: each of the four threads reads the same cache line, containing s(1)-s(4), into its L1 cache.
Thread 0 updates s(1): (1) cache hit; (2) thread 0 completes the update of s(1); (3) the copies of the line held by threads 1-3 are invalidated to keep the data coherent.
Thread 1 then updates s(2): (1) cache miss; (2) the line is copied back from thread 0 through the L2 cache; (3) thread 1 completes the update of s(2); (4) thread 0's copy is invalidated to keep the data coherent.
Performance degrades as every thread keeps repeating this state transition.

False-Sharing Outcome (before tuning)
Because the parallelized index j runs only from 1 to 8 (too small a range), all threads share the same cache lines of array a, causing false sharing. The PA data shows a lot of data-access wait (barrier wait, FP cache load wait, and store wait); the L1D miss ratio before tuning is 29.53%.
Source code before tuning:

  20             subroutine sub()
  21             integer*8 i,j,n
  22             parameter(n=30000)
  23             parameter(m=8)
  24             real*8 a(m,n),b(n,m)
  25             common /com/a,b
  26
                 <<< Loop-information Start >>>
                 <<< [PARALLELIZATION]
                 <<<   Standard iteration count: 2
                 <<< Loop-information End >>>
  27  1 pp       do j=1,m            ! the parallelized index runs over only 8 values
                 <<< Loop-information Start >>>
                 <<< [OPTIMIZATION]
                 <<<   SIMD
                 <<<   SOFTWARE PIPELINING
                 <<< Loop-information End >>>
  28  2 p 8v       do i=1,n
  29  2 p 8v         a(j,i)=b(i,j)   ! false sharing occurs here
  30  2 p 8v       enddo
  31  1 p        enddo
  32
  33             end

False Sharing Tuned
Loop interchange avoids the false sharing: parallelizing the second index i means each thread writes disjoint cache lines of a. This reduces L1 cache misses and the data-access wait; the L1D miss ratio falls from 29.53% to 7.89%, and the barrier wait, FP cache load wait, and store wait shrink (the slide annotates the improvement as 27%).
Source code after tuning (source-level tuning):

  20             subroutine sub()
  21             integer*8 i,j,n
  22             parameter(n=30000)
  23             parameter(m=8)
  24             real*8 a(m,n),b(n,m)
  25             common /com/a,b
  26
                 <<< Loop-information Start >>>
                 <<< [PARALLELIZATION]
                 <<<   Standard iteration count: 2
                 <<< [OPTIMIZATION]
                 <<<   PREFETCH : 4
                 <<<     b: 2, a: 2
                 <<< Loop-information End >>>
  27  1 pp       do i=1,n            ! parallelize the second index by loop interchange
                 <<< Loop-information Start >>>
                 <<< [OPTIMIZATION]
                 <<<   SIMD
                 <<<   SOFTWARE PIPELINING
                 <<< Loop-information End >>>
  28  2 p 8v       do j=1,m          ! false sharing avoided
  29  2 p 8v         a(j,i)=b(i,j)
  30  2 p 8v       enddo
  31  1 p        enddo
  32
  33             end
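For the reduction array s(nj) in the first false-sharing example, where loop interchange does not apply, a common alternative (a sketch, not from the talk, assuming a 128-byte cache line) is to pad each element onto its own line so that no two threads ever write the same line:

  subroutine sub_padded(s, a, b, ni, nj)
    implicit none
    integer, intent(in) :: ni, nj
    integer, parameter  :: line = 16        ! 128-byte line / 8-byte real*8
    real*8, intent(in)  :: a(ni,nj), b(ni,nj)
    real*8, intent(out) :: s(line, nj)      ! only s(1,j) is meaningful
    integer :: i, j
  !$omp parallel do private(i)
    do j = 1, nj
       s(1,j) = 0.0d0
       do i = 1, ni
          ! Each thread updates a distinct cache line, so updates trigger
          ! no invalidation traffic in the other threads' L1 caches.
          s(1,j) = s(1,j) + a(i,j)*b(i,j)
       end do
    end do
  end subroutine sub_padded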

Triangular Loop
A triangular loop is a loop whose inner-loop initial index is driven by the outer-loop index variable. If you divide such a loop into blocks and run it in parallel, you get a load imbalance: with a blocked schedule, thread 0 receives the largest amount of work and thread 15 the smallest.
Example:

  subroutine sub()
  integer*8 i,j,n
  parameter(n=512)
  real*8 a(n+1,n),b(n+1,n),c(n+1,n)
  common a,b,c
  !$omp parallel do
  do j=1,n
    do i=j,n            ! the initial value of the inner index is set by j
      a(i,j)=b(i,j)+c(i,j)
    enddo
  enddo
  end

(Chart: barrier-synchronization wait per thread, T0-T15, showing the imbalance.)

Triangular Loop: Load-Imbalance Tuning
Adding the OpenMP clause schedule(static,1) makes the chunk size one iteration and assigns iterations to the threads cyclically (T0, T1, ..., T15, T0, T1, ...). Each thread then receives almost the same amount of work, which reduces the load imbalance (chart: the per-thread barrier-synchronization wait flattens).
Modified code:

  28             subroutine sub()
  29             integer*8 i,j,n
  30             parameter(n=512)
  31             real*8 a(n+1,n),b(n+1,n),c(n+1,n)
  32             common a,b,c
  33
  34            !$omp parallel do schedule(static,1)
  35  1 p       do j=1,n
  36  2 p 8v      do i=j,n
  37  2 p 8v        a(i,j)=b(i,j)+c(i,j)
  38  2 p 8v      enddo
  39  1 p       enddo
  40
  41             end

Loop with an if-Statement
When the amount of work differs between iterations, say because the loop contains an if-statement, the load imbalance can NOT be resolved by a static cyclic schedule. In this example only even values of j execute the then clause, so under schedule(static,1) only half of the threads do any work (the slide notes that only the odd threads execute the then clause; chart, before tuning: the barrier-synchronization wait alternates between busy and idle threads).
Example:

   1           subroutine sub(a,b,s,n,m)
   2             real a(n),b(n),s
   3            !$omp parallel do schedule(static,1)
   4  1 p        do j=1,n
   5  2 p          if( mod(j,2).eq. 0 ) then
   6  3 p 8v         do i=1,m
   7  3 p 8v           a(i) = a(i)*b(i)*s
   8  3 p 8v         enddo
   9  2 p          endif
  10  1 p        enddo
  11           end subroutine sub
       :
  21           program main
  22             parameter(n=1000000)
  23             parameter(m=100000)
  24             real a(n),b(n)
  25             call init(a,b,n)
  26             call sub(a,b,2.0,n,m)
  27           end program main

Using Dynamic Scheduling to Reduce Load Imbalance
Changing the schedule to dynamic lets a thread that finishes its iteration early pick up the next one, which reduces the load imbalance (charts: per-thread barrier wait before and after tuning).
Modified code:

   1           subroutine sub(a,b,s,n,m)
   2             real a(n),b(n),s
   3            !$omp parallel do schedule(dynamic,1)
   4  1 p        do j=1,n
   5  2 p          if( mod(j,2).eq. 0 ) then
   6  3 p 8v         do i=1,m
   7  3 p 8v           a(i) = a(i)*b(i)*s
   8  3 p 8v         enddo
   9  2 p          endif
  10  1 p        enddo
  11           end subroutine sub
       :
  21           program main
  22             parameter(n=1000000)
  23             parameter(m=100000)
  24             real a(n),b(n)
  25             call init(a,b,n)
  26             call sub(a,b,2.0,n,m)
  27           end program main

Choosing the Right Loop to Parallelize
If the iteration count of the parallelized loop is too small, a load imbalance occurs. In the example below (l=2, m=256, n=256) the parallelized outer loop runs only twice, so most threads get no work (chart, before tuning: barrier-synchronization wait per thread, T0-T15).
Example:

  34  1 pp      do k=1,l
  35  2 p         do j=1,m
  36  3 p 8v        do i=1,n
  37  3 p 8v          a(i,j,k)=b(i,j,k)+c(i,j,k)
  38  3 p 8v        enddo
  39  2 p         enddo
  40  1 p       enddo

Modified code: force the k loop to run serially and parallelize the j loop (count 256) instead (chart, after tuning: the barrier wait levels out):

  33            !ocl serial
  34  1         do k=1,l
  35  1         !ocl parallel
  36  2 pp        do j=1,m
  37  3 p 8v        do i=1,n
  38  3 p 8v          a(i,j,k)=b(i,j,k)+c(i,j,k)
  39  3 p 8v        enddo
  40  2 p         enddo
  41  1         enddo

Compiler Option to Choose the Right Iteration
With the compiler option -Kdynamic_iteration, an appropriate loop is chosen for parallel execution at run time and the load imbalance is reduced. The runtime first tries to execute the outer loop in parallel, but its iteration count (k, which is 2) is too small, so the inner loop, whose count is 256, is executed in parallel instead (l=2, m=256, n=256; chart: the per-thread barrier-synchronization wait stays balanced).
Example:

                 <<< Loop-information Start >>>
                 <<< [PARALLELIZATION]
                 <<<   Standard iteration count: 2
                 <<< Loop-information End >>>
  34  1 pp      do k=1,l
                 <<< Loop-information Start >>>
                 <<< [PARALLELIZATION]
                 <<<   Standard iteration count: 4
                 <<< Loop-information End >>>
  35  2 pp        do j=1,m
                 <<< Loop-information Start >>>
                 <<< [PARALLELIZATION]
                 <<<   Standard iteration count: 728
                 <<< [OPTIMIZATION]
                 <<<   SIMD
                 <<<   SOFTWARE PIPELINING
                 <<< Loop-information End >>>
  36  3 pp 8v       do i=1,n
  37  3 p  8v         a(i,j,k)=b(i,j,k)+c(i,j,k)
  38  3 p  8v       enddo
  39  2 p         enddo
  40  1 p       enddo
  41
  42            end

OS Jitter Problem for Parallel Processing
Because of synchronization, each computation phase is prolonged to the duration of the slowest process. Even if every process gets exactly the same amount of work, the OS interferes with the application and the times vary.
- Ideal: both processes run job, sync, job, sync, job, sync back to back
- Reality: noise prolongs individual jobs, so at every synchronization point the other process waits
OS tuning and hybrid parallelism can reduce OS jitter.

OS Jitter Measured
OS jitter (noise) was measured with FWQ, a program developed by Lawrence Livermore National Laboratory (https://asc.llnl.gov/sequoia/benchmarks/):

  t_fwq -w 18 -n 20000 -t 16
  (-w: workload, -n: repeat count, -t: number of threads)

                               PRIMEHPC FX10   x86 PC cluster
  Mean noise ratio             0.589E-04       0.154E-01
  Longest noise length (usec)  29.3            644.0
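The idea behind FWQ (fixed work quanta) is simple enough to sketch: time the same fixed workload many times; any excess over the fastest repetition is OS noise. A minimal single-threaded Fortran sketch (not the LLNL code; the workload and the statistics below are illustrative):

  program fwq_sketch
    implicit none
    integer, parameter :: reps = 20000, work = 2**18
    integer(8) :: t0, t1, rate
    integer    :: i, k
    real*8     :: dt(reps), x, y

    y = 0.0d0
    do k = 1, reps
       call system_clock(t0, rate)
       x = 0.0d0
       do i = 1, work                ! the fixed work quantum
          x = x + 1.0d-9 * i
       end do
       call system_clock(t1)
       dt(k) = real(t1 - t0, 8) / real(rate, 8)
       y = y + x                     ! keep x live so the loop isn't removed
    end do

    print *, 'checksum           :', y
    print *, 'fastest quantum (s):', minval(dt)
    print *, 'longest noise   (s):', maxval(dt) - minval(dt)
    print *, 'mean noise ratio   :', sum(dt - minval(dt)) / sum(dt)
  end program fwq_sketch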

Summary
- Fujitsu's supercomputer PRIMEHPC FX10, like the K computer, is built from high-performance multi-core CPUs
- A wealth of scalar tuning techniques is available to make each process faster
- We recommend programming applications in a hybrid-parallel style to get better performance
- Checking the performance-analysis (PA) information reveals the bottlenecks
- Some parallel tuning can be done with OpenMP directives
- The operating system can be an obstacle to reaching higher performance