Purity: An Integrated, Fine-Grain, Data- Centric, Communication Profiler for the Chapel Language

Size: px

Start display at page:

Download "Purity: An Integrated, Fine-Grain, Data- Centric, Communication Profiler for the Chapel Language"

Robyn Brown
5 years ago
Views:

1 Purity: An Integrated, Fine-Grain, Data- Centric, Communication Profiler for the Chapel Language Richard B. Johnson and Jeffrey K. Hollingsworth Department of Computer Science, University of Maryland, College Park CHIUW 05/25/2018

2 Motivation and Concept Our goals for developing a profiling system Analyze memory and communication access patterns Support for multi-node PGAS environments Integrate profiler into the Chapel framework General approach 1. Identify and instrument remotely accessible variables in user modules through the compiler 2. Map the memory addresses referenced by remote operations at runtime to variable definitions 3. Perform an analysis over the runtime execution 2

3 Challenge and Direction 1. Unable to identify remote accesses until late pass Placeholders and late stage instrumentation are required 2. Each node only has a partial view of the PGAS Unable to map remote addresses to variable definitions Solution: Resolve addresses in a post analysis setting Unified view of PGAS in an offline analysis Communication / memory operations need to be recorded The memory states of the nodes need to be synchronized 3

4 Design Overview 4

5 Static Analyzer Adds five new passes to compiler Source analysis pass Loads profile module and profile input file Analyzes program ASTs after parser Acquires and stores maps and definitions Instrument variable definition placeholders Instrument compiler generated main Instr. profile load, coordination, and reporting Instrument constructor field placeholders Instrument wide references Placeholder pruning and final instrumentation Profile propagation Persists profile data across all of the passes 5.1

6 Static Analyzer Adds five new passes to compiler 1. Source analysis pass Loads profile module and profile input file Analyzes program ASTs after parser Acquires and stores maps and definitions Instrument variable definition placeholders Instrument compiler generated main Instr. profile load, coordination, and reporting Instrument constructor field placeholders Instrument wide references Placeholder pruning and final instrumentation Profile propagation Persists profile data across all of the passes 5.2

7 Static Analyzer Adds five new passes to compiler 1. Source analysis pass Loads profile module and profile input file Analyzes program ASTs after parser Acquires and stores maps and definitions 2. Instrument variable definition placeholders Instrument compiler generated main Instr. profile load, coordination, and reporting Instrument constructor field placeholders Instrument wide references Placeholder pruning and final instrumentation Profile propagation Persists profile data across all of the passes 5.3

8 Static Analyzer Adds five new passes to compiler 1. Source analysis pass Loads profile module and profile input file Analyzes program ASTs after parser Acquires and stores maps and definitions 2. Instrument variable definition placeholders 3. Instrument compiler generated main Instr. profile load, coordination, and reporting Instrument constructor field placeholders Instrument wide references Placeholder pruning and final instrumentation Profile propagation Persists profile data across all of the passes 5.4

9 Static Analyzer Adds five new passes to compiler 1. Source analysis pass Loads profile module and profile input file Analyzes program ASTs after parser Acquires and stores maps and definitions 2. Instrument variable definition placeholders 3. Instrument compiler generated main Instr. profile load, coordination, and reporting 4. Instrument constructor field placeholders Instrument wide references Placeholder pruning and final instrumentation Profile propagation Persists profile data across all of the passes 5.5

10 Static Analyzer Adds five new passes to compiler 1. Source analysis pass Loads profile module and profile input file Analyzes program ASTs after parser Acquires and stores maps and definitions 2. Instrument variable definition placeholders 3. Instrument compiler generated main Instr. profile load, coordination, and reporting 4. Instrument constructor field placeholders 5. Instrument wide references Placeholder pruning and final instrumentation Profile propagation Persists profile data across all of the passes 5.6

11 Static Analyzer Adds five new passes to compiler 1. Source analysis pass Loads profile module and profile input file Analyzes program ASTs after parser Acquires and stores maps and definitions 2. Instrument variable definition placeholders 3. Instrument compiler generated main Instr. profile load, coordination, and reporting 4. Instrument constructor field placeholders 5. Instrument wide references Placeholder pruning and final instrumentation Profile propagation Persists profile data across all of the passes 5.7

12 Dynamic Analyzer Dynamic analysis Monitors memory and communication operations Embedded in the Chapel runtime framework Executed through the programs instrumentation Is configurable through environment variables Tracking memory Heap operations for variables Chapel allocations, reallocations, and free operations Stack variables Associates address and size with beginning and end of scope 6

13 Dynamic Analyzer Tracking communications Monitoring Local and remote get and put operations Remote prefetch and cache operations Pulse sampling Samples data at periodic intervals (on / off) Scalable: Reduce runtime overhead and event log size Support: SIGALRM / SIGPROF / PAPI Synchronization Coordinate events in post analysis Support: GASNet, In progress: UGNI 7

14 Dynamic Analyzer: Output Online reporting High level report of local and remote operations Aggregates over all nodes and entire execution Event recorder Each node produces its own event log Used in post analysis What is recorded? Synchronizations Memory operations Communications 8

Development: Post Analysis Simulation Process event logs into a unified timeline Synchronization: Controls progression of simulation Memory Operations: Update memory environments Communications:

15 Development: Post Analysis Simulation Process event logs into a unified timeline Synchronization: Controls progression of simulation Memory Operations: Update memory environments Communications: Addresses mapped to definitions 9.1 Unified view of the PGAS Loop Remote Analysis: Operations Remote by Operations Variable 100% Accesses to Zcyc[ ] Remaining by Node 90% Operations Remaining Dst \ Src n0 n1fft.chpl, n295 n3 80% get: CycDom Twiddles 70% n fft.chpl, Unknown % get: Zblk 50% n fft.chpl, Twiddles get: Zcyc 40% Zblk n fft.chpl, % get: Zcyc Zcyc 20% n fft.chpl, put: Zcyc 10% 0% fft.chpl, put: Zcyc

16 Development: Post Analysis Simulation Process event logs into a unified timeline Synchronization: Controls progression of simulation Memory Operations: Update memory environments Communications: Addresses mapped to definitions Unified view of the PGAS Analyzer (ranked results) Variable-Cluster Overview 9.2 Remote Operations by Variable 100% 90% Remaining 80% CycDom 70% Unknown 60% 50% Twiddles 40% Zblk 30% Zcyc 20% 10% 0%

17 Development: Post Analysis Simulation Process event logs into a unified timeline Synchronization: Controls progression of simulation Memory Operations: Update memory environments Communications: Addresses mapped to definitions Unified view of the PGAS Analyzer (ranked results) Variable-Cluster Overview Variable-Loop Analysis 9.3 Loop Remote Analysis: Operations Remote by Operations Variable 100% Accesses to Zcyc[ ] Remaining by Node 90% Operations Remaining Dst \ Src n0 n1fft.chpl, n295 n3 80% get: CycDom Twiddles 70% n fft.chpl, Unknown % get: Zblk 50% n fft.chpl, Twiddles get: Zcyc 40% Zblk n fft.chpl, % get: Zcyc Zcyc 20% n fft.chpl, put: Zcyc 10% 0% fft.chpl, put: Zcyc

18 Development: Post Analysis Simulation Process event logs into a unified timeline Synchronization: Controls progression of simulation Memory Operations: Update memory environments Communications: Addresses mapped to definitions Unified view of the PGAS Analyzer (ranked results) Variable-Cluster Overview Variable-Loop Analysis Variable-Node and Index Accesses to Zcyc[ ] by Node Dst \ Src n0 n1 n2 n3 n n n n

19 Evaluation Deepthought2 HPC cluster Housed at the University of Maryland 444 nodes, 20 cores and 128 GB memory per node FDR infiniband interconnect with 56 Gb/s max throughput Chapel configuration GASNet with IBV substrate Using MPI as the spawner Memory segment: everything All memory is remotely accessible Evaluated on SSCA#2 benchmark 10

20 SSCA#2: Evaluation Description Series of approximate betweeness centrality (BC) calculations Data structure: Weighted, directed multigraph Different input sizes: Sets of starting vertices Evaluation 1) Analyzed overhead and scalability with pulse sampling 2) Evaluated network caching impact on remote operations 3) Performed loop analysis and node level aggregation On remote operations to identify PGAS bottlenecks 11

21 Overhead SSCA#2: Overhead and Scalability Pulse Sample Scaling Tracked all 421 variables Online reporting only +16% of wall time With event log generation 2 1% sample rate % 10% 5% 1% Sample Rate A) Slow Down A) Size (GB) B) Slow Down B) Size (GB) Deepthought2, 4 nodes, 20 threads, scale=8, low scale=5, high scale=8 A. Tracking stack & heap +95% of wall time 1.65 GB for event logs B. Tracking heap only +62% of wall time 14.3 MB for event logs

22 SSCA#2: Network Caching 100% 90% 80% Online Report Remote Operations Prefetch Cached Remote Put Remote Get Local Put Local Get 70% 60% 50% 4N 20T 8N 10T 4N 20T 8N 10T 13 Deepthought2, N nodes, T threads, scale=8, low scale=5, high scale=8 Percent Remote 4N20T 8N10T Normal 5.05% 4.61% Cached 4.96% 4.27% Remote Cached Cached 20.6% 22.2%

23 Remote Operations SSCA#2: Analysis (Remote Operations) Loop Analysis 100% 80% 60% 40% 20% 19.01% 80.99% SSCA2_kernels.chpl, Approximate Betweeness Centrality 279: forall s in starting_vertices do on vertex_domain.dist.idxtolocale(s) { : const tid = TPVM.gettid(); 287: const tpv = TPVM.getTPV(tid); 80.99% of all remote operations Acquiring this.tpv inside gettid() 14 0% SSCA2_kernels.chpl, 582 get: TPVM.TPV Remaining Operations 5% sample rate, 98.44% coverage Deepthought2, 4 nodes, 20 threads, scale=8, low scale=5, high scale=8 Remote Requests for TPVM.TPV Sender Receiver n0 n1 n2 n3 n Exists only on the primary node

24 Operations (Millions) SSCA#2: Optimization Original vs. Optimized Remote Online Report 4N 20T Remote Get 8N 10T Remote Put Deepthought2, N nodes, T threads, scale=8, low scale=5, high scale=8 Original: SSCA2_kernel.chpl 576: class TPVManager { 579: proc gettid() { 580: const tid = this.currtpv.fetchadd(1) % num 581: on this.tpv[tid] do 582: while this.tpv[tid].used.testandset() do chpl_task_yield(); 583: return tid; Optimized: SSCA2_kernel.chpl 576: class TPVManager { 579: proc gettid() { 580: const tid = this.currtpv.fetchadd(1) % num 581: on this.tpv[tid] do { 582: const t = this.tpv[tid] 583: while t.used.testandset() do chpl_task_yiel 584: } 585: return tid; Config Opt % Remote Reduction 4N 20T 0.47% 88.45% 8N 10T 1.19% 56.87%

25 Conclusion We developed a data-centric, communication profiler Analyzes memory and communication access patterns Supports multi-node PGAS environments Integrated into the Chapel framework and configurable Handles loop hierarchies and complex data structures Produces online reporting and scalable, fine-grain profiling Demonstrated with SSCA#2 benchmark 16 Identified real PGAS bottlenecks Improve developer s understanding of Where PGAS dependencies may exist and How task-data locality behaves in their program Improvements due to Purity insights Up to 88% of remote operations where eliminated Up to 1.24x speedup of the program wall time

Loop Analysis Example (HPCC-FFT) Remote Operations 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Bonus Remaining Operations fft.chpl, 295 get: Twiddles fft.chpl, 295 get: Zblk fft.

Dom]) { 294: const numbits = log 2 (Vect.numElements), 295: Perm: [Dom] Vect.eltType = [i in Dom] Vect(bitReverse(i, revbits=numbits)); 296: Vect = Perm; 33.97% remote: Zblk permutations 3.

26 Loop Analysis Example (HPCC-FFT) Remote Operations 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Bonus Remaining Operations fft.chpl, 295 get: Twiddles fft.chpl, 295 get: Zblk fft.chpl, get: Zcyc fft.chpl, get: Zcyc fft.chpl, put: Zcyc fft.chpl, put: Zcyc Deepthought2, 4 nodes, 20 threads, n=16 293: proc bitreverseshuffle(vect: [?Dom]) { 294: const numbits = log 2 (Vect.numElements), 295: Perm: [Dom] Vect.eltType = [i in Dom] Vect(bitReverse(i, revbits=numbits)); 296: Vect = Perm; 33.97% remote: Zblk permutations 3.53% remote: Twiddles permutations 117, 327: forall (b, c) in zip(zblk, Zcyc) do 118, 328: b = c; 33.25% remote: Mapping Zcyc to Zblk 112, 324: forall (b, c) in zip(zblk, Zcyc) do 113, 325: c = b; 20.22% remote: Mapping Zblk to Zcyc Zcyc, local vs. remote: 82.86% remote

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,