Purity: An Integrated, Fine-Grain, Data-Centric Communication Profiler for the Chapel Language
Purity: An Integrated, Fine-Grain, Data-Centric Communication Profiler for the Chapel Language
Richard B. Johnson and Jeffrey K. Hollingsworth
Department of Computer Science, University of Maryland, College Park
CHIUW, 05/25/2018
Motivation and Concept

Our goals for developing a profiling system:
- Analyze memory and communication access patterns
- Support multi-node PGAS environments
- Integrate the profiler into the Chapel framework

General approach:
1. Identify and instrument remotely accessible variables in user modules through the compiler
2. Map the memory addresses referenced by remote operations at runtime to variable definitions
3. Perform an analysis over the runtime execution
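To make the first step concrete, here is a minimal Chapel sketch (ours, not from the slides) of the kind of remotely accessible variables such a profiler must identify: module-level variables and distributed-array elements can be read or written by code running on any locale, so each such access is a candidate remote get or put.

    use BlockDist;

    var counter: int;                   // module-level variable: remotely accessible

    const Space = {1..8};
    const D = Space dmapped Block(boundingBox=Space);  // block-distributed domain
    var A: [D] real;                                   // elements spread across locales

    on Locales[numLocales-1] {
      counter += 1;   // read-modify-write of Locale 0's memory: a remote get and put
      A[1] = 3.14;    // A[1] is owned by Locale 0, so this is a remote put
    }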
Challenge and Direction

1. Remote accesses cannot be identified until a late compiler pass
   - Placeholders and late-stage instrumentation are required
2. Each node has only a partial view of the PGAS
   - Unable to map remote addresses to variable definitions

Solution: resolve addresses in a post-analysis setting
- Unified view of the PGAS in an offline analysis
- Communication and memory operations need to be recorded
- The memory states of the nodes need to be synchronized
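As a hypothetical illustration of what gets recorded (the names below are invented, not Purity's actual log format), each node's event log can be thought of as a sequence of records that the offline simulator replays in a synchronized order, rebuilding a unified view of the PGAS so that raw remote addresses can be resolved against variable definitions.

    enum EventKind { alloc, free, scopeBegin, scopeEnd, get, put, syncPoint }

    record Event {
      var kind: EventKind;
      var node: int;        // locale that issued the operation
      var peer: int;        // remote locale for get/put operations
      var addr: uint;       // raw address, resolved to a definition in post analysis
      var size: uint;       // number of bytes touched
      var seq: uint;        // ordering token used to synchronize the node logs
    }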
Design Overview

[Figure: system design overview]
Static Analyzer

Adds five new passes to the compiler:
1. Source analysis pass
   - Loads the profile module and profile input file
   - Analyzes program ASTs after the parser
   - Acquires and stores maps and definitions
2. Instrument variable definition placeholders
3. Instrument the compiler-generated main
   - Instruments profile load, coordination, and reporting
4. Instrument constructor field placeholders
5. Instrument wide references
- Placeholder pruning and final instrumentation
- Profile propagation: persists profile data across all of the passes
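A hypothetical sketch of what the final instrumentation achieves (purityRecordGet is an invented name, and the real passes rewrite the compiler's internal AST rather than user source): each surviving wide-reference access is bracketed by a runtime call that logs the address and size touched, while placeholders that resolved to purely local accesses have been pruned away.

    proc purityRecordGet(addr: uint, size: int) {
      // stand-in for the runtime hook that appends a "get" event to this node's log
    }

    var A: [1..4] real;

    // original statement:       var x = A[2];
    // instrumented equivalent:
    purityRecordGet(0: uint, numBytes(real));  // 0 stands in for A[2]'s runtime address
    var x = A[2];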
Dynamic Analyzer

Dynamic analysis:
- Monitors memory and communication operations
- Embedded in the Chapel runtime framework
- Executed through the program's instrumentation
- Configurable through environment variables

Tracking memory:
- Heap operations for variables: Chapel allocations, reallocations, and free operations
- Stack variables: associates address and size with the beginning and end of scope
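A minimal sketch (assumed, not from the slides) of the two lifetimes being tracked: heap storage behind a Chapel array, followed from allocation to free, and a stack variable whose address and size are tied to scope entry and exit.

    var G: [1..1024] real;   // array data: a Chapel heap allocation, tracked alloc-to-free

    proc work() {
      var x: int = 42;       // stack variable: recorded at the beginning of its scope...
      writeln(x);
    }                        // ...and retired at the end of its scope

    work();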
Dynamic Analyzer

Tracking communications:
- Monitors local and remote get and put operations
- Remote prefetch and cache operations

Pulse sampling:
- Samples data at periodic intervals (on/off)
- Scalable: reduces runtime overhead and event-log size
- Support: SIGALRM / SIGPROF / PAPI

Synchronization:
- Coordinates events in post analysis
- Support: GASNet; in progress: UGNI
Dynamic Analyzer: Output

Online reporting:
- High-level report of local and remote operations
- Aggregates over all nodes and the entire execution

Event recorder:
- Each node produces its own event log
- Used in post analysis

What is recorded? Synchronizations, memory operations, and communications.
Development: Post-Analysis Simulation

Process event logs into a unified timeline:
- Synchronization: controls progression of the simulation
- Memory operations: update memory environments
- Communications: addresses mapped to definitions
- Result: a unified view of the PGAS

Analyzer (ranked results):
- Variable-Cluster Overview [figure: remote operations by variable — Zcyc, Zblk, Twiddles, Unknown, CycDom, remaining]
- Variable-Loop Analysis [figure: remote operations by source line in fft.chpl — gets and puts of Twiddles, Zblk, Zcyc]
- Variable-Node and Index [table: accesses to Zcyc[ ] by destination/source node, n0-n3]
Evaluation

Deepthought2 HPC cluster:
- Housed at the University of Maryland
- 444 nodes; 20 cores and 128 GB of memory per node
- FDR InfiniBand interconnect with 56 Gb/s maximum throughput

Chapel configuration:
- GASNet with the IBV substrate, using MPI as the spawner
- Memory segment: everything (all memory is remotely accessible)

Evaluated on the SSCA#2 benchmark.
SSCA#2: Evaluation

Description:
- Series of approximate betweenness centrality (BC) calculations
- Data structure: weighted, directed multigraph
- Different input sizes: sets of starting vertices

Evaluation:
1. Analyzed overhead and scalability with pulse sampling
2. Evaluated the impact of network caching on remote operations
3. Performed loop analysis and node-level aggregation on remote operations to identify PGAS bottlenecks
SSCA#2: Overhead and Scalability

Tracked all 421 variables. (Deepthought2, 4 nodes, 20 threads, scale=8, low scale=5, high scale=8)
- Online reporting only: +16% of wall time
- With event-log generation at a 1% sample rate:
  A. Tracking stack and heap: +95% of wall time, 1.65 GB for event logs
  B. Tracking heap only: +62% of wall time, 14.3 MB for event logs

[Figure: pulse-sample scaling — slowdown and event-log size (GB) across sample rates (10%, 5%, 1%) for configurations A and B]
SSCA#2: Network Caching

[Figure: online report breaking remote operations into prefetch, cached, remote put, remote get, local put, and local get for the 4N 20T and 8N 10T configurations]
(Deepthought2, N nodes, T threads, scale=8, low scale=5, high scale=8)

Percent remote   4N 20T   8N 10T
Normal           5.05%    4.61%
Cached           4.96%    4.27%
Remote cached    20.6%    22.2%
SSCA#2: Analysis (Remote Operations)

Loop analysis (5% sample rate, 98.44% coverage; Deepthought2, 4 nodes, 20 threads, scale=8, low scale=5, high scale=8):
- 80.99% of all remote operations come from SSCA2_kernels.chpl, 582 (get: TPVM.TPV); the remaining operations account for 19.01%
- The cost is acquiring this.tpv inside gettid(); TPVM.TPV exists only on the primary node

Approximate Betweenness Centrality, SSCA2_kernels.chpl:
    279: forall s in starting_vertices do on vertex_domain.dist.idxToLocale(s) {
           const tid = TPVM.gettid();
    287:   const tpv = TPVM.getTPV(tid);

[Table: remote requests for TPVM.TPV by sender/receiver node, n0-n3]
SSCA#2: Optimization

[Figure: online report of remote gets and puts (millions of operations), original vs. optimized, for the 4N 20T and 8N 10T configurations]
(Deepthought2, N nodes, T threads, scale=8, low scale=5, high scale=8)

Original, SSCA2_kernel.chpl:
    576: class TPVManager {
    579:   proc gettid() {
    580:     const tid = this.currtpv.fetchAdd(1) % num
    581:     on this.tpv[tid] do
    582:       while this.tpv[tid].used.testAndSet() do chpl_task_yield();
    583:     return tid;

Optimized, SSCA2_kernel.chpl:
    576: class TPVManager {
    579:   proc gettid() {
    580:     const tid = this.currtpv.fetchAdd(1) % num
    581:     on this.tpv[tid] do {
    582:       const t = this.tpv[tid];
    583:       while t.used.testAndSet() do chpl_task_yield();
    584:     }
    585:     return tid;

Config   Opt % remote   Reduction
4N 20T   0.47%          88.45%
8N 10T   1.19%          56.87%
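The fix generalizes: inside an on block, every evaluation of this.tpv[tid] re-fetches a reference that lives only on the primary node, so the spin loop pays a remote get on each iteration; hoisting the reference into a local const pays that cost once. A standalone sketch of the same pattern (class and variable names invented here):

    class Slot { var used: atomic bool; }

    var slots = [i in 0..3] new unmanaged Slot();  // array of references held on Locale 0

    on Locales[numLocales-1] {
      const i = 2;
      // before: while slots[i].used.testAndSet() do chpl_task_yield();
      //   every spin re-fetches slots[i] from Locale 0 -- one remote get per iteration
      const t = slots[i];            // fetch the class reference once
      while t.used.testAndSet() do   // atomics still target the object's locale,
        chpl_task_yield();           // but the repeated gets of slots[i] are gone
      t.used.clear();
    }

    for s in slots do delete s;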
Conclusion

We developed a data-centric communication profiler that:
- Analyzes memory and communication access patterns
- Supports multi-node PGAS environments
- Is integrated into the Chapel framework and configurable
- Handles loop hierarchies and complex data structures
- Produces online reporting and scalable, fine-grain profiling

Demonstrated with the SSCA#2 benchmark:
- Identified real PGAS bottlenecks
- Improves the developer's understanding of where PGAS dependencies may exist and how task-data locality behaves in their program

Improvements due to Purity insights:
- Up to 88% of remote operations were eliminated
- Up to 1.24x speedup of the program wall time
Bonus: Loop Analysis Example (HPCC-FFT)

Remote operations by loop (Deepthought2, 4 nodes, 20 threads, n=16):
- 33.97% remote: Zblk permutations; 3.53% remote: Twiddles permutations
    293: proc bitReverseShuffle(Vect: [?Dom]) {
    294:   const numBits = log2(Vect.numElements),
    295:         Perm: [Dom] Vect.eltType = [i in Dom] Vect(bitReverse(i, revBits=numBits));
    296:   Vect = Perm;
- 33.25% remote: mapping Zcyc to Zblk
    117, 327: forall (b, c) in zip(Zblk, Zcyc) do
    118, 328:   b = c;
- 20.22% remote: mapping Zblk to Zcyc
    112, 324: forall (b, c) in zip(Zblk, Zcyc) do
    113, 325:   c = b;
- Zcyc, local vs. remote: 82.86% remote

[Figure: remote operations by source line in fft.chpl — gets of Twiddles, Zblk, Zcyc and puts of Zcyc, plus remaining operations]