LLVM-based Communication Optimizations for PGAS Programs
1 LLVM-based Communication Optimizations for PGAS Programs
2nd Workshop on the LLVM Compiler Infrastructure in HPC, SC15
Akihiro Hayashi (Rice University), Jisheng Zhao (Rice University), Michael Ferguson (Cray Inc.), Vivek Sarkar (Rice University)
2 A Big Picture
[Figure: PGAS languages (X10, Habanero-UPC++, ...) and target systems, connected through Communication Optimization. Photo credits: Berkeley Lab., Argonne National Lab., RIKEN AICS]
3 PGAS Languages
- High-productivity features:
  - Global-View
  - Task parallelism
  - Data Distribution
  - Synchronization
- Examples: Habanero-UPC++, X10, CAF
4 Communication is Implicit in some PGAS Programming Models
- Global Address Space
  - The compiler and runtime are responsible for performing communications across nodes
Remote Data Access in Chapel:
    var x = 1;          // on Node 0
    on Locales[1] {     // on Node 1
      ... = x;          // DATA ACCESS
    }
5 Communication is Implicit in some PGAS Programming Models (Cont'd)
Remote Data Access:
    var x = 1;          // on Node 0
    on Locales[1] {     // on Node 1
      ... = x;          // DATA ACCESS
    }
Compiler Optimization:
    var x = 1;
    on Locales[1] {
      ... = 1;
    }
OR
Runtime affinity handling:
    if (x.locale == MYLOCALE) {
      *(x.addr) = 1;
    } else {
      gasnet_get(...);
    }
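The runtime affinity check on this slide can be modeled in a few lines. The Python sketch below is an illustration only: `WidePtr`, `MY_LOCALE`, `MEMORY`, `read`, and `remote_get` are hypothetical names, and `remote_get` merely stands in for a communication call such as `gasnet_get`; it is not the Chapel runtime's implementation.

```python
from dataclasses import dataclass

MY_LOCALE = 1  # hypothetical: id of the locale executing this code

# Hypothetical stand-in for per-locale memory: locale id -> {address: value}
MEMORY = {0: {0x10: 1}, 1: {0x20: 7}}

# Counters so the two paths can be observed
stats = {"local": 0, "remote_get": 0}


@dataclass(frozen=True)
class WidePtr:
    """A possibly-remote reference: owning locale plus an address there."""
    locale: int
    addr: int


def remote_get(ptr: WidePtr) -> int:
    """Stand-in for a communication call such as gasnet_get."""
    stats["remote_get"] += 1
    return MEMORY[ptr.locale][ptr.addr]


def read(ptr: WidePtr) -> int:
    """Runtime affinity handling: check the locale, then pick a path."""
    if ptr.locale == MY_LOCALE:
        stats["local"] += 1
        return MEMORY[MY_LOCALE][ptr.addr]  # plain local load
    return remote_get(ptr)                  # communication across nodes


x = WidePtr(locale=0, addr=0x10)  # x lives on Node 0
y = WidePtr(locale=1, addr=0x20)  # y lives on Node 1 (this locale)
values = (read(x), read(y))
```

Every access pays for the locale comparison, which is why a compiler that can prove locality (or substitute the value outright) avoids both the check and the communication.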
6 Communication Optimization is Important
[Figure: latency (ms, lower is better) versus transferred bytes for Optimized (Bulk Transfer) and Unoptimized; annotated gaps of 59x and 1,500x.]
A synthetic Chapel program on Intel Xeon CPU X5660 clusters with QDR InfiniBand
7 PGAS Optimizations are language-specific
[Figure: separate compilers (Chapel Compiler, UPC Compiler, X10 Compiler, Habanero-C Compiler) for the PGAS languages (X10, Habanero-UPC++, ...). Photo credits: Berkeley Lab., Argonne National Lab., RIKEN AICS]
8 Our goal
[Figure: PGAS languages (X10, Habanero-UPC++, ...) and target systems. Photo credits: Berkeley Lab., Argonne National Lab., RIKEN AICS]
9 Why LLVM?
- Widely used language-agnostic compiler
Frontends: Clang (C/C++), dragonegg (C/C++, Fortran, Ada, Objective-C), Chapel frontend, UPC++ frontend
-> LLVM Intermediate Representation (LLVM IR): Analysis & Optimizations
-> Backends: x86 (x86 binary), PowerPC (PPC binary), ARM (ARM binary), PTX (GPU binary)
10 Summary & Contributions
- Our observations:
  - Many PGAS languages share semantically similar constructs
  - PGAS optimizations are language-specific
- Contributions:
  - Built a compilation framework that can uniformly optimize PGAS programs (initial focus: communication)
    - Enabling existing LLVM passes for communication optimizations
    - PGAS-aware communication optimizations
11 Overview of our framework
- Frontends (need to be implemented when supporting a new language/runtime):
  - Chapel Programs -> Chapel-LLVM frontend
  - UPC++ Programs -> UPC++-LLVM frontend
  - X10 Programs -> X10-LLVM frontend
  - CAF Programs -> CAF-LLVM frontend
- LLVM IR -> LLVM-based Communication Optimization passes (generally language-agnostic) -> Lowering Pass
- About the LLVM IR:
  1. Vanilla LLVM IR
  2. Uses the address space feature to express communications
12 How optimizations work
Chapel:
    // x is possibly remote
    x = 1;
UPC++:
    shared_var<int> x;
    x = 1;
Both are expressed in LLVM IR as:
    store i64 1, i64 addrspace(100)* %x, ...
Address space-aware optimizations treat remote access as if it were local access:
1. Existing LLVM Optimizations
2. PGAS-aware Optimizations
Runtime-specific lowering then turns the remaining remote accesses into communication API calls.
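The final lowering step can be pictured with a toy instruction list. This is a minimal sketch under stated assumptions: the tuple encoding of instructions and the names `lower`, `runtime_get`, and `runtime_put` are hypothetical, and only the use of address space 100 for possibly-remote pointers comes from the slide.

```python
REMOTE_AS = 100  # address space that marks possibly-remote pointers


def lower(instrs):
    """Rewrite possibly-remote loads/stores into runtime calls; keep the rest.

    Each instruction is a hypothetical (opcode, address_space, operand) tuple.
    """
    out = []
    for op, addrspace, operand in instrs:
        if addrspace == REMOTE_AS and op == "load":
            out.append(("call", "runtime_get", operand))
        elif addrspace == REMOTE_AS and op == "store":
            out.append(("call", "runtime_put", operand))
        else:
            out.append((op, addrspace, operand))  # ordinary local access
    return out


program = [
    ("store", 100, "%x"),   # possibly-remote store
    ("load", 0, "%tmp"),    # local load, untouched by lowering
    ("load", 100, "%y"),    # possibly-remote load
]
lowered = lower(program)
```

Because lowering runs last, every pass before it sees plain loads and stores and can optimize them without knowing anything about the runtime.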
13 LLVM-based Communication Optimizations for Chapel
1. Enabling existing LLVM passes
   - Loop invariant code motion (LICM)
   - Scalar replacement, ...
2. Aggregation
   - Combine sequences of loads/stores on adjacent memory locations into a single memcpy
These are already implemented in the standard Chapel compiler
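14 An optimization example: LICM for Communication Optimizations
    for i in ... {
      %x = load i64 addrspace(100)* %xptr
      A(i) = %x;
    }
LICM by LLVM hoists the loop-invariant load of %x out of the loop, so the possibly-remote access happens once instead of on every iteration.
LICM = Loop Invariant Code Motion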
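The effect of hoisting on communication can be counted directly. The sketch below is a Python model, not LLVM's pass: `remote_load`, `fill_unoptimized`, and `fill_hoisted` are hypothetical names, and each call to `remote_load` stands for one possibly-remote GET. The hoisting is valid here only because nothing inside the loop writes to the location being loaded.

```python
gets = 0  # number of simulated remote GETs


def remote_load(cell):
    """Stand-in for a load from addrspace(100), i.e. a possibly-remote GET."""
    global gets
    gets += 1
    return cell[0]


def fill_unoptimized(xptr, n):
    a = [0] * n
    for i in range(n):
        x = remote_load(xptr)  # loop-invariant, but reloaded every iteration
        a[i] = x
    return a


def fill_hoisted(xptr, n):
    a = [0] * n
    x = remote_load(xptr)      # hoisted: a single GET before the loop
    for i in range(n):
        a[i] = x
    return a


xptr = [42]

before = fill_unoptimized(xptr, 8)
gets_before = gets

gets = 0
after = fill_hoisted(xptr, 8)
gets_after = gets
```

Both versions produce the same array; only the number of simulated GETs differs (one per iteration versus one in total).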
15 An optimization example: Aggregation
    // p is possibly remote
    sum = p.x + p.y;
Before: two loads, each lowered to a GET
    load i64 addrspace(100)* %pptr+0    // x: GET
    load i64 addrspace(100)* %pptr+4    // y: GET
After: one bulk copy, lowered to a single GET
    llvm.memcpy(...);
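Aggregation can be sketched the same way by counting messages. In this Python model the record layout is an assumption made for the example (two adjacent 8-byte little-endian integers, which is not necessarily the layout on the slide), and `get`, `sum_unaggregated`, and `sum_aggregated` are hypothetical names.

```python
import struct

gets = 0  # number of simulated remote GETs

# Hypothetical remote record p with two adjacent 8-byte integer fields x and y.
REMOTE = struct.pack("<qq", 3, 4)


def get(offset, size):
    """Stand-in for one remote GET of `size` bytes starting at `offset`."""
    global gets
    gets += 1
    return REMOTE[offset:offset + size]


def sum_unaggregated():
    x = struct.unpack("<q", get(0, 8))[0]  # GET for p.x
    y = struct.unpack("<q", get(8, 8))[0]  # GET for p.y
    return x + y


def sum_aggregated():
    buf = get(0, 16)                       # one bulk GET covering both fields
    x, y = struct.unpack("<qq", buf)       # both fields are now local
    return x + y


plain = sum_unaggregated()
gets_plain = gets

gets = 0
bulk = sum_aggregated()
gets_bulk = gets
```

The bulk version moves the same 16 bytes but pays the per-message cost once, which is where the benefit comes from when latency dominates.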
16 LLVM-based Communication Optimizations for Chapel
3. Locality Optimization
   - Infer the locality of data and convert possibly-remote access to definitely-local access at compile time if possible
4. Coalescing
   - Remote array access vectorization
These are implemented, but not in the standard Chapel compiler
17 An Optimization example: Locality Optimization
    proc habanero(ref x, ref y, ref z) {
      var p: int = 0;
      var A:[1..N] int;
      local { p = z; }
      z = A(0) + z;
    }
1. A is definitely local
2. p and z are definitely local
3. Definitely-local access (avoids runtime affinity checking)
18 An Optimization example: Coalescing
Before:
    for i in 1..N {
      ... = A(i);
    }
After:
    localA = A;            // Perform bulk transfer
    for i in 1..N {
      ... = localA(i);     // Converted to definitely-local access
    }
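A small model makes the message counts of the two loops explicit. This is a sketch under stated assumptions: `RemoteArray`, `get`, `get_bulk`, `total_before`, and `total_after` are hypothetical names, and each method call stands for one message to the owning node.

```python
class RemoteArray:
    """Hypothetical array owned by another node; every call is one message."""

    def __init__(self, data):
        self._data = list(data)
        self.messages = 0

    def get(self, i):
        """Fetch a single element (one message)."""
        self.messages += 1
        return self._data[i]

    def get_bulk(self):
        """Fetch the whole array in a single bulk transfer (one message)."""
        self.messages += 1
        return list(self._data)


def total_before(A, n):
    # One remote access per iteration.
    return sum(A.get(i) for i in range(n))


def total_after(A, n):
    localA = A.get_bulk()                    # perform bulk transfer once
    return sum(localA[i] for i in range(n))  # definitely-local accesses


a1 = RemoteArray(range(1, 101))
r1 = total_before(a1, 100)

a2 = RemoteArray(range(1, 101))
r2 = total_after(a2, 100)
```

The trade-off is that the bulk transfer needs a local buffer and copies the whole array even if the loop only touches part of it.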
19 Performance Evaluations: Benchmarks
Application        Size
Smith-Waterman     185,600 x 19,000
Cholesky Decomp    10,000 x 10,000
NPB EP             CLASS = D
Sobel              48,000 x 48,000
SSCA Kernel 4      SCALE = 16
Stream EP          ^0
20 Performance Evaluations: Platforms
- Cray XC30 (NERSC)
  - Node
    - Intel Xeon x 4 cores
    - 64 GB of RAM
  - Interconnect
    - Cray Aries interconnect with Dragonfly topology
- Westmere cluster (Rice)
  - Node
    - Intel Xeon CPU x 1 cores
    - 48 GB of RAM
  - Interconnect
    - Quad data rate (QDR) InfiniBand
21 Performance Evaluations: Details of Compiler & Runtime
- Compiler
  - Chapel Compiler version
  - LLVM.
- Runtime
  - GASNet-1..0
    - Cray XC: aries
    - Westmere Cluster: ibv-conduit
  - Qthreads-1.10
    - Cray XC: shepherds, 4 workers / shepherd
    - Westmere Cluster: shepherds, 6 workers / shepherd
22 Performance Evaluation BRIEF SUMMARY OF PERFORMANCE EVALUATIONS
23 Results on the Cray XC (LLVM-unopt vs. LLVM-allopt)
[Figure: performance improvement over LLVM-unopt (higher is better) for SW, Cholesky, Sobel, StreamEP, EP, and SSCA, broken down by Existing, Locality Opt, Aggregation, and Coalescing.]
- 4.6x performance improvement relative to LLVM-unopt on the same number of locales on average (1, 2, 4, 8, 16, 32, 64 locales)
24 Results on Westmere Cluster (LLVM-unopt vs. LLVM-allopt)
[Figure: performance improvement over LLVM-unopt for SW, Cholesky, Sobel, StreamEP, EP, and SSCA, broken down by Existing, Locality Opt, Aggregation, and Coalescing.]
- 4.4x performance improvement relative to LLVM-unopt on the same number of locales on average (1, 2, 4, 8, 16, 32, 64 locales)
25 Performance Evaluation DETAILED RESULTS & ANALYSIS OF CHOLESKY DECOMPOSITION
26 Cholesky Decomposition
[Figure: data distributed across Node0, Node1, Node2, and Node3, with dependencies marked.]
27 Metrics
1. Performance & Scalability
   - Baseline (LLVM-unopt)
   - LLVM-based Optimizations (LLVM-allopt)
2. The dynamic number of communication API calls
3. Analysis of optimized code
4. Performance comparison
   - Conventional C-backend vs. LLVM-backend
28 Performance Improvement by LLVM (Cholesky on the Cray XC)
[Figure: speedup over LLVM-unopt on 1 locale for LLVM-unopt and LLVM-allopt at 1, 2, 4, 8, 16, and 32 locales.]
- LLVM-based communication optimizations show scalability
29 Communication API calls elimination by LLVM (Cholesky on the Cray XC)
[Figure: dynamic number of communication API calls, normalized to LLVM-unopt (100.0%), for LLVM-unopt and LLVM-allopt in four categories: LOCAL_GET, REMOTE_GET, LOCAL_PUT, REMOTE_PUT, with the improvement factor annotated per category.]
30 Analysis of optimized code
LLVM-unopt:
    for jb in zero..tilesize-1 {
      for kb in zero..tilesize-1 {
        // 4 GETs
        for ib in zero..tilesize-1 {
          // 9 GETs + 1 PUT
        }
      }
    }
LLVM-allopt:
    // 1. ALLOCATE LOCAL BUFFER
    // 2. PERFORM BULK TRANSFER
    for jb in zero..tilesize-1 {
      for kb in zero..tilesize-1 {
        // 1 GET
        for ib in zero..tilesize-1 {
          // 1 GET + 1 PUT
        }
      }
    }
31 Performance comparison with C-backend
[Figure: speedup over LLVM-unopt on 1 locale for C-backend, LLVM-unopt, and LLVM-allopt at 1, 2, 4, 8, 16, 32, and 64 locales.]
- C-backend is faster!
32 Current limitation
For C code generation: 128-bit struct pointer
    ptr.locale;
    ptr.addr;
For LLVM code generation: 64-bit packed pointer, Locale (16 bits) + addr (48 bits)
    ptr >> 48
    ptr & 48BITS_MASK;
The packed pointer has two costs:
1. Needs more instructions
2. Loses opportunities for alias analysis
- In the LLVM version used here, many optimizations assume that the pointer size is the same across all address spaces
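The 64-bit packed layout (16-bit locale, 48-bit address) can be written out as plain bit arithmetic. This is a minimal sketch: the function names `pack`, `locale_of`, and `addr_of` are hypothetical, and `ADDR_MASK` plays the role of the slide's `48BITS_MASK`.

```python
ADDR_BITS = 48
ADDR_MASK = (1 << ADDR_BITS) - 1  # plays the role of 48BITS_MASK
LOCALE_MASK = (1 << 16) - 1       # locale ids must fit in 16 bits


def pack(locale: int, addr: int) -> int:
    """Pack a locale id and an address into one 64-bit value."""
    if not 0 <= locale <= LOCALE_MASK:
        raise ValueError("locale does not fit in 16 bits")
    if not 0 <= addr <= ADDR_MASK:
        raise ValueError("address does not fit in 48 bits")
    return (locale << ADDR_BITS) | addr


def locale_of(ptr: int) -> int:
    """Extract the locale: ptr >> 48."""
    return ptr >> ADDR_BITS


def addr_of(ptr: int) -> int:
    """Extract the address: ptr & 48BITS_MASK."""
    return ptr & ADDR_MASK


p = pack(5, 0x7FFFDEADBEEF)
```

Each field access costs a shift or a mask, whereas the struct form reads a member directly; that is the "more instructions" cost named on the slide.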
33 Conclusions
- LLVM-based communication optimizations for PGAS programs
  - Promising way to optimize PGAS programs in a language-agnostic manner
  - Preliminary evaluation with 6 Chapel applications
    - Cray XC30 supercomputer: 4.6x average performance improvement
    - Westmere cluster: 4.4x average performance improvement
34 Future work
- Extend LLVM IR to support parallel programs with PGAS and explicit task parallelism
  - Higher-level IR
[Diagram: Parallel Programs (Chapel, X10, CAF, HC, ...)
 -> Runtime-Independent Optimizations, e.g. Task Parallel Construct (1. RI-PIR Gen, 2. Analysis, 3. Transformation), with LLVM
 -> Runtime-Specific Optimizations, e.g. GASNet API (1. RS-PIR Gen, 2. Analysis, 3. Transformation), with LLVM
 -> Binary]
35 Acknowledgements
- Special thanks to
  - Brad Chamberlain (Cray)
  - Rafael Larrosa Jimenez (UMA)
  - Rafael Asenjo Plaza (UMA)
  - Habanero Group at Rice
36 Backup slides
37 Compilation Flow
Chapel Programs -> AST Generation and Optimizations, then one of:
- C-code Generation -> C Programs -> Backend Compiler's Optimizations (e.g. gcc -O) -> Binary
- LLVM IR Generation -> LLVM IR -> LLVM Optimizations -> Binary
What is a Compiler? Compiler A program that translates code in one language (source code) to code in another language (target code). Usually, target code is semantically equivalent to source code, but
More informationIn-Network Computing. Sebastian Kalcher, Senior System Engineer HPC. May 2017
In-Network Computing Sebastian Kalcher, Senior System Engineer HPC May 2017 Exponential Data Growth The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait
More informationPerformance and Energy Usage of Workloads on KNL and Haswell Architectures
Performance and Energy Usage of Workloads on KNL and Haswell Architectures Tyler Allen 1 Christopher Daley 2 Doug Doerfler 2 Brian Austin 2 Nicholas Wright 2 1 Clemson University 2 National Energy Research
More informationCnC-HC. a programming model for CPU-GPU hybrid parallelism. Alina Sbîrlea, Zoran Budimlic, Vivek Sarkar Rice University
CnC-HC a programming model for CPU-GPU hybrid parallelism Alina Sbîrlea, Zoran Budimlic, Vivek Sarkar Rice University Acknowledgements CnC-CUDA: Declarative Programming for GPUs, Max Grossman, Alina Simion-Sbirlea,
More informationLecture 2 Parallel Programming Platforms
Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationJCudaMP: OpenMP/Java on CUDA
JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems
More informationPerformance Report Guidelines. Babak Behzad, Alex Brooks, Vu Dang 12/04/2013
Performance Report Guidelines Babak Behzad, Alex Brooks, Vu Dang 12/04/2013 Motivation We need a common way of presenting performance results on Blue Waters! Different applications Different needs Different
More informationIntel Cluster Toolkit Compiler Edition 3.2 for Linux* or Windows HPC Server 2008*
Intel Cluster Toolkit Compiler Edition. for Linux* or Windows HPC Server 8* Product Overview High-performance scaling to thousands of processors. Performance leadership Intel software development products
More informationLoop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization
Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization Yulei Sui, Xiaokang Fan, Hao Zhou and Jingling Xue School of Computer Science and Engineering The University of
More informationEvaluation of PGAS Communication Paradigms With Geometric Multigrid
Lawrence Berkeley National Laboratory Evaluation of PGAS Communication Paradigms With Geometric Multigrid Hongzhang Shan, Amir Kamil, Samuel Williams, Yili Zheng, and Katherine Yelick Lawrence Berkeley
More informationAn Overview of Fujitsu s Lustre Based File System
An Overview of Fujitsu s Lustre Based File System Shinji Sumimoto Fujitsu Limited Apr.12 2011 For Maximizing CPU Utilization by Minimizing File IO Overhead Outline Target System Overview Goals of Fujitsu
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Agenda
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Agenda 1 Agenda-Day 1 HPC Overview What is a cluster? Shared v.s. Distributed Parallel v.s. Massively Parallel Interconnects
More informationMetropolitan Road Traffic Simulation on FPGAs
Metropolitan Road Traffic Simulation on FPGAs Justin L. Tripp, Henning S. Mortveit, Anders Å. Hansson, Maya Gokhale Los Alamos National Laboratory Los Alamos, NM 85745 Overview Background Goals Using the
More information