LLVM and Clang on the Most Powerful Supercomputer in the World


Hal Finkel (Argonne National Laboratory)
The 2012 LLVM Developers' Meeting, November 7, 2012

Outline
1. Introduction: ANL and the ALCF; the BG/Q
2. The A2 Core and QPX
3. Porting LLVM and Clang
4. Results
5. Conclusion

What is Argonne National Laboratory?
Argonne National Laboratory (ANL), located just outside of Chicago, is one of the U.S. Department of Energy's largest national laboratories for scientific and engineering research. We have over 1,250 scientists and engineers, and over 300 postdoctoral researchers, in 13 research divisions and 6 national scientific user facilities.

What is the Argonne Leadership Computing Facility?
The Argonne Leadership Computing Facility (ALCF) is one of two leadership computing facilities supported by the U.S. Department of Energy (DOE). The ALCF provides the computational science community with computing capability dedicated to breakthrough science and engineering, targeting at-scale run campaigns. The ALCF is now home to Mira, a 10-petaflop IBM Blue Gene/Q system with 49,152 compute nodes. We also continue to operate Intrepid, a 557-teraflop IBM Blue Gene/P system.

The Blue Gene/Q (BG/Q)
ALCF's Mira BG/Q system: [photo of the Mira installation]

The Blue Gene/Q (BG/Q)
[figure]

The Top500 List of Supercomputing Sites
[screenshot of the Top500 list]

The Green500 List of Supercomputing Sites
(entries 1 through 20 are all BG/Qs; in 21st place is an Intel Xeon/MIC cluster @ 1,380.67 MFLOPS/W)

The BG/Q Node
- 16 user-accessible cores, 1 system core, and 1 spare, @ 1.6 GHz (204.8 peak GFLOPS)
- 64-bit PowerPC ISA + 4-way double-precision SIMD
- 16 KB L1 D-cache, 16 KB L1 I-cache
- 32 MB shared L2 cache
  - multiversioned: supports transactional memory and speculative execution
  - built-in support for atomic operations
- Dual memory controllers
  - 16 GB external 1.333 GHz DDR3
  - 42.6 GB/s DDR3 bandwidth
- Network with 5D torus topology
  - 2 GB/s send and 2 GB/s receive
  - DMA with remote put/get and collective operations
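Where the 204.8 peak GFLOPS figure comes from (a quick derivation; it assumes each core retires one 4-wide fused multiply-add per cycle, counting an FMA as two flops):

    \text{peak} = 16~\text{cores} \times 1.6~\text{GHz} \times 4~\text{SIMD lanes} \times 2~\tfrac{\text{flops}}{\text{FMA}} = 204.8~\text{GFLOPS}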

The A2 Core
The embedded 64-bit A2 core:
- Two pipelines: one for floating point, one for everything else
- Four hardware threads; one instruction can be dispatched to each pipeline at each cycle, from different threads
- Most instructions, including L1 loads and stores, don't stall

QPX
Quad-Vector Floating-Point (QPX) extends the regular PowerPC floating-point registers to vectors of four:
- The scalar floating-point registers alias the first element of each corresponding vector register
- A scalar floating-point load splats the loaded value into all vector elements
- Single-precision floating point is supported via instruction variants that round to single precision
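A minimal sketch of the splat semantics, written with Clang's generic vector extensions rather than real QPX intrinsics (the type name vector4double matches IBM's vendor intrinsics; splat_load is an illustrative helper, not an actual instruction name):

    typedef double vector4double __attribute__((vector_size(32)));

    /* On QPX a scalar load leaves the loaded value replicated across all
       four vector lanes; this models that splat explicitly. */
    static inline vector4double splat_load(const double *p) {
        double s = *p;                       /* scalar load */
        return (vector4double){s, s, s, s};  /* ...splatted into every lane */
    }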

QPX Instructions
- Vector loads and stores, both single-precision and double-precision; loading and storing of two complex values
- General permutations, alignment-based permutations (to support unaligned accesses), subvector extraction, and splat
- Vector loads and stores of integer values, float/int conversions (but no other integer operations)
- Negate, abs, negative abs, copy sign, rounding (to single precision, and to integer in various ways)
- Add, sub, mul, reciprocal estimate, reciprocal square-root estimate
- Mul-add, mul-sub, negative mul-add, negative mul-sub, and various cross mul-adds for complex arithmetic (see the sketch below)
- Compare less-than, compare greater-than, compare equal, test for NaN
- Generalized boolean operations, vector select
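The "cross" mul-add forms exist because a complex multiply mixes real and imaginary lanes. The scalar pattern they accelerate looks like this (plain C for illustration; one QPX register holds two such complex values):

    #include <complex.h>

    /* acc += a * b for complex operands: the four cross products below
       are what the cross mul-add instructions compute across lanes. */
    static inline double complex cmla(double complex acc,
                                      double complex a, double complex b) {
        double re = creal(acc) + creal(a) * creal(b) - cimag(a) * cimag(b);
        double im = cimag(acc) + creal(a) * cimag(b) + cimag(a) * creal(b);
        return re + im * I;
    }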

QPX Booleans
In QPX, floating-point values are also booleans:
- Values greater than or equal to ±0.0 are true
- Values less than 0.0, or NaN, are false
- Instructions that produce booleans produce ±1.0
Instead of providing specific boolean operations, one instruction is provided that takes an arbitrary truth table.
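A scalar model of those conventions, just to pin down the semantics (a sketch; qpx_truth, qpx_bool, and qpx_select are illustrative names, not real intrinsics):

    #include <math.h>
    #include <stdbool.h>

    /* "True" is any non-NaN value >= 0.0; -0.0 compares equal to 0.0,
       so both signed zeros count as true. */
    static bool qpx_truth(double v) {
        return !isnan(v) && v >= 0.0;
    }

    /* Boolean-producing instructions emit +1.0 for true, -1.0 for false. */
    static double qpx_bool(bool b) {
        return b ? 1.0 : -1.0;
    }

    /* Vector select works per lane; this models a single lane. */
    static double qpx_select(double cond, double if_true, double if_false) {
        return qpx_truth(cond) ? if_true : if_false;
    }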

Why?
- Provide a high-performance and up-to-date C/C++ compiler over the lifetime of the machine (the fact that Clang has excellent diagnostics and a static-analysis framework makes this even more appealing)
- Provide access to other languages that use LLVM as a backend (such as Intel's ISPC, and various scripting languages)
- Provide a platform for compiler research supporting the BG/Q:
  - Autovectorization
  - Parallelization (including transactional memory and speculative execution)
  - Communication-related optimizations and distributed systems

Tagged-Type Diagnostics
The tagged-type diagnostics, developed by Dmitri Gribenko, are an important new feature in Clang that will greatly benefit the HPC community. To my knowledge, no other compiler can produce these kinds of warnings, and they will be extremely valuable to our users. The wider community can also benefit by applying the relevant attributes to, for example, some POSIX APIs.
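A sketch of how these diagnostics are wired up, following the attribute names in Clang's documentation (type_tag_for_datatype and pointer_with_type_tag); the MPI declarations and the tag value 42 below are simplified stand-ins for a real mpi.h:

    typedef int MPI_Datatype;

    /* Tie the integral tag value 42 to the C type 'int'; a real header
       would do this for every predefined datatype handle. */
    static const MPI_Datatype mpi_datatype_int
        __attribute__((type_tag_for_datatype(mpi, int))) = 42;
    #define MPI_INT ((MPI_Datatype) 42)

    /* Argument 1 is a buffer whose pointee type must match the type tag
       passed as argument 3. */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype)
        __attribute__((pointer_with_type_tag(mpi, 1, 3)));

    void example(void) {
        double x = 1.0;
        MPI_Send(&x, 1, MPI_INT);  /* Clang warns: argument is 'double *'
                                      but the 'mpi' type tag requires
                                      'int *' */
    }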

The Easy Parts
Modifications to the PowerPC backend:
- Register definitions, calling conventions, declaring instruction legality, patterns for basic arithmetic operations, loads and stores, etc.
- Developing an itinerary for the A2 core based on IBM's documentation, and making associated modifications to the subtarget code
- Adapting the Hexagon hardware-loops pass to do the equivalent thing for PowerPC
- Adding support for QPX intrinsics in Clang, and developing a header file compatible with the corresponding vendor-compiler intrinsics (a sketch follows below)
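A flavor of what such a compatibility header can look like when built on Clang's generic vector extensions (an assumption-laden sketch: vector4double, vec_madd, and vec_splats follow the vendor intrinsic names, but the bodies here are illustrative, not the actual header):

    /* Hypothetical excerpt of a QPX-compatibility header. */
    typedef double vector4double __attribute__((vector_size(32)));

    /* Element-wise a * b + c; the backend can lower this to the QPX
       fused multiply-add. */
    static inline vector4double vec_madd(vector4double a, vector4double b,
                                         vector4double c) {
        return a * b + c;
    }

    /* Replicate a scalar across all four lanes. */
    static inline vector4double vec_splats(double s) {
        return (vector4double){s, s, s, s};
    }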

The More-Difficult Parts
- Developing a basic-block autovectorizer (see the presentation from the 2012 Euro-LLVM conference, and come to the BoF later at this meeting; an example of the code it targets follows below)
- Modifications to the SelectionDAG builder to loosen the critical-chain restrictions
- Adding support for v4i1 booleans to support QPX logical operations (still in progress)
- Cleaning up, modernizing, and fixing bugs in the PowerPC backend (IBM has recently started to help with this)
- Fixing bugs elsewhere in LLVM exposed by the new target code
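To see what a basic-block (straight-line) autovectorizer buys you, consider this illustrative kernel: the four independent multiply-adds sit in one basic block, with no loop for a loop vectorizer to find, yet they can be fused into a single 4-wide QPX FMA (daxpy4 is a hypothetical example, not from the talk):

    /* Four independent scalar multiply-adds in one basic block; a
       basic-block vectorizer can merge them into one vector FMA plus
       vector loads and stores. */
    void daxpy4(double *restrict y, const double *restrict x, double a) {
        y[0] = a * x[0] + y[0];
        y[1] = a * x[1] + y[1];
        y[2] = a * x[2] + y[2];
        y[3] = a * x[3] + y[3];
    }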

Boost ublas: Benchmark 1
Dense matrix and vector operations: [performance chart]

Boost ublas: Benchmark 2
Sparse matrix and vector operations: [performance chart]

Boost ublas: Benchmark 3
Vector and matrix proxy operations: [performance chart]

Boost ublas: Benchmark 4
Dense matrix and vector operations with boost::numeric::interval: [performance chart]

TSVC
The first few tests from the TSVC autovectorization benchmark: [performance chart]

TSVC
The first few tests from the TSVC autovectorization benchmark (without vectorization): [performance chart]

TSVC
The first few tests from the TSVC autovectorization benchmark (without vectorization, compared to gcc for the BG/Q): [performance chart]

Future Work
In addition to general LLVM work:
- Full support for generating all QPX instructions (this will require further enhancement of the BB vectorizer and the loop vectorizer, more target-level DAG-combiner enhancements, and support for the single-precision-rounded variants), plus support for vendor-supplied vectorized math libraries
- More PowerPC backend enhancements: for example, better handling of condition registers (especially how they are spilled)
- Parallelization support (especially, but not limited to, OpenMP)
- Higher-level loop transformations (using Polly)
- MPI-specific optimizations
Make LLVM (with Clang) a powerful force in HPC!

Acknowledgements
Going from an idea to a production-stable, high-performance, autovectorizing compiler for the BG/Q is a long road (and we're not quite there yet), and I've had a lot of help:
- Roman Divacky, Tobias von Koch, and others who have contributed to the PowerPC backend (now including several people from IBM's LTC: Bill Schmidt, Will Schmidt, Ulrich Weigand, Adhemerval Zanella, and Peter Bergner)
- Tobi Grosser, Nadav Rotem, and others who have helped with development of the vectorizer
- Andy Trick and Jakob S. Olesen (for answering many questions on scheduling, register allocation, etc.)
- The LLVM and Clang development communities
- ALCF, ANL, and DOE

Science Areas
(science areas at NERSC, from a 2009 presentation by John Shalf) [chart]