Compilers and Run-Time Systems for High-Performance Computing

Compilers and Run-Time Systems for High-Performance Computing
Blurring the Distinction between Compile-Time and Run-Time
Ken Kennedy, Rice University
http://www.cs.rice.edu/~ken/presentations/compilerruntime.pdf

Context
- Explosive growth of information technology
  - Now represents 20 percent of the economy, 35 percent of GDP growth
  - Essential to the operation of most organizations, especially government
- Enormous demand for software
  - Shortage of IT professionals; challenge: double the number of CS graduates
- Complex computer architectures
  - Deep memory hierarchies, high degrees of parallelism
  - Heterogeneous, geographically distributed platforms
  - Changes in the performance of nodes and links during execution
- Complex applications
  - Many diverse components; dynamic, adaptive, unstructured

Philosophy
- Compiler technology = off-line processing
  - Goals: improved performance and language usability
  - Making it practical to use the full power of the language
- Trade-off: preprocessing time versus execution time
  - Rule: the performance of both compiler and application must be acceptable to the end user
- Examples
  - Macro expansion: the PL/I macro facility saw a 10x improvement with compilation
  - Query processing: dramatic improvement in speed through planning
  - Communication planning in dynamic applications: develop efficient communication schedules at run time

Making Languages Usable
"It was our belief that if FORTRAN, during its first months, were to translate any reasonable scientific source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger... I believe that had we failed to produce efficient programs, the widespread use of languages like FORTRAN would have been seriously delayed."
John Backus

Preliminary Conclusions
- The definition of "application" will become fuzzy
  - Knowledge of the computation will be revealed in stages
  - Examples: compilation with input data; compiler-generated run-time preprocessing; optimization with late binding of the target platform; compilation based on predefined component libraries
- Performance will be more elusive
  - Even reliable performance will be hard to achieve
  - The compiler will need to be even more heroic, yet the programmer will continue to want control
- Compiler structure will be more flexible
  - Compilation will be carried out in stages

Compiling with Data
[Diagram, built up over three slides: the Program and slowly-changing data feed the Compiler, which produces a Reduced Application; the Reduced Application then consumes rapidly-changing data to produce Answers.]

Run-Time Compilation
[Diagram, built up over three slides: the Compiler translates the Program and emits a Pre-Optimizer; at run time the Pre-Optimizer specializes the Application to slowly-changing data, and the Application consumes rapidly-changing data to produce Answers.]

Bandwidth as Limiting Factor
Program and machine balance:
- Program balance: average number of bytes that must be transferred from memory per floating-point operation
- Machine balance: average number of bytes the machine can transfer from memory per floating-point operation

Application    Flops   L1->Reg   L2->L1   Mem->L2
Convolution    1       6.4       5.1      5.2
Dmxpy          1       8.3       8.3      8.4
Mmjki (o2)     1       24.0      8.2      5.9
FFT            1       8.3       3.0      2.7
SP             1       10.8      6.4      4.9
Sweep3D        1       15.0      9.1      7.8
SGI Origin     1       4         4        0.8

(All balances in bytes per flop; the SGI Origin row gives the machine balance for comparison.)
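The table above can be turned into a back-of-the-envelope performance bound. The following sketch is not from the talk's tooling; it simply encodes the observation that when a program demands more bytes per flop than the machine can deliver, the sustainable fraction of peak is the ratio of the two balances.

```python
# Sketch: bounding achievable performance from balance numbers.

def peak_fraction(machine_balance, program_balance):
    """Sustainable fraction of peak flops, bandwidth permitting.

    machine_balance: bytes/flop the machine can move from memory
    program_balance: bytes/flop the program must move from memory
    """
    if program_balance <= machine_balance:
        return 1.0                      # compute-bound: bandwidth not the limiter
    return machine_balance / program_balance

# From the table: Sweep3D needs 7.8 bytes/flop from memory, but the
# SGI Origin delivers only about 0.8 bytes/flop.
print(f"{peak_fraction(0.8, 7.8):.0%} of peak")  # 10% of peak
```

On these numbers every application in the table is memory-bound at the Mem->L2 level, which is the talk's point: bandwidth, not flops, is the limiting factor.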

Cache and Bandwidth
[Figure, built up over three slides, showing how much of each transferred line is used when the program touches a single 8-byte word:]
- Register: 8 bytes, 100% utilization
- L1 cache line: 32 bytes, 25% utilization
- L2 cache line: 128 bytes, 6.25% utilization
- Memory
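The percentages in the figure follow directly from the line sizes. A minimal sketch of that arithmetic, assuming one 8-byte double is used per line fetched:

```python
# Sketch: cache-line utilization when only one 8-byte word per
# transferred line is actually used by the computation.

def line_utilization(bytes_used, line_size_bytes):
    """Fraction of each transferred line the program actually uses."""
    return bytes_used / line_size_bytes

word = 8  # one double-precision value
for level, line in (("register (8 B)", 8),
                    ("L1 line (32 B)", 32),
                    ("L2 line (128 B)", 128)):
    print(f"{level}: {line_utilization(word, line):.2%}")
```

Poor spatial locality therefore wastes most of the scarce memory bandwidth, which motivates the data-packing transformations on the next slides.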

Dynamic Data Packing
- Suppose the calculation is irregular
  - Example: molecular dynamics
    - Force calculations (pairs of forces)
    - Updating locations (single force per update)
- Strategy
  - Dynamically reorganize data, so locations used together are updated together
  - Dynamically reorganize interactions, so indirect accesses are not needed
  - Example: first touch, assigning elements to cache lines in the order of first touch by the pairs calculation

First-Touch Ordering
[Figure, built up over three slides: particles in the original ordering P5, P1, P4, P3, P2; the interaction-pair list touches them in a different order; renumbering each particle at its first touch yields the ordering P1, P2, P3, P4, P5, so interacting particles land near each other in memory.]
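The renumbering itself is a single pass over the pair list. A sketch, assuming particles are indexed 0..n-1 and the pair list reflects the traversal order of the force loop (the particle indices below are hypothetical):

```python
# Sketch of first-touch renumbering: each particle gets its new index
# the first time the pair computation touches it, so data used together
# ends up adjacent in memory (and on shared cache lines).

def first_touch_order(pairs, n_particles):
    """Map old particle index -> new index, in order of first touch."""
    new_index = {}
    for a, b in pairs:
        for p in (a, b):
            if p not in new_index:
                new_index[p] = len(new_index)
    for p in range(n_particles):          # untouched particles go last
        new_index.setdefault(p, len(new_index))
    return new_index

# Hypothetical interaction list over five particles:
order = first_touch_order([(4, 0), (4, 3), (2, 0), (0, 1)], 5)
print(order)  # {4: 0, 0: 1, 3: 2, 2: 3, 1: 4}
```

The map is then used to permute the particle arrays and rewrite the pair list, after which the indirect accesses walk memory nearly sequentially.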

Performance Results 1: Moldyn
[Bar chart comparing original, data regrouping, base packing, and optimized packing, normalized to the original (1.0), for execution time, L1 misses, L2 misses, and TLB misses.]

Performance Results 2: Magi
[Bar chart with the same four variants and metrics as the Moldyn results: execution time, L1 misses, L2 misses, and TLB misses, normalized to the original.]

Irregular Multilevel Blocking
- Associate a tuple of block numbers with each particle
  - One block number per level of the memory hierarchy
  - Block number = selected bits of the particle address (fields A, B, C give the L2, TLB, and L1 block numbers)
- For an interaction pair, interleave the block numbers of the two particles: A A' B B' C C'
- Sorting by this composite block number yields multilevel blocking
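The composite key can be built with a few shifts and masks. A sketch, with hypothetical bit-field widths (the slide only says each level's block number is a slice of the particle's address, interleaved coarsest level first):

```python
# Sketch: composite block keys for irregular multilevel blocking.
# Field widths below are invented for illustration.

L1_BITS, TLB_BITS = 3, 3   # low bits: L1 block; middle bits: TLB block

def blocks(addr):
    """Split an address into (L2, TLB, L1) block numbers."""
    l1 = addr & ((1 << L1_BITS) - 1)
    tlb = (addr >> L1_BITS) & ((1 << TLB_BITS) - 1)
    l2 = addr >> (L1_BITS + TLB_BITS)          # remaining high bits
    return l2, tlb, l1

def composite_key(a, b):
    """Interleave the two particles' block numbers, coarsest first."""
    (a2, at, a1), (b2, bt, b1) = blocks(a), blocks(b)
    return (a2, b2, at, bt, a1, b1)

# Sorting interaction pairs by the composite key blocks for the L2,
# TLB, and L1 levels simultaneously.
pairs = [(500, 12), (3, 700), (500, 13), (2, 701)]
pairs.sort(key=lambda p: composite_key(*p))
print(pairs)  # [(2, 701), (3, 700), (500, 12), (500, 13)]
```

One sort thus replaces a separate blocking pass per memory-hierarchy level.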

Dynamic Optimization
[Diagram, built up over three slides: the Compiler translates the Program; a Dynamic Optimizer (an optimizing loader) specializes it to the configuration and data; the resulting Application consumes rapidly-changing data to produce Answers.]

A Software Grand Challenge
Application development and performance management for Grids.
- Problems
  - Reliable performance on heterogeneous platforms
  - Varying load, on computation nodes and on communication links
- Challenges
  - Presenting a high-level programming interface: if programming is hard, it's useless
  - Designing applications for adaptability
  - Mapping applications to dynamically changing architectures
  - Determining when to interrupt execution and remap: application monitors, performance estimators

National Distributed Computing
[Figure, built up over several slides: a national network linking supercomputers and databases at multiple sites.]

What Is a Grid?
- A collection of computing resources
  - Varying in power or architecture
  - Potentially dynamically varying in load
  - Unreliable? No hardware shared memory
- Interconnected by a network
  - Links may vary in bandwidth; load may vary dynamically
- Distribution: across a room, campus, state, nation, or the globe
- Inclusiveness: a distributed-memory parallel computer is a degenerate case

Globus
- Developed by Ian Foster and Carl Kesselman, originally to support the I-Way (SC-96)
- Basic services for distributed computing
  - Accounting, resource directory, user authentication, job initiation
  - Communication services (Nexus and MPI)
- Applications are programmed by hand
  - The user is responsible for resource mapping and all communication
  - Many applications, most developed with the Globus team
  - Even the Globus developers acknowledge how hard this is

What is Needed
- Compiler and language support for reliable performance
  - Dynamic reconfiguration, optimization for distributed targets
- Abstract Grid programming models
  - Design of an implementation strategy for those models
- Easy-to-use programming interfaces
  - Problem-solving environments
- Robust, reliable numerical and data-structure libraries
  - Predictability and robustness of accuracy and performance
  - Reproducibility, fault tolerance, and auditability
- Performance monitoring and control strategies
  - Deep integration across compilers, tools, and runtime systems
  - Performance contracts and dynamic reconfiguration

Programming Models
- Distributed collection of objects (for serious experts)
  - Message passing for communication
- Problem-solving environment (for non-experts)
  - Packaged components; graphical or scripting language for glue
- Distribution of shared-memory programs (for experts)
  - Language-based decomposition specification from the programmer
  - Parametrizable for reconfiguration
  - Example: reconfigurable distributed arrays (DAGH), implemented as a distributed object collection with implicit or explicit communication

Grid Compilation Architecture
Goal: reliable performance under varying load.
[Diagram: a whole-program compiler turns the source application and software components into a configurable object program; a service negotiator and scheduler negotiate its mapping onto the Grid runtime system and libraries; a dynamic optimizer finishes the mapping at launch; a real-time performance monitor reports performance problems back to the compiler as performance feedback.]
GrADS Project (NSF NGS): Berman, Chien, Cooper, Dongarra, Foster, Gannon, Johnsson, Kennedy, Kesselman, Reed, Torczon, Wolski


Performance Contracts
- At the heart of the GrADS model
  - The fundamental mechanism for managing mapping and execution
- What are they?
  - Mappings from resources to performance
  - Mechanisms for determining when to interrupt and reschedule
- Abstract definition
  - A random variable ρ(A, I, C, t0) with a probability distribution, where A = application, I = input, C = configuration, t0 = time of initiation
  - Important statistics: lower and upper bounds (95% confidence)
- Issue: is ρ a derivative at t0? (Wolski)
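The monitoring side of a contract can be sketched in a few lines. The interface below is invented for illustration, not GrADS code: the contract carries the 95%-confidence bounds from the performance model, and the monitor compares measured progress against them to decide when to interrupt and reschedule.

```python
# Hypothetical sketch of a performance contract and its violation check.

class PerformanceContract:
    """Predicted performance interval for (application, input, config, t0)."""

    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper   # e.g. Mflop/s bounds

    def violated(self, measured_rate):
        """True when the measured rate falls outside the predicted bounds."""
        return not (self.lower <= measured_rate <= self.upper)

contract = PerformanceContract(lower=80.0, upper=120.0)
for rate in (95.0, 110.0, 40.0):                # periodic sensor readings
    if contract.violated(rate):
        print(f"contract violated at {rate}: interrupt and reschedule")
```

Note the upper bound also matters: running far above prediction can mean the model (and hence the mapping) was wrong, so rescheduling may free over-provisioned resources.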

Grid Compilation Architecture (revisited)
[The same diagram, partitioned into a Program Preparation System (compiler, software components, performance feedback) and an Execution Environment (negotiation, scheduling, Grid runtime system, libraries, dynamic optimizer, real-time performance monitor).]

Configurable Object Program
- A representation of the application supporting dynamic reconfiguration and optimization for distributed targets
- Includes
  - Program intermediate code
  - Annotations from the compiler
  - Reconfiguration strategy
  - Historical information (run profile to date)
- Reconfiguration strategies
  - Aggregation of data regions (submeshes)
  - Aggregation of tasks
  - Definition of parameters used for algorithm selection
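As a data structure, the configurable object program is essentially a bundle of code plus the metadata the runtime needs to remap it. A sketch of that bundle, with class and field names invented from the slide's list:

```python
# Illustrative sketch of a configurable object program's contents.

from dataclasses import dataclass, field

@dataclass
class ConfigurableObjectProgram:
    intermediate_code: bytes                          # program IR
    annotations: dict = field(default_factory=dict)   # compiler hints
    strategy: str = "aggregate-submeshes"             # reconfiguration plan
    run_history: list = field(default_factory=list)   # profiles so far

    def reconfigure(self, configuration):
        """Record the new target configuration and report the plan."""
        self.run_history.append(configuration)
        return self.strategy

cop = ConfigurableObjectProgram(intermediate_code=b"<IR>")
print(cop.reconfigure({"nodes": 8}))  # aggregate-submeshes
```

Keeping the run history inside the object is what lets later compilation stages (and the scheduler) learn from earlier executions.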

GrADS Testbeds
- MicroGrid (Andrew Chien)
  - Cluster of Intel PCs running standard Grid software (Globus, Nexus)
  - Permits simulation of varying network and processor loads
  - Extensive performance modeling
- MacroGrid (Carl Kesselman)
  - Collection of processors running Globus at all 8 GrADS sites
  - Permits experimentation with real applications, e.g. Cactus (Ed Seidel)

Research Strategy
- Begin modestly
  - Application experience to identify opportunities
  - Prototype reconfiguration system with performance monitoring, without dynamic optimization
  - Prototype reconfigurable library
- Move from simple to complex systems
  - Begin with heterogeneous clusters
  - Refine the reconfiguration system and performance-contract mechanism
  - Use an artificial testbed to test performance under varying conditions
- Experiment with real applications, ones whose answers someone cares about

Programming Productivity
- Challenges
  - Programming is hard
  - Professional programmers are in short supply
  - High performance will continue to be important
- One strategy: make the end user a programmer
  - Professional programmers develop components
  - Users integrate components using problem-solving environments (PSEs) and scripting languages (possibly graphical)
  - Examples: Visual Basic, Tcl/Tk, AVS, Khoros
- Compilation for high performance
  - Translate scripts and components to a common intermediate language
  - Optimize the resulting program using interprocedural methods

Telescoping Languages
[Diagram, built up over three slides: a compiler generator, which could run for hours, digests the class-1 library to produce an L1 compiler that understands library calls as primitives; a fast script translator plus the vendor compiler then turn user scripts into an optimized application.]
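The division of labor in the diagram can be made concrete with a toy sketch. Everything here is invented for illustration (the library name, the size-based specialization rule): the point is only that the expensive analysis runs once, offline, per library, while the script translator reduces each library call to a table lookup.

```python
# Toy sketch of the telescoping idea: offline specialization, fast lookup.

# Offline phase (could run for hours): variants the compiler generator
# produced for the properties it proved about each library entry point.
SPECIALIZED = {
    ("matmul", "small"): "matmul_unrolled",  # fully unrolled variant
    ("matmul", "large"): "matmul_blocked",   # cache-blocked variant
}

def translate_call(name, arg_size):
    """Script-time translation: treat library calls as primitives and
    select a precomputed variant in constant time."""
    shape = "small" if arg_size <= 16 else "large"
    return SPECIALIZED.get((name, shape), name)  # fall back to generic

print(translate_call("matmul", 8))     # matmul_unrolled
print(translate_call("matmul", 1000))  # matmul_blocked
```

This is why script compilations can stay fast even though the optimizations applied are deep: the depth was paid for when the library was processed.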

Telescoping Languages: Advantages
- Compile times can be reasonable
  - More compilation time can be spent on libraries, amortized over many uses
  - Script compilations can be fast
  - Components reused from scripts may be included in libraries
- High-level optimizations can be included
  - Based on specifications from the library designer
  - Properties often cannot be determined by compilers
  - Properties may be hidden after low-level code generation
- The user retains substantive control over language performance
  - Mature code can be built into a library and incorporated into the language

Example: HPF Revisited
[Diagram, built up over three slides: a distribution precompiler specializes the distribution library; the HPF translator, global optimizer, and code generator then compile the Fortran 90 program down to MPI code.]

    Distribute (HilbertLib): A, B
    Do i = 1, 100
      A(i) = B(i) + C
    Enddo

is translated to the block operation

    A.putBlock(1, 100, B.getBlock(1, 100) + C)
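The loop-to-block rewrite above can be mimicked in executable form. The Distributed class below is invented; only the getBlock/putBlock names come from the slide (rendered here in Python style, with 1-based inclusive bounds to match the Fortran loop):

```python
# Toy rendering of the slide's rewrite: an elementwise loop becomes a
# single block operation on a (pretend) library-distributed array.

class Distributed:
    """Stand-in for a library-distributed array (invented for this sketch)."""

    def __init__(self, data):
        self.data = list(data)

    def get_block(self, lo, hi):
        """Fetch elements lo..hi, 1-based inclusive as in Fortran."""
        return self.data[lo - 1:hi]

    def put_block(self, lo, hi, values):
        """Store values into elements lo..hi."""
        self.data[lo - 1:hi] = values

# The loop  A(i) = B(i) + C  for i = 1..100 as one block operation:
B = Distributed(range(1, 101))
A = Distributed([0] * 100)
C = 10
A.put_block(1, 100, [b + C for b in B.get_block(1, 100)])
print(A.data[0], A.data[99])  # 11 110
```

The payoff of the rewrite is that a real distribution library can move whole submeshes in one communication step instead of one element per iteration.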

Flexible Compiler Architecture
- Flexible definition of computation parameters
  - Program scheme
  - Base library sequence (l1, l2, ..., lp)
  - Subprogram source files (s1, s2, ..., sn)
  - Run history (r1, r2, ..., rk)
  - Data sets (d1, d2, ..., dm)
  - Target configuration
- Compilation = partial evaluation
  - Several compilation steps as information becomes available
- Program management
  - When to back out of previous compilation decisions due to change
  - When to invalidate certain inputs
  - Examples: a change in the library or the run history
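"Compilation = partial evaluation" has a direct analogue in code: each stage binds one more of the parameters above as it becomes known and yields a more specialized compiler. A sketch, with the code-generation step a stand-in and the library and file names invented:

```python
# Sketch of staged compilation as partial evaluation.

from functools import partial

def compile_app(library, sources, run_history, data_shape, target):
    """Stand-in for code generation once every parameter is bound."""
    return f"code[{library}/{len(sources)} files/{data_shape} on {target}]"

# Stage 1: the base library is known long before anything else.
with_lib = partial(compile_app, "hilbertlib-2.1")

# Stage 2: user sources arrive (no run history yet on a first build).
with_src = partial(with_lib, ["main.f90", "solver.f90"], run_history=None)

# Final stage: data shape and target configuration bind at load time.
print(with_src(data_shape="100x100", target="origin2000"))
# code[hilbertlib-2.1/2 files/100x100 on origin2000]
```

Backing out of a decision (say, the library changes) then corresponds to discarding the specialized stage and re-deriving it, which is exactly the program-management question the slide raises.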

Summary
- Target platforms, languages, and applications are becoming more complex
  - Platforms: parallel, heterogeneous, deep memory hierarchies
  - Applications: dynamic, irregular, extensive use of domain libraries
  - Programming: component development, system composition
- Example: the Grid as a problem-solving system
  - Seamless integration of access to remote resources
  - Programming support infrastructure is a challenge
  - Execution of applications must be adaptive and managed to reliable completion
  - Ideally, should support high-level domain-specific programming: telescoping languages
- Compiler structure will be correspondingly complex
  - Partial evaluation in stages with incremental information