Programming Models for Supercomputing in the Era of Multicore

Programming Models for Supercomputing in the Era of Multicore
Marc Snir

MULTI-CORE CHALLENGES

Moore's Law Reinterpreted
- Number of cores per chip doubles every two years, while clock speed decreases.
- Need to handle systems with millions of concurrent threads, and contemplate with horror the possibility of systems with billions of threads.
- Need to emphasize scalability, not best performance for a fixed number of cores.
- Need to be able to easily replace inter-chip parallelism with intra-chip parallelism.
- A homogeneous programming model (e.g., MPI all around) is preferable to a heterogeneous programming model (e.g., MPI+OpenMP).

Memory Wall: a Persistent Problem
- On-chip CPU performance increases at ~60% CAGR; memory bandwidth increases at ~25%-35% CAGR; memory latency decreases at only ~5%-7% CAGR.
- Most of the chip's area and energy budget is spent on storing and moving bits (temporal and spatial communication).
- Locality and communication management are major algorithmic issues, hence they need to be exposed in the programming language.
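A quick back-of-the-envelope sketch of why this gap compounds into a wall: the ~60% and ~30% figures are taken from the bullet above, while the ten-year horizon and the C++ rendering are illustrative choices, not part of the original slide.

```cpp
// Illustrative only: compound the growth rates quoted above to see how fast
// the CPU/memory gap widens over time.
#include <cmath>
#include <cstdio>

int main() {
    const double cpu_growth = 1.60;   // ~60% per year (slide's figure)
    const double mem_growth = 1.30;   // midpoint of the ~25%-35% range
    for (int year = 0; year <= 10; year += 5) {
        double gap = std::pow(cpu_growth / mem_growth, year);
        std::printf("after %2d years, the CPU:memory ratio has grown by %.1fx\n",
                    year, gap);
    }
    return 0;
}
```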

Reliability and Variance
- Hypothesis: MTTF per chip does not decrease; hardware is used to mask errors (redundant hardware, redundant computations, light-weight checkpointing). Programmers do not have to handle faults.
- Programmers do have to handle variance in execution time:
  - Variance due to system size (square-root growth)
  - Variance due to error correction
  - Variance due to power management
  - Manufacturing variance
  - Heterogeneous architectures
  - System noise (jitter): not much of a problem, really
  - Application variance: adaptive codes (e.g., AMR), multiscale codes
- The loosely synchronous SPMD model with one thread per core breaks down; we need virtualization.

NEW OPPORTUNITIES

Ubiquitous Parallelism
- Effective leverage of parallelism is essential to the business model of Intel and Microsoft.
- Focus on turnaround, not throughput.
- Need to enable a large number of applications, hence need to develop a parallel application development environment.
- However: a four orders of magnitude gap; a focus on ease of programming and scalability, not peak performance; little interest in multi-chip systems.
- Q: how can HPC leverage the client parallel SW stack?

Expected Trickle-Up Technologies
- New languages (it is much easier to stretch well-supported parallel languages than to have DARPA create a market for new HPC languages).
- Significantly improved parallelizing compilers, parallel runtimes, and parallel IDEs (the bottleneck has been more a lack of market than a lack of good research ideas).
- New emphasis on deterministic (repeatable) parallel computation models: focus on producer-consumer synchronization, not on mutual exclusion; serial semantics with a parallel performance model; parallel algorithms are designed by programmers, not inferred by compilers.
- Every computer scientist will be educated to think parallel.

Universal Parallel Computing Research Centers
Intel and Microsoft are partnering with academia to create two Universal Parallel Computing Research Centers, located at UC Berkeley and UIUC.
Goal: make parallel programming synonymous with programming.

UPCRC One-Slide Summary
- Parallel programming can be child's play (e.g., Squeak Etoys).
- No more Swiss army knives: we need, and can afford, multiple solutions.
- Simplicity is hard: simpler languages + more complex architectures = a feast for compiler developers.
- Need more abstraction layers that abstract both semantics and QoS. What is a QoS-preserving mapping?
- What hooks can HW provide to facilitate programming? Synchronization primitives, debug/performance support.
- Performance will enable new client applications: an intelligent PDA will eventually need the compute power of a human brain.

KEY TECHNOLOGIES

Task Virtualization
- Multiple logical tasks are scheduled on each physical core; tasks are scheduled non-preemptively; task migration is supported.
- Hides variance and communication latency, helps with scalability, is needed for modularity, and improves performance.
- Examples: AMPI, Charm++ (Kale, UIUC), TBB (Intel).
- Supported by hardware and/or the run-time; can be implemented below MPI or PGAS languages.
- Two styles (see the sketch below):
  - A varying, user-controlled number of tasks (AMPI), with locality achieved by the load balancer.
  - A recursive (hierarchical) range splitter (TBB), with locality achieved implicitly.
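To make the second style concrete, here is a minimal TBB sketch (array size and grain size are arbitrary choices, not recommendations): the blocked range is split recursively into subranges, and the runtime schedules the resulting logical tasks onto the physical cores.

```cpp
// Minimal TBB sketch of recursive range splitting: one loop yields many
// logical tasks, which the runtime maps onto the physical cores.
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> a(1 << 20, 1.0);

    // blocked_range is split recursively until subranges reach the grain size;
    // each subrange becomes a task that TBB schedules non-preemptively.
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, a.size(), /*grain size*/ 4096),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                a[i] = 2.0 * a[i] + 1.0;
        });

    std::printf("a[0] = %g\n", a[0]);
    return 0;
}
```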

Compiled Communication
- Replace a message-passing library (e.g., MPI) with compiler-generated communication (e.g., PGAS languages).
- Avoids the SW overhead and memory copies of library calls.
- Maps directly to platform-specific HW communication mechanisms, in a portable manner; transparently ports from shared-memory HW to distributed-memory HW.
- Enables compiler optimization of communication, across multiple communications.

Shared Memory Models
- Global name space: a variable's name does not change when its location changes, which simplifies programming. Not copying, but caching.
- Well-explored alternatives: pure HW caching (coherent shared memory) and pure SW caching (compiled, e.g., for PGAS languages). Need to better explore HW-assisted caching (fast, HW-supported cache access; possibly slow cache update).
- Probably need some user control of caching (logical cache line definition); probably don't need user control of the home location (as done in PGAS languages).
- Conflicting accesses must be synchronized (current Java semantics, and soon-to-be C++ semantics).
- Races must be detected and must generate exceptions; this can be done with some variants of thread-level speculation.
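To make the "conflicting accesses must be synchronized" point concrete, here is a small C++11 sketch (thread and iteration counts are arbitrary): with a plain int counter the concurrent increments would be a data race and hence undefined behavior, while std::atomic gives the same conflicting updates defined semantics.

```cpp
// Conflicting accesses under the C++11 memory model: concurrent updates to a
// plain int would be a data race (undefined behavior); std::atomic makes the
// same conflicting accesses well defined.
#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const int nthreads = 4, iters = 100000;
    std::atomic<int> counter{0};   // a plain "int counter" here would race

    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);  // synchronized update
        });
    for (auto& w : workers) w.join();

    std::printf("counter = %d (expected %d)\n", counter.load(), nthreads * iters);
    return 0;
}
```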

Synchronization Primitives
- Frequently needed: deterministic synchronization.
  - Producer-consumer: barrier, disciplined use of full/empty bits (single writer).
  - Accumulate (reduction).
- Rarely needed: nondeterministic synchronization.
  - Mutual exclusion, atomic sections (transactions).
- Note: transactional memory HW is good for lightly contested atomic sections; it is not efficient for producer-consumer synchronization or for reductions.
- Simple accumulates are best done on the memory side (if a core contributes few values): significant reduction of memory bus traffic, and requests can be combined to avoid congestion.
[Diagram: "Bad": the CPU issues a load (id, opcode, address), receives the data, and stores the result back to memory. "Good": the CPU sends a single request (id, opcode, address, data) and the accumulate is performed at the memory.]
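As a minimal illustration of the "frequently needed" category, here is a sketch (variable names and values are arbitrary) that uses an atomic flag as a software stand-in for a full/empty bit for producer-consumer synchronization, plus a simple accumulate via fetch_add.

```cpp
// Producer-consumer synchronization via an atomic "full" flag (a software
// stand-in for a full/empty bit), plus a simple accumulate with fetch_add.
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<bool> full{false};   // empty until the producer fills the slot
int slot = 0;                    // single-writer data slot
std::atomic<long> sum{0};        // reduction target

int main() {
    std::thread producer([] {
        slot = 42;                                    // write the data
        full.store(true, std::memory_order_release);  // then publish it
    });

    std::thread consumer([] {
        while (!full.load(std::memory_order_acquire)) { /* spin until full */ }
        sum.fetch_add(slot);                          // accumulate the value
    });

    producer.join();
    consumer.join();
    std::printf("sum = %ld\n", sum.load());
    return 0;
}
```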

Parallel Patterns Challenge: EASY TO EXPLAIN PARALLEL ALGORITHM = SIMPLE CODE

Matrix Product: A x B = C
[Diagram: the i x j x k computation cube, with decompositions labeled Generic, Data Parallel, and Control Parallel.]
- Need to easily express the 3D computation domain: one 3D iterator, not a triple nested loop.
- Need to easily express the partition into subcubes.
- Automatically generate the parallel reduction, avoiding the allocation of n^3 variables. (A sketch of this iteration space follows.)
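As a rough sketch of how close current libraries get to this, the TBB fragment below (an illustration, not the 3D iterator the slide calls for) expresses the i,j plane as one blocked 2D range and keeps k serial inside each block; a true subcube decomposition would also split k and would then need exactly the automatically generated reduction the slide asks for.

```cpp
// Sketch of the matrix-product iteration space as a blocked 2D range over
// (i, j); k stays serial inside each block, so no reduction variable per
// (i, j, k) point is needed and no race on C can occur.
#include <tbb/parallel_for.h>
#include <tbb/blocked_range2d.h>
#include <vector>

void matmul(const std::vector<double>& A, const std::vector<double>& B,
            std::vector<double>& C, size_t n) {
    tbb::parallel_for(
        tbb::blocked_range2d<size_t>(0, n, 0, n),
        [&](const tbb::blocked_range2d<size_t>& r) {
            for (size_t i = r.rows().begin(); i != r.rows().end(); ++i)
                for (size_t j = r.cols().begin(); j != r.cols().end(); ++j) {
                    double acc = 0.0;
                    for (size_t k = 0; k < n; ++k)   // serial k: no race on C
                        acc += A[i * n + k] * B[k * n + j];
                    C[i * n + j] = acc;
                }
        });
}

int main() {
    const size_t n = 256;
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
    matmul(A, B, C, n);
    return C[0] == static_cast<double>(n) ? 0 : 1;   // every entry should equal n
}
```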

Less Trivial Example: NAMD
- Patches object: the atoms in a cell; changes each iteration (or every few iterations).
- Computes object: pairs of atoms from neighboring cells (used to compute forces); avoid allocating a variable for each pair (Kale).
[Diagram: patches and the computes that connect them.]
Can one go from such a declarative description to code?

NAMD Communication
[Diagram: per-step communication flow: patch integration, then multicast and point-to-point communication to the PME, bonded, and non-bonded computes, followed by reductions and point-to-point communication back to patch integration.]
Can one have a composition language that expresses the above diagram in a natural way?

Algorithmic Changes, for Scale
- If the cell size is smaller than the cutoff radius, then the computes object should consist of pairs of atoms from a subset of the cells within the cutoff radius [Snir/Shaw].
- Can choose a 1D FFT or a 2D FFT.
- Can one delay binding until the problem and machine parameters are known? (A sketch of such late binding follows.)
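The delayed-binding question can be sketched as run-time algorithm selection; everything in the fragment below (the Params fields, the two FFT variants, and the selection rule) is a hypothetical placeholder rather than NAMD's actual logic.

```cpp
// Hypothetical sketch of delayed binding: pick an algorithm variant only once
// the problem and machine parameters are known at run time.
#include <functional>
#include <cstdio>

struct Params { long grid_points; int nodes; };   // placeholder parameters

void fft_1d_decomposition(const Params& p) {
    std::printf("using 1D decomposition on %d nodes\n", p.nodes);
}
void fft_2d_decomposition(const Params& p) {
    std::printf("using 2D decomposition on %d nodes\n", p.nodes);
}

// Bind the variant at run time, when the parameters are finally available.
std::function<void(const Params&)> choose_fft(const Params& p) {
    return (p.nodes > p.grid_points /* placeholder rule, not NAMD's heuristic */)
               ? fft_2d_decomposition
               : fft_1d_decomposition;
}

int main() {
    Params p{/*grid_points=*/128, /*nodes=*/1024};
    auto fft = choose_fft(p);   // late binding: decided here, not at compile time
    fft(p);
    return 0;
}
```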

THE SOLUTION IS MULTITHREADED

Multiscale Programming
- Multiple languages (C, C++, Fortran, OpenMP, Python, CAF, UPC) and libraries.
- Multiple levels of code generation & tuning:
  - Domain-specific code generators & high-level optimizers (Spiral: Puschel et al.; quantum chemistry: Sadayappan et al.)
  - Library autotuning, tuning pattern selection, algorithm selection
  - Refactorings and source-to-source transformations
  - Static compilation, template expansion
  - Run-time compilation, continuous optimization
- Do not think "parallel language"; think of a programming environment that synergistically integrates all of these levels.

Multiscale Compilation
[Diagram: a multiscale compilation toolchain spanning domain-specific code, implicit parallel code, and explicit parallel code; DSE generation tools, correctness tools, high-level and low-level code objects; an enhanced intermediate representation carrying user annotations, refactoring logs, QoS annotations, and system annotations; deep compiler analysis; DSE library generation tools producing tunable libraries; and link-time/run-time adaptation by the run-time/OS/HW.]

Summary
- The frog is boiling: tuning code is ridiculously hard and is getting harder.
- We have the power to change: we can build much better parallel programming environments; the problem is economics, not technology.
- There is no silver bullet: not one technology, but a good integration of many.
- I ran out of platitudes. Time for