What does Heterogeneity bring?


Ken Koch, Scientific Advisor, CCS-DO, LANL
LACSI 2006 Conference, October 18, 2006

Some Terminology
- Homogeneous: of the same or similar nature or kind; uniform in structure or composition throughout.
- Heterogeneous: consisting of dissimilar elements or parts; not homogeneous.
- Hybrid: combining two or more different technologies; also, a computer designed to handle both analog and digital data (an analog-digital computer).
- Metacomputer: a coupling of parallel systems that recognizes the strengths of each machine and uses it accordingly.

Chip-Level Heterogeneity
Past:
- Moore's Law has provided increased logic density
- Chips got more powerful and more capable: flops, registers, caches, out-of-order execution, pending memory ops
- Processor speed has increased due to smaller feature sizes
Present:
- Moore's Law still works, BUT
- Clock speeds have reached a plateau
- Processor complexity and instruction-level parallelism appear to have reached a plateau as well
- Multi-core processors are now the norm
- Memory and off-chip bandwidth per core is actually decreasing

Chip-Level Heterogeneity
Future:
- More and more devices per chip (4, 8, 64, 128, ...)
- Moore's Law still applies
- Large device counts could lead to arrays and tiles of interconnected devices, not just multi-core
- Multiple memory hierarchies are likely
- Special-purpose devices almost certainly will be added

System-Level Heterogeneity
You are probably using one now!
- Desktop PC or gaming laptop with a CPU and a separate graphics card
Current machines and devices:
- Cray XD1 (FPGA)
- SRC MapStation (FPGA)
- GPGPU (computing on graphics cards)
- Clearspeed SIMD PCI cards
- FPGA PCI cards
- Cell BE chip

System-Level Heterogeneity
New node-level initiatives for accelerators:
- AMD Torrenza (socket, HT and HTX level devices)
- Intel/IBM Geneseo (PCIe card level devices)
New heterogeneous/hybrid HPC systems:
- LANL Roadrunner (Opterons with Cell blades)
- Tokyo Institute of Technology TSUBAME (Opterons with Clearspeed cards)
- TeraSoft (Cell blade cluster)

Roadrunner Architecture
Final 2008 Cell-Accelerated System
[Figure: system diagram. Labels: 15 CU clusters; 144 Opteron nodes; IB switch; 4 Cell blades per node; 2nd-stage InfiniBand interconnect (8 switches). Inset "One Accelerated Node": 4 point-to-point attached Cell blades (Cell eDP), HSDC, repeater, PCIe, HTX, four AMD sockets, 1 GB/s links.]

Roadrunner Heterogeneity
An ~850 GFlop node!
[Figure: node diagram. Labels: node memory (4-socket NUMA); node (4-socket dual-core Opteron); 8-way parallel; HTX; IB; PCIe; for each Cell: PPC processor, parallel SPE processors with local memories, Cell memory; 3 more dual-Cell blades per node.]

How is programming heterogeneous systems/devices similar to how we program now?
- Parallel execution
  - Threads
  - OpenMP (possibly, just NOT at the loop level)
  - Tiling, work queues, blocking of data
- Communication/data transfers
  - Overlapped communication and computation, not just amortization
  - send/recv/put/get data communications (MPI-like)
- Vector operations (SSE, Cray, data dependencies)
- Amdahl's law still applies to any parallel speedup
- C language (lowest common denominator)
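The point that Amdahl's law still applies can be made concrete with a short calculation. The sketch below is illustrative only and is not taken from the slides: the function name `amdahl_speedup` and the sample fractions are assumptions chosen for the example. It computes the overall speedup 1 / ((1 - p) + p / s) when a fraction p of the run time is offloaded to an accelerator that runs that part s times faster.

```python
def amdahl_speedup(parallel_fraction: float, speedup_of_parallel_part: float) -> float:
    """Overall speedup when only part of the run time is accelerated.

    parallel_fraction: share of the original run time that is accelerated (0..1).
    speedup_of_parallel_part: how much faster that share runs (> 0).
    Both names are hypothetical, chosen for this illustration.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if speedup_of_parallel_part <= 0.0:
        raise ValueError("speedup_of_parallel_part must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / speedup_of_parallel_part)


if __name__ == "__main__":
    # Illustrative values only: a 20x accelerator applied to different code shares.
    for fraction in (0.50, 0.90, 0.99):
        print(f"offloaded share {fraction:.2f}: "
              f"overall speedup {amdahl_speedup(fraction, 20.0):.2f}x")
```

The code left on the host bounds the result: with half of the run time offloaded, the overall speedup stays below 2x however fast the accelerator is.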

How is programming heterogeneous systems/devices different from how we program now?
- Decompose code into 2, 3, or more cooperating parts with different functions
  - Compute, coordination, upload/download, relay messages
- Explicit user visibility of memories and data transfers
  - DMAs and overlap of communication and computation
  - Local memories of limited sizes
  - Data alignment constraints
- Worry about granularity of code parts
  - Flops vs. data bandwidth requirements
  - Blocks of code, entire routines, collections of routines implementing an algorithm, entire physics packages
- Endian conversions may be required (e.g. Opteron-Cell: the Opteron is little-endian, the Cell is big-endian)
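The combination of limited local memories and overlapped transfers is often handled with double buffering. The following is a minimal sketch of that structure, not code from Roadrunner or any Cell SDK: `fetch_block` stands in for a DMA "get", `compute_block` for a kernel running out of local store, and `LOCAL_STORE_ELEMENTS` is a made-up buffer capacity. A single background thread plays the role of the DMA engine, so at most two blocks are live at once: the one being computed and the one in flight.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical capacity of one local buffer, in elements (an assumption).
LOCAL_STORE_ELEMENTS = 4


def fetch_block(main_memory, start, size):
    """Stand-in for a DMA 'get': copy one block into a local buffer."""
    return list(main_memory[start:start + size])


def compute_block(block):
    """Stand-in kernel: touches only data already in the local buffer."""
    return sum(x * x for x in block)


def sum_of_squares_double_buffered(main_memory, block_size=LOCAL_STORE_ELEMENTS):
    """Sum of squares, computed block by block with the next transfer in flight."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    total = 0
    # One worker thread plays the role of the DMA engine.
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = None  # transfer whose data has not been consumed yet
        for start in range(0, len(main_memory), block_size):
            # Start fetching the next block before computing on the previous one.
            next_transfer = dma.submit(fetch_block, main_memory, start, block_size)
            if pending is not None:
                total += compute_block(pending.result())
            pending = next_transfer
        if pending is not None:
            # Drain the last block.
            total += compute_block(pending.result())
    return total


if __name__ == "__main__":
    print(sum_of_squares_double_buffered(list(range(10))))
```

In CPython the interpreter lock limits how much real overlap this achieves, so the sketch shows only the control structure. On hardware with an asynchronous DMA engine, the same loop shape hides transfer latency behind computation.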

Deja Vu
- CDC 6600: the first acknowledged supercomputer (mid-1960s)
- It was HETEROGENEOUS!

Deja Vu
Experienced (i.e. old) programmers may actually have some knowledge advantages!
- Teach new dogs old tricks
Techniques from the past with similarities to heterogeneous programming ideas:
- CDC 6600/7600 Peripheral Processors (10-way I/O coprocessor)
- CDC 6600 ECS and 7600 LCM/SCM (explicit memory hierarchies)
- Out-of-core buffered I/O (overlapped data streaming)
- Fortran overlays (program image size)
- Cray vector techniques for non-obvious kernels
- Floating Point Systems array processors
- Analog-digital computer (hybrid programming)
- IBM JCL and DD cards (just kidding)