Multicore Architecture and Hybrid Programming

Multicore Architecture and Hybrid Programming Rebecca Hartman-Baker Oak Ridge National Laboratory hartmanbakrj@ornl.gov © 2004-2009 Rebecca Hartman-Baker. Reproduction permitted for non-commercial, educational use only.

Outline I. Computer Architecture II. Hybrid Programming 2

I. COMPUTER ARCHITECTURE Mare Nostrum, installed in Chapel Torre Girona, Barcelona Supercomputing Center. By courtesy of Barcelona Supercomputing Center -- http://www.bsc.es/ 3

I. Computer Architecture 101 Source: http://en.wikipedia.org/wiki/image:computer_abstraction_layers.svg (author unknown) 4

Computer Architecture 101 Processors Memory Memory Hierarchy TLB Interconnects Glossary 5

Computer Architecture 101: Processors CPU performs 4 basic operations: Fetch Decode Execute Writeback Source: http://en.wikipedia.org/wiki/image:cpu_block_diagram.svg 6

CPU Operations Fetch Retrieve instruction from program memory Location in memory tracked by program counter (PC) Instruction retrieval sped up by caching and pipelining Decode Interpret instruction by breaking into meaningful parts, e.g., opcode, operands Execute Connect to portions of CPU to perform operation, e.g., connect to arithmetic logic unit (ALU) to perform addition Writeback Write result of execution to memory 7
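To make the four steps concrete, here is a toy sketch in C (an invented three-instruction "ISA", not any real processor) of the fetch-decode-execute-writeback cycle:

#include <stdio.h>
#include <stdint.h>

enum { OP_HALT, OP_LOADI, OP_ADD };            /* tiny made-up opcode set */

int main(void)
{
    /* "program memory": each instruction is {opcode, dest, src1, src2/imm} */
    uint8_t program[][4] = {
        { OP_LOADI, 0, 0, 7 },   /* r0 = 7       */
        { OP_LOADI, 1, 0, 5 },   /* r1 = 5       */
        { OP_ADD,   2, 0, 1 },   /* r2 = r0 + r1 */
        { OP_HALT,  0, 0, 0 },
    };
    int reg[4] = {0};
    int pc = 0;                                   /* program counter */

    for (;;) {
        uint8_t *instr = program[pc++];           /* fetch */
        uint8_t op = instr[0], d = instr[1],      /* decode */
                a = instr[2], b = instr[3];
        if (op == OP_HALT) break;
        int result = (op == OP_LOADI) ? b         /* execute */
                                      : reg[a] + reg[b];
        reg[d] = result;                          /* writeback */
    }
    printf("r2 = %d\n", reg[2]);                  /* prints r2 = 12 */
    return 0;
}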

Computer Architecture 101: Memory Hierarchy of memory Fast-access memory: small (expensive) Slower-access memory: large (less expensive) Cache: fast-access memory where frequently used data stored Reduces average access time Works because typically, applications have locality of reference Cache in XT4/5 also hierarchical TLB: Translation lookaside buffer Used by memory management hardware to improve speed of virtual address translation 8

Cache Associativity Where to look in cache memory for copy of main memory location? Direct-Mapped/1-way Associative: only one location in cache for each main memory location Fully Associative: can be stored anywhere in cache 2-way Associative: two possible locations in cache N-way Associative: N possible locations in cache Doubling associativity (1 → 2, 2 → 4) has same effect on hit rate as doubling cache size Increasing beyond 4 does not substantially improve hit rate; higher associativity done for other reasons 9
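A small illustrative sketch in C (hypothetical cache parameters, not the XT4/5 values) of how associativity determines where a main-memory address may be placed:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical parameters for illustration only: 64 KB cache, 64-byte lines */
#define CACHE_SIZE (64 * 1024)
#define LINE_SIZE  64

/* Which cache set a main-memory address maps to in an N-way set-associative
   cache: the block may live in any of the N ways of that single set. */
unsigned set_index(uint64_t addr, unsigned ways)
{
    unsigned num_sets = CACHE_SIZE / (LINE_SIZE * ways);
    return (unsigned)((addr / LINE_SIZE) % num_sets);
}

int main(void)
{
    uint64_t addr = 0x3FA40;
    /* Direct-mapped (1-way): exactly one candidate location.
       Higher associativity: fewer sets, but more candidate ways per set. */
    printf("1-way set %u, 2-way set %u, 4-way set %u\n",
           set_index(addr, 1), set_index(addr, 2), set_index(addr, 4));
    return 0;
}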

Cache Associativity: Illustration [Diagram: main memory locations 0-9 mapping into a 4-entry cache, shown for a direct-mapped cache and a 2-way associative cache] 10

Computer Architecture 101: Interconnects Connect nodes of machine to one another Methods of interconnecting: fiber + switches and routers; directly connecting Topology: torus, hypercube, butterfly, tree [Diagrams: 2-D torus, hypercube] 11

Computer Architecture 101: Glossary SSE (Streaming SIMD Extensions): instruction set extension to x86 architecture, allowing the CPU to perform the same operation on multiple data items with a single instruction DDR2 (Double Data Rate 2): synchronous dynamic random access memory, operates twice as fast as DDR1 DDR2-xyz: performs xyz million data transfers/second Dcache: cache devoted to data storage Icache: cache devoted to instruction storage STREAM: data flow 12

Facts and Figures
Kraken (Cray XT5): 16,512 compute nodes; 2.6 GHz AMD Istanbul dual six-core per node; 16 GB/node DDR2-800; Cray SeaStar 2, 3-D torus; 1.03 PF peak
Athena (Cray XT4): 4,512 compute nodes; 2.3 GHz AMD Opteron quad-core per node; 8 GB/node DDR2-667/DDR2-800; Cray SeaStar 2, 3-D torus; 166 TF peak
Ranger (Sun Constellation): 3,936 compute nodes; 2.3 GHz AMD Opteron quad-core x 4 per node; 32 GB/node DDR2-667; Infiniband switch, fat tree; 579 TF peak
13

XT4/5 Architecture Hardware Processors Memory Memory Hierarchy TLB System architecture Interconnects Software Operating System Integration CNL vs Linux 14

Quad-Core Architecture 15


Quad Core Cache Hierarchy [Diagram: Cores 1-4 each have their own cache control, 64 KB L1, and 512 KB L2; a 2 MB L3 is shared by all four cores] 17

L1 Cache Dedicated per core; 2-way associativity; 8 banks; 2 x 128-bit loads/cycle [Diagram: 64 KB L1 per core] 18

L2 Cache Dedicated per core; 16-way associativity [Diagram: 512 KB L2 per core] 19

L3 Cache Shared; sharing-aware replacement policy [Diagram: 2 MB L3 shared by all cores] 20

Cray XT4 Architecture XT4 is 4th-generation Cray MPP Service nodes run full Linux Compute nodes run Compute Node Linux (CNL) 21

Cray XT4 Architecture 2- or 4-way SMP > 35 Gflops/node Up to 8 GB/node OpenMP Support within Socket 22

Cray SeaStar2 Architecture Router connects to 6 neighbors in 3-D torus Peak bidirectional BW 7.6 GB/s; sustained 6 GB/s Reliable link protocol with error correction and retransmission Communications Engine: DMA Engine + PPC 440; together they perform messaging tasks so the AMD processor can focus on computing DMA Engine and OS together minimize latency with a path directly from app to communication hardware (without traps and interrupts) [Diagram: 6-Port Router, DMA Engine, HyperTransport Interface, Memory, Blade Control Processor Interface, PowerPC 440 Processor] 23

Cray XT5 Architecture 8-way SMP > 70 Gflops/node Up to 32 GB shared memory/node OpenMP support 24

Software Architecture CLE microkernel on compute nodes Full-featured Linux on service nodes Software architecture eliminates jitter and enables reproducible runtimes Even large machines can reboot in < 30 mins, including filesystem 25

Software Architecture Compute PE (processing element): used for computation only; users cannot directly access compute nodes Service PEs: run full Linux Login: users access these nodes to develop code and submit jobs, function like normal Linux box Network: provide high-speed connectivity with other systems System: run global system services such as system database I/O: provide connectivity to GPFS (global parallel file system) 26

CLE vs Linux CLE (Cray Linux Environment) contains a subset of Linux features Minimizes system overhead because there is little between the application and the bare hardware 27

Resources: Computer Architecture 101 Wikipedia articles on computer architecture: http://en.wikipedia.org/wiki/computer_architecture, http://en.wikipedia.org/wiki/cpu, http://en.wikipedia.org/wiki/cpu_cache, http://en.wikipedia.org/wiki/ddr2_sdram, http://en.wikipedia.org/wiki/microarchitecture, http://en.wikipedia.org/wiki/sse2, http://en.wikipedia.org/wiki/streaming_simd_extensions Heath, Michael T. (2007) Notes for CS 554, Parallel Numerical Algorithms, http://www.cse.uiuc.edu/courses/cs554/notes/index.html 28

Resources: Cray XT4 Architecture NCCS machines Jaguar: http://www.nccs.gov/computing-resources/jaguar/ Eugene: http://www.nccs.gov/computing-resources/eugene/ Jaguarpf: http://www.nccs.gov/jaguar/ AMD architecture Waldecker, Brian (2008) Quad Core AMD Opteron Processor Overview, available at http://www.nccs.gov/wp-content/uploads/2008/04/amd_craywkshp_apr2008.pdf Larkin, Jeff (2008) Optimizations for the AMD Core, available at http://www.nccs.gov/wp-content/uploads/2008/04/optimization1.pdf XT4 Architecture Hartman-Baker, Rebecca (2008) XT4 Architecture and Software, available at http://www.nccs.gov/wp-content/training/2008_users_meeting/4-17-08/using-xt44-17-08.pdf 29

II. HYBRID PROGRAMMING Hybrid Car. Source: http://static.howstuffworks.com/gif/hybrid-car-hyper.jpg 30

II. Hybrid Programming Motivation Considerations MPI threading support Designing hybrid algorithms Examples 31

Motivation Multicore architectures are here to stay Macro scale: distributed memory architecture, suitable for MPI Micro scale: each node contains multiple cores and shared memory, suitable for OpenMP Obvious solution: use MPI between nodes, and OpenMP within nodes Hybrid programming model 32
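A minimal sketch of the hybrid model just described, assuming an MPI library and an OpenMP-capable C compiler (for example, built with mpicc -fopenmp); each MPI process covers a node and spawns one OpenMP thread per core:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, provided;
    /* Funneled support suffices here: only the master thread calls MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* every core on the node reports its place in the two-level hierarchy */
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}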

Considerations Sounds great, Rebecca, but is hybrid programming always better? No, not always Especially if poorly programmed Depends also on suitability of architecture Think of the accelerator model: in an omp parallel region, use the power of the multiple cores; in a serial region, use only 1 processor If your code can exploit threaded parallelism a lot, then try hybrid programming 33

Considerations: Scalability As number of MPI processes increases, Amdahl's Law hurts more and more Serial portion of code limits scalability A 0.1% serial code cannot have more than 1000x speedup, no matter the number of processes May not matter on small process counts, but 1000x speedup vs. serial on 224,256-core Jaguar? No! Hybrid programming can delay the inevitable On Jaguar, see speedup gains on up to 12,000 vs. 1000 cores (not so embarrassing) 34
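The 1000x figure above is just Amdahl's Law with serial fraction f = 0.001:

Speedup(N) = 1 / (f + (1 - f)/N) <= 1/f = 1/0.001 = 1000

so no process count, not even Jaguar's 224,256 cores, can push the speedup past 1000x; only shrinking the serial fraction itself raises that ceiling.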

Considerations: Changing to Hybrid Model MPI → Hybrid: lots of fine-grained parallelism intrinsic in the problem; data that could be shared rather than reproduced on each process (e.g., constants, ghost cells) OpenMP → Hybrid: want scalability to larger numbers of processors; coarse-grained tasks that could be distributed across the machine 35

Considerations: Developing Hybrid Algorithms Hybrid parallel programming model Are communication and computation discrete phases of algorithm? Can/do communication and computation overlap? Communication between threads Communicate only outside of parallel regions Assign a manager thread responsible for inter-process communication Let some threads perform inter-process communication Let all threads communicate with other processes 36
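As a sketch of the "manager thread" option above (the 1-D ring layout and array sizes are illustrative, not from the slides), the master thread exchanges a boundary value with its neighbors while the other threads compute on interior data:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[1000], ghost = 0.0, interior_sum = 0.0;
    for (int i = 0; i < 1000; i++) local[i] = rank + 1.0;

    #pragma omp parallel
    {
        /* master exchanges a boundary value with the neighboring ranks ... */
        #pragma omp master
        MPI_Sendrecv(&local[999], 1, MPI_DOUBLE, (rank + 1) % size, 0,
                     &ghost,      1, MPI_DOUBLE, (rank + size - 1) % size, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... while the other threads sum the interior; nowait lets them
           proceed without waiting for the master's communication */
        #pragma omp for reduction(+ : interior_sum) nowait
        for (int i = 0; i < 999; i++)
            interior_sum += local[i];

        #pragma omp barrier   /* ghost value and interior sum both ready here */
    }

    printf("rank %d: interior %.0f, ghost %.0f\n", rank, interior_sum, ghost);
    MPI_Finalize();
    return 0;
}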

MPI Threading Support MPI-2 standard defines four threading support levels (0) MPI_THREAD_SINGLE only one thread allowed (1) MPI_THREAD_FUNNELED master thread is only thread permitted to make MPI calls (2) MPI_THREAD_SERIALIZED all threads can make MPI calls, but only one at a time (3) MPI_THREAD_MULTIPLE no restrictions (0.5) MPI calls not permitted inside parallel regions (returns MPI_THREAD_SINGLE) this is MPI-1 37

What Threading Model Does My Machine Support?
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  printf("supports level %d of %d %d %d %d\n", provided,
         MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED,
         MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE);
  MPI_Finalize();
  return 0;
}
38

MPI_Init_thread MPI_Init_thread(int *argc, char ***argv, int required, int *provided) Use this instead of MPI_Init() required: the level of thread support you want provided: the level of thread support provided by the implementation (hopefully = required, but if not available, returns lowest level > required; failing that, largest level < required) Using MPI_Init() is equivalent to required = MPI_THREAD_SINGLE MPI_Finalize() should be called by the same thread that called MPI_Init_thread() 39

Other Useful MPI Functions MPI_Is_thread_main(int *flag) Thread calls this to determine whether it is main thread MPI_Query_thread(int *provided) Thread calls to query level of thread support 40
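A brief sketch (illustrative, not from the slides) of how a thread can use these two calls to decide whether it may safely make MPI calls when the library was initialized with MPI_THREAD_FUNNELED:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    #pragma omp parallel
    {
        int is_main, level;
        MPI_Is_thread_main(&is_main);   /* 1 only for the thread that called MPI_Init_thread */
        MPI_Query_thread(&level);       /* the support level actually granted */

        if (level >= MPI_THREAD_FUNNELED && is_main) {
            /* only this thread funnels MPI traffic */
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            printf("rank %d: thread %d makes the MPI calls\n",
                   rank, omp_get_thread_num());
        }
    }

    MPI_Finalize();
    return 0;
}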

Supported Threading Models: Single
Use single pragma
#pragma omp parallel
{
  #pragma omp barrier
  #pragma omp single
  {
    MPI_Xyz( );
  }
  #pragma omp barrier
}
41

Supported Threading Models: Funneling
Cray XT4/5 supports funneling (probably Ranger too?)
Use master pragma
#pragma omp parallel
{
  #pragma omp barrier
  #pragma omp master
  {
    MPI_Xyz( );
  }
  #pragma omp barrier
}
42

What Threading Model Should I Use? Depends on the application!
Single: advantage, portable (every MPI implementation supports this); disadvantage, limited flexibility
Funneled: advantage, simpler to program; disadvantage, manager thread could get overloaded
Serialized: advantage, freedom to communicate; disadvantage, risk of too much cross-communication
Multiple: advantage, completely thread safe; disadvantage, limited availability
43

Designing Hybrid Algorithms Just because you can communicate thread-to-thread, doesn't mean you should Tradeoff between lumping messages together and sending individual messages Lumping messages together: one big message, one overhead Sending individual messages: less wait time (?) Programmability: performance will be great, when you finally get it working! 44

Example: All-to-All Communication 4 quad-core nodes, 16 MPI processes 16 (# procs) x 15 (# msgs/proc) = 240 individual messages to be sent Communication volume: 16 x 15 x 1 = 240 4 quad-core nodes, 4 MPI processes with 4 threads each 4 (# procs) x 3 (# msgs/proc) = 12 agglomerated messages (4 x original length) to be sent Communication volume: 4 x 3 x 4 = 48, a factor of 5 less bandwidth 45

Example: All-to-All Communication 4 quad-core nodes communicating 1 KB updates
16 MPI processes: # messages = 16 procs x 15 msgs/proc = 240 messages; message size = 1 KB; bandwidth required = 240 messages x 1 KB = 240 KB
4 MPI processes + 4 OpenMP threads per MPI process: # messages = 4 procs x 3 msgs/proc = 12 messages; message size = 4 KB; bandwidth required = 12 messages x 4 KB = 48 KB
Factor of 5 difference! 46
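An illustrative sketch of the agglomerated variant (one MPI process per node with 4 threads, matching the example; buffer handling details are assumptions): each node's four threads pack their 1 KB updates into one shared buffer, and a single MPI_Alltoall then moves 4 KB blocks between the four processes instead of 240 separate 1 KB messages:

#include <mpi.h>
#include <omp.h>
#include <stdlib.h>
#include <string.h>

#define KB 1024
#define THREADS 4

int main(int argc, char **argv)
{
    int provided, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* 4 processes, one per node */

    /* one 4 KB block (4 threads x 1 KB) destined for each process */
    char *sendbuf = malloc((size_t)nprocs * THREADS * KB);
    char *recvbuf = malloc((size_t)nprocs * THREADS * KB);

    #pragma omp parallel num_threads(THREADS)
    {
        int t = omp_get_thread_num();
        for (int p = 0; p < nprocs; p++) {
            /* each thread writes its 1 KB update into the block for process p */
            memset(sendbuf + ((size_t)p * THREADS + t) * KB, t, KB);
        }
    }   /* implicit barrier: all updates are in place before communicating */

    /* one agglomerated exchange instead of per-thread messages */
    MPI_Alltoall(sendbuf, THREADS * KB, MPI_CHAR,
                 recvbuf, THREADS * KB, MPI_CHAR, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}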

Example: Mesh Partitioning Regular mesh of finite elements When we partition mesh, need to communicate information about (domain) adjacent cells to (computationally) remote neighbors 47

Example: Mesh Partitioning 48

Example: Mesh Partitioning Communication Patterns [Diagram: communication patterns among threads 0-2 on Processor 1, Processor 2, and Processor 3] 49

Bibliography/Resources von Alfthan, Sebastian, Introduction to Hybrid Programming, PRACE Summer School 2008, URL: http://www.prace-project.eu/hpc-training/prace-summerschool/hybridprogramming.pdf Ye, Helen and Chris Ding, Hybrid OpenMP and MPI Programming and Tuning, Lawrence Berkeley National Laboratory, http://www.nersc.gov/nusers/services/training/classes/nug/Jun04/NUG2004_yhe_hybrid.ppt Zollweg, John, Hybrid Programming with OpenMP and MPI, Cornell University Center for Advanced Computing, http://www.cac.cornell.edu/education/training/intro/Hybrid-090529.pdf MPI-2.0 Standard, Section 8.7, MPI and Threads, http://www.mpi-forum.org/docs/mpi-20-html/node162.htm#node162 50