Sequoia (CS315B Lecture 9)

Halfway!
- First half of the course is over: overview/philosophy of Regent
- Now start the second half: lectures on other programming models, comparing/contrasting with Regent
- Start with an easy one today: Sequoia

A Point of View
- Parallelism is relatively easy: it is not hard to find lots of parallelism in many apps
- The hard part is communication: compute is easy, but it is more difficult to ensure data is where it is needed

Sequoia
- Language: stream programming for machines with deep memory hierarchies
- Idea: expose an abstract memory hierarchy to the programmer
- Implementation: benchmarks run well on many multi-level machines (Cell, PCs, clusters of PCs, a cluster of PS3s, also + disk, GPUs)

Locality
- Structure algorithms as collections of independent and locality-cognizant computations with well-defined working sets
- This structuring may be done at any scale:
  - Keep temporaries in registers
  - Cache/scratchpad blocking
  - Message passing on a cluster
  - Out-of-core algorithms
- Efficient programs exhibit this structure at many scales

Sequoia's Goals
- Facilitate development of locality-aware programs that remain portable across machines
- Provide constructs that can be implemented efficiently:
  - Place computation and data in the machine
  - Explicit parallelism and communication
  - Large bulk transfers

Locality in Programming Languages
- Local (private) vs. global (remote) addresses: MPI (via message passing); UPC, Titanium
- Domain distributions map array elements to locations: HPF, UPC, Titanium, ZPL; X10, Fortress, Chapel
- These approaches focus on communication between nodes and ignore the hierarchy within a node

Locality in Programming Languages (cont.)
- Streams and kernels: stream data off chip, kernel data on chip
  - StreamC/KernelC, Brook, GPU shading (Cg, HLSL)
  - Architecture specific; only represent two levels

Blocked Matrix Multiplication

    void matmul_L1( int M, int N, int T,
                    float A[M][T], float B[T][N], float C[M][N] )
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < T; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

- matmul_L2 (same signature): performs a series of L1 matrix multiplications; in the slide's diagram, one 256x256 L2 block issues 512 calls to matmul_L1
- matmul (same signature): performs a series of L2 matrix multiplications over the large input matrices (a C sketch of the L2 level follows below)
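The slides describe matmul_L2 and matmul only in words. Here is a minimal C sketch (not from the slides) of the L2 level, assuming row-major 256x256 L2 blocks tiled into 32x32 L1 blocks; the 32x32 choice is an assumption made to match the 512 calls in the diagram (8 x 8 x 8 = 512):

    #define L2_BLK 256
    #define L1_BLK 32

    /* L1 kernel on one block, with explicit leading dimensions (row-major). */
    static void matmul_L1_blk(int M, int N, int T,
                              const float *A, int lda,
                              const float *B, int ldb,
                              float *C, int ldc)
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < T; k++)
                    C[i*ldc + j] += A[i*lda + k] * B[k*ldb + j];
    }

    /* L2 level: C (256x256) += A (256x256) * B (256x256), performed as
     * 8*8*8 = 512 L1-sized multiplications. */
    static void matmul_L2(const float *A, const float *B, float *C)
    {
        for (int i = 0; i < L2_BLK; i += L1_BLK)
            for (int j = 0; j < L2_BLK; j += L1_BLK)
                for (int k = 0; k < L2_BLK; k += L1_BLK)
                    matmul_L1_blk(L1_BLK, L1_BLK, L1_BLK,
                                  &A[i*L2_BLK + k], L2_BLK,
                                  &B[k*L2_BLK + j], L2_BLK,
                                  &C[i*L2_BLK + j], L2_BLK);
    }

The top-level matmul applies the same pattern one level up: carve the large matrices into 256x256 blocks and call matmul_L2 on each.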

Hierarchical Memory
- Abstract machines as trees of memories
- Similar to: the Parallel Memory Hierarchy model (Alpern et al.)
- Example machine trees (diagrams on the slides):
  - Dual-core PC: main memory -> L2 cache -> one L1 cache per core
  - 4-node cluster of PCs: aggregate cluster memory (virtual level) -> node memory on each node -> L2 cache -> L1 caches
  - Single Cell blade: main memory -> the SPEs' local stores

Hierarchical Memory (cont.)
- Dual Cell blade: main memory -> local stores of both Cells (no memory affinity modeled)
- System with a GPU: main memory -> GPU memory
- Cluster of dual Cell blades: aggregate cluster memory (virtual level) -> main memory on each blade -> local stores

Tasks

Sequoia Tasks
- Special functions called tasks are the building blocks of Sequoia programs
- Task args & temporaries define the working set
- The task working set is resident at a single location in the abstract machine tree

    task matmul::leaf( in    float A[M][T],
                       in    float B[T][N],
                       inout float C[M][N] )
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < T; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

Sequoia Tasks (Cont.)
- Sequoia parameter passing semantics are not call-by-value or call-by-name
- Rather: copy-in, copy-out, i.e. call-by-value-result
- This expresses the communication of arguments and results (see the C sketch after this slide)

Sequoia Tasks (Cont.)
- A task says what is copied, not how it is copied
- The latter is machine dependent:
  - File operations for a disk
  - MPI operations for a cluster
  - DMAs for the Cell processor
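To make call-by-value-result concrete, here is a purely illustrative C sketch (not Sequoia-generated code) of what a runtime conceptually does around a leaf call: copy the argument blocks into local memory, run the kernel on the local copies, and copy the results back out. The helpers local_alloc, copy_in, and copy_out are hypothetical stand-ins for the machine-specific mechanisms listed above (DMAs, MPI, file I/O).

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical stand-ins for machine-specific transfers. */
    static void *local_alloc(size_t n) { return malloc(n); }
    static void  local_free(void *p)   { free(p); }
    static void  copy_in (void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
    static void  copy_out(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }

    /* The leaf kernel operates only on its local copies. */
    static void matmul_leaf(int M, int N, int T,
                            const float *A, const float *B, float *C)
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < T; k++)
                    C[i*N + j] += A[i*T + k] * B[k*N + j];
    }

    /* Call-by-value-result: 'in' args are copied in, 'inout' args are
     * copied in and back out; the child never touches the parent's memory. */
    void call_matmul_leaf(int M, int N, int T,
                          const float *A, const float *B, float *C)
    {
        float *a = local_alloc((size_t)M * T * sizeof(float));
        float *b = local_alloc((size_t)T * N * sizeof(float));
        float *c = local_alloc((size_t)M * N * sizeof(float));

        copy_in(a, A, (size_t)M * T * sizeof(float));   /* in    A */
        copy_in(b, B, (size_t)T * N * sizeof(float));   /* in    B */
        copy_in(c, C, (size_t)M * N * sizeof(float));   /* inout C: copy in ... */

        matmul_leaf(M, N, T, a, b, c);

        copy_out(C, c, (size_t)M * N * sizeof(float));  /* ... and copy result out */

        local_free(a); local_free(b); local_free(c);
    }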

Task Hierarchies

    task matmul::inner( in    float A[M][T],
                        in    float B[T][N],
                        inout float C[M][N] )
    {
        tunable int P, Q, R;

        mappar( int i=0 to M/P, int j=0 to N/R ) {
            mapseq( int k=0 to T/Q ) {
                matmul( A[P*i:P*(i+1);P][Q*k:Q*(k+1);Q],
                        B[Q*k:Q*(k+1);Q][R*j:R*(j+1);R],
                        C[P*i:P*(i+1);P][R*j:R*(j+1);R] );
            }
        }
    }

- The inner variant recursively calls the matmul task on submatrices of A, B, and C of size PxQ, QxR, and PxR; the recursion bottoms out in matmul::leaf (shown on the previous slides)
- The calling task (matmul::inner) is located at one level of the machine tree (level X); the callee task (matmul::leaf) is located at a lower level (level Y)
- Tasks express multiple levels of parallelism (an OpenMP rendering of mappar/mapseq follows below)
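As a point of reference only, here is roughly what the mappar/mapseq structure of matmul::inner expresses when mapped onto a single shared-memory level, written as C with OpenMP. The block sizes P, Q, R stand in for the tunables and are assumed to divide M, N, T exactly; this is an illustration of the semantics, not compiler output.

    #include <omp.h>

    /* Leaf on one block: C_blk (PxR) += A_blk (PxQ) * B_blk (QxR). */
    static void matmul_leaf_blk(int P, int R, int Q,
                                const float *A, int lda,
                                const float *B, int ldb,
                                float *C, int ldc)
    {
        for (int i = 0; i < P; i++)
            for (int j = 0; j < R; j++)
                for (int k = 0; k < Q; k++)
                    C[i*ldc + j] += A[i*lda + k] * B[k*ldb + j];
    }

    void matmul_inner(int M, int N, int T, int P, int Q, int R,
                      const float *A, const float *B, float *C)
    {
        #pragma omp parallel for collapse(2)      /* mappar: (i, j) blocks are independent */
        for (int i = 0; i < M / P; i++)
            for (int j = 0; j < N / R; j++)
                for (int k = 0; k < T / Q; k++)   /* mapseq: k iterations update the same C block */
                    matmul_leaf_blk(P, R, Q,
                                    &A[(P*i)*T + Q*k], T,
                                    &B[(Q*k)*N + R*j], N,
                                    &C[(P*i)*N + R*j], N);
    }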

Leaf Variants
- Be practical: permit platform-specific kernels (a full CBLAS call is sketched below)

    task matmul::leaf( in    float A[M][T],
                       in    float B[T][N],
                       inout float C[M][N] )
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < T; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    task matmul::leaf_cblas( in    float A[M][T],
                             in    float B[T][N],
                             inout float C[M][N] )
    {
        cblas_sgemm(A, M, T, B, T, N, C, M, N);
    }

Summary: Sequoia Tasks
- A single abstraction for:
  - Isolation / parallelism
  - Explicit communication / working sets
  - Expressing locality
- Sequoia programs describe hierarchies of tasks
- Parameterized for portability

A Task Call (diagram)

Generalizing Tasks (diagram)
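The slide abbreviates the BLAS call. With the standard CBLAS interface, computing C += A * B for dense, row-major, single-precision matrices would look like the following sketch (leading dimensions assume densely packed matrices):

    #include <cblas.h>

    /* C (MxN) += A (MxT) * B (TxN), row-major, alpha = beta = 1.0f. */
    void matmul_leaf_cblas(int M, int N, int T,
                           const float *A, const float *B, float *C)
    {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, T,
                    1.0f, A, T,    /* lda = T */
                          B, N,    /* ldb = N */
                    1.0f, C, N);   /* ldc = N */
    }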

Abstraction vs. Reality
- The task hierarchy is abstract:
  - A task may have an unspecified number of sub-tasks
  - The number of levels of sub-tasks may be unspecified
- Actual machines have limits in both dimensions

Mapping

Machine Descriptions
- A separate file describes each machine (see the illustrative sketch below):
  - The number of levels of memory hierarchy
  - The amount of memory at each level
  - The number of processors at each level
- This file is written once per machine and used for each program compiled for that machine

Mappings
- A mapping file says how a particular program is mapped onto a specific machine:
  - Settings for tunables
  - Degree of parallelism for each level
  - Whether to software-pipeline compute/communication
- Excerpt from a mapping file:

    control(level 0) loop k[0] spmd fullrange = 0.6; ways = 6; iterblk = 1;

Compilation Overview
- The Sequoia compiler takes:
  - A Sequoia program
  - A mapping file
  - A machine description
- It generates code for all levels of the memory hierarchy, plus glue to pass/return task arguments using the appropriate communication primitives

Mapping Summary
- The abstract program must be made concrete for a particular machine
- Separate machine-specific parameters into:
  - Information that is common across programs: machine descriptions
  - Information specific to a machine-program pair: mapping files
- Mapping files can be (partially) automated
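As a purely illustrative aid (not Sequoia's actual machine-file format), the information a machine description captures can be pictured as a small tree data structure. The C below sketches the dual-core PC tree from the earlier slides; the field names and capacities are invented for illustration.

    #include <stddef.h>

    typedef struct MemLevel {
        const char      *name;        /* e.g. "main memory", "L2", "L1" */
        size_t           capacity;    /* bytes available at this level  */
        int              n_procs;     /* processors attached here       */
        int              n_children;  /* fan-out in the machine tree    */
        struct MemLevel *children;
    } MemLevel;

    /* Dual-core PC: main memory -> L2 cache -> two per-core L1 caches. */
    static MemLevel l1[2] = {
        { "L1", 32u * 1024, 1, 0, NULL },
        { "L1", 32u * 1024, 1, 0, NULL },
    };
    static MemLevel l2       = { "L2", 4u * 1024 * 1024, 0, 2, l1 };
    static MemLevel main_mem = { "main memory", 1024u * 1024 * 1024, 0, 1, &l2 };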

Compilation Overview The Sequoia compiler takes A Sequoia program A mapping file A machine description Generates code for All levels of the memory hierarchy Glue to pass/return task arguments using appropriate communication primitives 37 Mapping Summary The abstract program must be made concrete for a particular machine Separate machine-specific parameters into: Information that is common across programs Machine descriptions Information specific to a machine-program pair Mapping files Mapping files can be (partially) automated 38 Sequoia Benchmarks Performance Results Blas Level 1 SAXPY, Level 2 SGEMV, and Level Linear 3 SGEMM Algebra benchmarks 2D single precision convolution with 9x9 support Conv2D (non-periodic boundary constraints) Complex single precision FFT 100 time steps of N-body stellar dynamics simulation (N 2 ) single precision Fuzzy protein string matching using HMM evaluation (Horn et al. SC2005 paper) 39 40 10

Single Runtime System Configurations
- Scalar: 2.4 GHz Intel Pentium 4 Xeon, 1GB
- SMP: 8-way SMP, 4 dual-core 2.66 GHz Intel P4 Xeons, 8GB
- Disk: 2.4 GHz Intel P4, 160GB disk, ~50 MB/s from disk
- Cluster: 16 Intel 2.4 GHz P4 Xeons, 1GB/node, Infiniband interconnect (780 MB/s)
- Cell: 3.2 GHz IBM Cell blade (1 Cell, 8 SPEs), 1GB
- PS3: 3.2 GHz Cell in a Sony Playstation 3 (6 SPEs), 256MB (160MB usable)

Resource Utilization
- Chart (IBM Cell): bandwidth utilization and compute utilization (%), per benchmark

Single Runtime Configurations - GFlop/s

              Scalar   SMP    Disk    Cluster   Cell   PS3
    SAXPY     0.3      0.7    0.007   4.9       3.5    3.1
    SGEMV     1.1      1.7    0.04    12        12     10
    SGEMM     6.9      45     5.5     91        119    94
    CONV2D    1.9      7.8    0.6     24        85     62
    FFT3D     0.7      3.9    0.05    5.5       54     31
    GRAVITY   4.8      40     3.7     68        97     71
    HMMER     0.9      11     0.9     12        12     7.1

SGEMM Performance
- Cluster: Intel Cluster MKL 101 GFlop/s; Sequoia 91 GFlop/s
- SMP: Intel MKL 44 GFlop/s; Sequoia 45 GFlop/s

FFT3D Performance
- Cell: Mercury Computer 58 GFlop/s; FFTW 3.2 alpha 2: 35 GFlop/s; Sequoia 54 GFlop/s
- Cluster: FFTW 3.2 alpha 2: 5.3 GFlop/s; Sequoia 5.5 GFlop/s
- SMP: FFTW 3.2 alpha 2: 4.2 GFlop/s; Sequoia 3.9 GFlop/s

Best Known Implementations
- HMMer: ATI X1900XT 9.4 GFlop/s (Horn et al. 2005); Sequoia Cell 12 GFlop/s; Sequoia SMP 11 GFlop/s
- Gravity: GRAPE-6A 2 billion interactions/s (Fukushige et al. 2005); Sequoia Cell 4 billion interactions/s; Sequoia PS3 3 billion interactions/s

Out-of-core Processing

              Scalar   Disk
    SAXPY     0.3      0.007
    SGEMV     1.1      0.04
    SGEMM     6.9      5.5
    CONV2D    1.9      0.6
    FFT3D     0.7      0.05
    GRAVITY   4.8      3.7
    HMMER     0.9      0.9

- Some applications have enough computational intensity to run from disk with little slowdown

Cluster vs. PS3

              Cluster   PS3
    SAXPY     4.9       3.1
    SGEMV     12        10
    SGEMM     91        94
    CONV2D    24        62
    FFT3D     5.5       31
    GRAVITY   68        71
    HMMER     12        7.1

- Cost: Cluster $150,000; PS3 $499

Multi-Runtime System Configurations
- Cluster of SMPs: four 2-way, 3.16 GHz Intel Pentium 4 Xeons connected via GigE (80 MB/s peak)
- Disk + PS3: Sony Playstation 3 bringing data from disk (~30 MB/s)
- Cluster of PS3s: two Sony Playstation 3s connected via GigE (60 MB/s peak)

SMP vs. Cluster of SMPs

              Cluster of SMPs   SMP
    SAXPY     1.9               0.7
    SGEMV     4.4               1.7
    SGEMM     48                45
    CONV2D    4.8               7.8
    FFT3D     1.1               3.9
    GRAVITY   50                40
    HMMER     14                11

- Same number of total processors in both configurations
- Compute-limited applications are agnostic to the interconnect

PS3 Cluster as a Compute Platform?

              PS3 Cluster   PS3
    SAXPY     5.3           3.1
    SGEMV     15            10
    SGEMM     30            94
    CONV2D    19            62
    FFT3D     0.36          31
    GRAVITY   119           71
    HMMER     13            7.1

Autotuner Results (GFlop/s)

              Cell                       Cluster of PCs            Cluster of PS3s
              auto   hand   other        auto   hand   other       auto   hand
    conv2d    99.6   85     -            26.7   24     -           20.7   19
    sgemm     137    119    -            92.4   90     101 (MKL)   33.4   30
    fft3d     57     54     39 (FFTW),   5.5    5.5    5.3 (FFTW)  0.57   0.36
                            46.8 (IBM)

Sequoia Summary
- Problem: deep memory hierarchies pose a performance programming challenge, and the memory hierarchy differs from machine to machine
- Solution: hierarchical memory in the programming model
  - Program the memory hierarchy explicitly
  - Expose properties that affect performance
- Approach: express hierarchies of tasks
  - Tasks execute in a local address space
  - Call-by-value-result semantics exposes communication
  - Parameterized for portability

Sequoia vs. Regent
- Sequoia is static:
  - Static mapping, task & data hierarchy
  - Structured data only
  - One hierarchy for both tasks & data
  - Machine model is a memory hierarchy
- Regent is dynamic:
  - Structured & unstructured data
  - Separate task/data hierarchies, and multiple hierarchies for data!
  - Incurs additional complexity (e.g., privileges)
  - Machine is a graph of processors & memories