Evaluating the Portability of UPC to the Cell Broadband Engine

Evaluating the Portability of UPC to the Cell Broadband Engine Dipl.-Inform. Ruben Niederhagen JSC Cell Meeting CHAIR FOR OPERATING SYSTEMS

Outline Introduction UPC Cell UPC on Cell Mapping Compiler and Runtime System Optimization: Software Managed Cache Conclusion Porting UPC to Cell 2 Chair for Operating Systems

Introduction UPC Unified Parallel C (UPC) is a C-language derivative for parallel computing. [Figure: shared memory architecture (P1-P4 on one memory, e.g. OpenMP) vs. distributed memory architecture (processors with private memories connected by a network, e.g. MPI)] Porting UPC to Cell 3 Chair for Operating Systems

Introduction UPC Shared Data int i, j; shared int data[2*n]; // n = number of threads... [Figure: default cyclic distribution over the global address space: Thread 0 holds data[0] and data[n]; Thread 1 holds data[1] and data[n+1]; ...; in addition, every thread holds private copies of i and j] Porting UPC to Cell 4 Chair for Operating Systems

Introduction UPC Shared Data: Blocking int i, j; shared [2] int data[2*n]; // n = number of threads... [Figure: with block size 2, Thread 0 holds data[0] and data[1]; Thread 1 holds data[2] and data[3]; ...; the last thread holds data[2n-2] and data[2n-1]; every thread additionally holds private copies of i and j] Shared data can be directly accessed by all threads. UPC takes care of remote accesses. Porting UPC to Cell 5 Chair for Operating Systems

Introduction UPC Work Sharing the loop: upc_forall (int i = 0; i < n; i++; i) data[i] = MYTHREAD; translates to: for (int i = 0; i < n; i++) if (i % THREADS == MYTHREAD) data[i] = MYTHREAD; Porting UPC to Cell 6 Chair for Operating Systems

Introduction UPC Work Sharing equally: upc_forall (int i = 0; i < n; i++; &data[i]) data[i] = MYTHREAD; translates to: for (int i = 0; i < n; i++) if (&data[i] has affinity to MYTHREAD) data[i] = MYTHREAD; Distribute work over several threads easily. Utilize data locality. Porting UPC to Cell 7 Chair for Operating Systems

Introduction UPC Example: Matrix Multiplication shared [N*P/THREADS] int a[N][P]; shared [N*M/THREADS] int c[N][M]; // a and c are row-wise blocked shared matrices shared [M/THREADS] int b[P][M]; // column-wise blocking void main (void) { int i, j, l; // private variables upc_forall (i = 0; i < N; i++; &c[i][0]) { for (j = 0; j < M; j++) { c[i][j] = 0; for (l = 0; l < P; l++) c[i][j] += a[i][l]*b[l][j]; } } } Porting UPC to Cell 8 Chair for Operating Systems

Introduction UPC Example: Matrix Multiplication [Figure: C = A x B, the row-wise distribution of the upc_forall iterations shown step by step] upc_forall (i = 0; i < N; i++; &c[i][0]) { for (j = 0; j < M; j++) { c[i][j] = 0; for (l = 0; l < P; l++) c[i][j] += a[i][l]*b[l][j]; } } Porting UPC to Cell 9 Chair for Operating Systems

Introduction UPC And Some More Stuff: Dynamic Shared Memory Allocation; Synchronization: Barriers, Split-Phase Barriers, Locks; UPC Libraries: Collective Library, I/O Library, ... Porting UPC to Cell 10 Chair for Operating Systems

Introduction UPC Memory Consistency shared int flag = 0; shared int data[2]; Thread 0: data[0] = 42; data[1] = 23; flag = 1; Thread 1: while (flag == 0); data[0] += data[1]; Porting UPC to Cell 11 Chair for Operating Systems

Introduction UPC Memory Consistency The memory consistency model of UPC defines two consistency types: strict: enforces completion of any memory access preceding a strict access and delays any subsequent access until the strict access has finished; relaxed: a sequence of relaxed memory accesses may be reordered as long as interdependencies are respected. [Figure: relaxed accesses need only respect each thread's local view; strict accesses enforce a global view] Porting UPC to Cell 12 Chair for Operating Systems

Introduction UPC Memory Consistency strict shared int flag = 0; relaxed shared int data[2]; Thread 0: data[0] = 42; data[1] = 23; flag = 1; Thread 1: while (flag == 0); data[0] += data[1]; Porting UPC to Cell 13 Chair for Operating Systems

Introduction Cell The Cell processor is a hybrid multicore processor: [Figure: block diagram: the PPE (PPU with cache) and the SPEs (SPU, LS, MFC) connected by the EIB, with the MIC attaching main memory] PPE: Power Processing Element, PPU: Power Processing Unit, EIB: Element Interconnect Bus, MIC: Memory Interface Controller, SPE: Synergistic Processing Element, MFC: Memory Flow Controller, LS: Local Store, SPU: Synergistic Processing Unit Porting UPC to Cell 14 Chair for Operating Systems

Introduction Cell The PPU runs the operating system and administrative tasks. The SPUs handle the parallel workload. The PPU has full transparent access to main memory. The SPUs access only their LS. Data has to be transferred from main memory to the LS explicitly. Porting UPC to Cell 15 Chair for Operating Systems
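
The following is a minimal sketch, not part of the original slides, of what such an explicit transfer looks like in SPU code, assuming the standard Cell SDK MFC intrinsics from spu_mfcio.h; the buffer name, tag value, and function name are illustrative only:

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define TAG 0

    /* buffer in the Local Store, 128-byte aligned for efficient DMA */
    static int ls_buf[256] __attribute__((aligned(128)));

    void process(uint64_t ea)   /* ea: effective address in main memory */
    {
        /* pull data in: main memory -> Local Store */
        mfc_get(ls_buf, ea, sizeof(ls_buf), TAG, 0, 0);
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();      /* wait until the transfer is done */

        /* ... compute on ls_buf ... */

        /* push results back: Local Store -> main memory */
        mfc_put(ls_buf, ea, sizeof(ls_buf), TAG, 0, 0);
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }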

UPC on Cell Mapping [Figure: the shared memory and distributed memory pictures from the introduction vs. the Cell topology: PPU and SPUs with their Local Stores on the EIB] The SPUs have no direct access to main memory: no classical Shared Memory. The LS is too small for distributed data: no classical Distributed Memory. Porting UPC to Cell 16 Chair for Operating Systems

UPC on Cell Mapping Solution All shared data resides in main memory. Data is transferred between main memory and LS on demand. Data partitioning has to be maintained to achieve data affinity: [Figure: cyclic partitioning: one SPE is responsible for data[0], data[3], data[6], ...; the next for data[1], data[4], data[7], ...; and so on] Porting UPC to Cell 17 Chair for Operating Systems
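
As a small illustration (not from the slides) of the arithmetic behind such a cyclic partitioning with block size 1, using hypothetical helper names; UPC itself exposes the same information through upc_threadof() and upc_phaseof() on pointers-to-shared:

    /* cyclic layout (block size 1): which thread owns data[i], and where */
    static inline int owner_thread(int i) { return i % THREADS; }
    static inline int local_slot(int i)   { return i / THREADS; }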

UPC on Cell Mapping Solution relaxed: on read access, transfer data from main memory to LS (blocking transfer); on write access, transfer data from LS to main memory (non-blocking transfer). strict: as before, but using DMA barriers and synchronization mechanisms to maintain consistency; more expensive. Porting UPC to Cell 18 Chair for Operating Systems
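
A hedged sketch of how these runtime paths could look on the SPU side; shared_read, shared_write, and strict_fence are invented names for illustration, while the MFC calls are the standard spu_mfcio.h intrinsics (same includes as the sketch above):

    #define ACCESS_TAG 1

    /* relaxed read: blocking transfer, the value is needed immediately */
    static void shared_read(void *ls, uint64_t ea, unsigned size)
    {
        mfc_get(ls, ea, size, ACCESS_TAG, 0, 0);
        mfc_write_tag_mask(1 << ACCESS_TAG);
        mfc_read_tag_status_all();
    }

    /* relaxed write: non-blocking transfer, completion checked later */
    static void shared_write(void *ls, uint64_t ea, unsigned size)
    {
        mfc_put(ls, ea, size, ACCESS_TAG, 0, 0);
    }

    /* strict access: drain all outstanding DMAs before (and after) it */
    static void strict_fence(void)
    {
        mfc_write_tag_mask(1 << ACCESS_TAG);
        mfc_read_tag_status_all();
    }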

UPC on Cell Compiler & Runtime System [Figure: Berkeley UPC software stack: UPC code, UPC-to-C compiler, compiler-generated C code, UPC runtime system, GASNet communication system, network hardware; the layers are respectively platform-, network-, compiler-, and language-independent] Approach: use the Berkeley UPC-to-C compiler and implement the UPC runtime layer for the Cell processor. Porting UPC to Cell 19 Chair for Operating Systems

Optimization: Software Managed Cache Disadvantages of the Simple Implementation every DMA transfer is split into aligned 128-byte blocks: inefficient bus utilization, retransmission of sequential data; data must be aligned equally in a 16-byte grid: buffering and copying in the LS necessary; data size is restricted to 1, 2, 4, 8, or n*16 bytes: transfers of invalid size must be split. So: why not use a cache with 128-byte cache lines? Porting UPC to Cell 20 Chair for Operating Systems

Optimization: Software Managed Cache Cache Line and Cache Frame On data access, load the 128-byte cache line into a cache frame in the LS. [Figure: a 128-byte line in main memory mapped onto a cache frame of the software cache in the Local Store] Porting UPC to Cell 21 Chair for Operating Systems
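
A hedged sketch of a direct-mapped software cache along these lines; cache_lookup, NFRAMES, CTAG, and the other names are illustrative, not the presenter's implementation. It maps an effective address to its 128-byte line, checks the tag of the corresponding frame, and fetches the line with a blocking mfc_get on a miss (uses spu_mfcio.h and stdint.h as above):

    #define LINE    128
    #define NFRAMES 64          /* 8 KiB of the Local Store used as cache */
    #define CTAG    2

    static unsigned char frames[NFRAMES][LINE] __attribute__((aligned(128)));
    static uint64_t      tags[NFRAMES];    /* line base address per frame */
    static int           valid[NFRAMES];

    /* return an LS pointer for main-memory effective address ea */
    static void *cache_lookup(uint64_t ea)
    {
        uint64_t base = ea & ~(uint64_t)(LINE - 1);
        unsigned idx  = (unsigned)((base / LINE) % NFRAMES);   /* direct mapped */

        if (!valid[idx] || tags[idx] != base) {
            /* miss: load the 128-byte cache line into the cache frame
               (a full implementation would first flush a dirty victim) */
            mfc_get(frames[idx], base, LINE, CTAG, 0, 0);
            mfc_write_tag_mask(1 << CTAG);
            mfc_read_tag_status_all();
            tags[idx]  = base;
            valid[idx] = 1;
        }
        return &frames[idx][ea - base];
    }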

Optimization: Software Managed Cache Cache and Consistency relaxed access: can be reordered and thus cached without synchronization. strict access: no caching, to preserve consistency. Only relaxed accesses to shared data are allowed to be cached; this avoids further inter-node communication. Porting UPC to Cell 22 Chair for Operating Systems

Optimization: Software Managed Cache Cache Flush The cache must be flushed at synchronization points: strict accesses, UPC barriers, UPC locks. In contrast to classic HW caches, only dirty bytes may be transferred back to main memory: [Figure: two Local Stores caching the same 128-byte line; each writes back only the bytes it modified, so the writers do not overwrite each other's data] Porting UPC to Cell 23 Chair for Operating Systems
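
A hedged sketch of such a dirty-byte write-back, reusing the frames/tags structures from the cache sketch above; a byte-granular dirty bitmap per frame decides what is transferred:

    /* per-frame dirty bitmap: one bit per byte of the 128-byte line */
    static uint32_t dirty[NFRAMES][LINE / 32];

    static void mark_dirty(unsigned idx, unsigned off, unsigned len)
    {
        for (unsigned b = off; b < off + len; b++)
            dirty[idx][b / 32] |= 1u << (b % 32);
    }

    /* flush: write back each maximal run of dirty bytes, nothing else */
    static void flush_frame(unsigned idx)
    {
        unsigned b = 0;
        while (b < LINE) {
            if (!(dirty[idx][b / 32] & (1u << (b % 32)))) { b++; continue; }
            unsigned start = b;
            while (b < LINE && (dirty[idx][b / 32] & (1u << (b % 32))))
                b++;
            /* NOTE: a real runtime must split this run into legal DMA sizes
               (1, 2, 4, 8, or n*16 bytes), as discussed for the simple case */
            mfc_put(&frames[idx][start], tags[idx] + start, b - start, CTAG, 0, 0);
        }
        mfc_write_tag_mask(1 << CTAG);
        mfc_read_tag_status_all();
        for (unsigned w = 0; w < LINE / 32; w++)
            dirty[idx][w] = 0;
    }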

Optimization: Software Managed Cache Cache Flush [Figure: flush cost in µs versus the number of competing SPEs (1 to 8), for one to five transfers and for an atomic unit] Porting UPC to Cell 24 Chair for Operating Systems

Optimization: Software Managed Cache Cache Strategy for Reads [Flowchart: a read access is a blocking call: does the data reside in the cache? if not, load the cache line into an empty cache frame; then return the data] Porting UPC to Cell 25 Chair for Operating Systems
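
Continuing the earlier cache sketch, the read path is then just a cached, blocking lookup (illustrative names only):

    /* blocking read of a shared int at effective address ea */
    static int cached_read_int(uint64_t ea)
    {
        return *(int *)cache_lookup(ea);   /* a miss blocks until the line is in the LS */
    }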

Optimization: Software Managed Cache Cache Strategies for Writes Direct Write Through [Flowchart: on a write access, check whether the data resides in the cache; if not, use an empty cache frame as a buffer; copy the data into the cache and transfer it to main memory in the background] Porting UPC to Cell 26 Chair for Operating Systems
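
A hedged sketch of the write-through path, again reusing the structures from the cache sketch; note that the staging buffer must match the 16-byte alignment of the target address, as required by the DMA restrictions mentioned earlier:

    static unsigned char stage[16] __attribute__((aligned(16)));   /* miss buffer */

    static void write_through_int(uint64_t ea, int value)
    {
        uint64_t base = ea & ~(uint64_t)(LINE - 1);
        unsigned idx  = (unsigned)((base / LINE) % NFRAMES);
        void    *src;

        if (valid[idx] && tags[idx] == base)
            src = &frames[idx][ea - base];   /* hit: copy data to cache */
        else
            src = &stage[ea & 15];           /* miss: LS buffer, alignment matched
                                                to ea within the 16-byte grid */
        *(int *)src = value;

        /* transfer data to main memory in the background (no tag wait here);
           a real implementation would rotate staging buffers or wait before
           reusing one that still has an outstanding transfer */
        mfc_put(src, ea, sizeof(int), CTAG, 0, 0);
    }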

Optimization: Software Managed Cache Cache Strategies for Writes Bundled Writes With Cache Line Read [Flowchart: a write access is a blocking call: if the data does not reside in the cache, read the cache line into an empty cache frame; then copy the data into the cache] Porting UPC to Cell 27 Chair for Operating Systems

Optimization: Software Managed Cache Cache Strategies for Writes Bundled Writes With Cache Line Read On Demand [Flowchart: on a write access, if the data does not reside in the cache, copy it into an empty cache frame and mark that frame as write-only; otherwise copy the data into the cache] Porting UPC to Cell 28 Chair for Operating Systems
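
A hedged sketch of the read-on-demand variant, building on the cache and dirty-byte sketches above: on a write miss the frame is claimed without fetching the line and marked write-only, so only the recorded dirty bytes are written back at the next flush (names are again illustrative):

    static int write_only[NFRAMES];       /* frame holds only dirty bytes */

    static void bundled_write_int(uint64_t ea, int value)
    {
        uint64_t base = ea & ~(uint64_t)(LINE - 1);
        unsigned idx  = (unsigned)((base / LINE) % NFRAMES);

        if (!valid[idx] || tags[idx] != base) {
            /* miss: claim an empty frame, do NOT read the line from memory
               (a full version would evict and flush the old occupant first) */
            tags[idx]       = base;
            valid[idx]      = 1;
            write_only[idx] = 1;          /* read accesses must not hit this frame */
        }
        *(int *)&frames[idx][ea - base] = value;        /* copy data to cache */
        mark_dirty(idx, (unsigned)(ea - base), sizeof(int));
        /* nothing is transferred now; flush_frame() writes the dirty bytes
           back at the next synchronization point (strict access, barrier, lock) */
    }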

Optimization: Software Managed Cache Cache Strategies for Writes Direct Write Through: advantage: small administrative overhead; disadvantage: bus congestion. Bundled Writes With Cache Line Read: advantage: reduced bus load; disadvantage: latency on the first write. Bundled Writes With Cache Line Read On Demand: advantage: fast writes to shared data; disadvantage: bus congestion on the synchronized cache flush. Porting UPC to Cell 29 Chair for Operating Systems

Conclusion What did we already do? We thought about all that very carefully. What will we do next? We are going to implement the proposed architecture, implement several caching strategies, compare and evaluate these strategies, extend our approach to clusters of Cell blades, and integrate code partitioning into the system. Porting UPC to Cell 30 Chair for Operating Systems

Conclusion Questions? Porting UPC to Cell 31 Chair for Operating Systems