Center for Information Services and High Performance Computing (ZIH). Event Tracing and Visualization for Cell Broadband Engine Systems


Center for Information Services and High Performance Computing (ZIH)
Event Tracing and Visualization for Cell Broadband Engine Systems
Daniel Hackenberg (daniel.hackenberg@zih.tu-dresden.de)

- Cell Broadband Engine Processor offers vast resources
  - SPEs: SIMD cores for fast calculations, 256 KB local store (LS, software controlled), dedicated DMA engine (MFC)
  - PPE: very simple PowerPC core for OS (Linux) and control tasks
- Sophisticated architecture results in complex software development process
  - Different compilers and programs for PPE and SPEs
  - SPEs use DMA commands to access main memory or LS of other SPEs, asynchronous execution by MFC
  - Mailbox communication between PPE and SPEs
- Tool support for software development and performance analysis required

[Figure: Cell/B.E. block diagram. PowerPC Processor Element (PPE: PowerPC core with L1 and L2) and eight SPEs (each with SPU and LS) attached to the Element Interconnect Bus (EIB), together with the Memory Interface Controller (MIC, Dual XDR) and the Bus Interface Controller (BIC, FlexIO).]

Software Tracing and Vampir

- Proven method for analysis of complex programs
  - Instrumented target application creates events with timestamps at runtime
  - Events are stored in traces; trace analysis e.g. by visualization
- VampirTrace: open source trace monitor
  - Supports MPI, OpenMP, regions, hardware counters
  - Creates program traces in Open Trace Format (OTF)
- Vampir: visualization and analysis of trace data
  - Various displays (e.g. timelines) and means for statistical analysis
  - Parallel version supports ultra large program traces
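The idea of timestamped events that are later analyzed can be illustrated with a minimal sketch. It is not VampirTrace code and does not use the OTF format; the `Tracer` class, the `Event` record and the `region_durations` helper are hypothetical names chosen for illustration only.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Event:
    timestamp: float
    kind: str    # "enter" or "leave"
    region: str  # name of the instrumented code region


class Tracer:
    """Hypothetical in-memory trace monitor: records enter/leave events."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self.clock = clock
        self.events: List[Event] = []

    def enter(self, region: str) -> None:
        self.events.append(Event(self.clock(), "enter", region))

    def leave(self, region: str) -> None:
        self.events.append(Event(self.clock(), "leave", region))


def region_durations(events: List[Event]) -> List[Tuple[str, float, float]]:
    """Pair enter/leave events with a stack.

    Returns (region, start, duration) tuples in the order regions were left,
    which is the information a timeline display needs to draw one bar.
    """
    stack: List[Event] = []
    result: List[Tuple[str, float, float]] = []
    for ev in events:
        if ev.kind == "enter":
            stack.append(ev)
        else:
            start = stack.pop()
            if start.region != ev.region:
                raise ValueError("unbalanced enter/leave events")
            result.append((ev.region, start.timestamp,
                           ev.timestamp - start.timestamp))
    return result
```

A caller would wrap the code regions of interest in `enter`/`leave` calls and pass `tracer.events` to `region_durations` after the run; the clock is injectable so that the analysis can be checked with fixed timestamps.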

Software Tracing on Cell/B.E. Systems

- PPE
  - Conventional tools with PowerPC support run unmodified
  - Modifications necessary to support SPE threads
- SPE
  - New concept needs to be designed, suitable for this architecture
  - New monitor necessary to generate events
  - Local store too small, only temporary storage of events
  - Synchronization of PPE and SPE timers necessary
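The constraint that the local store can hold events only temporarily can be pictured with a small simulation: a fixed-capacity buffer that is emptied into a larger store whenever it fills up. This is a sketch under stated assumptions, not the VampirTrace implementation; `SpeTraceBuffer`, its capacity and the plain Python list standing in for main memory are all hypothetical, and on real hardware the flush would be a DMA transfer.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp, description); layout is an assumption


class SpeTraceBuffer:
    """Hypothetical small event buffer that spills to a larger store when full."""

    def __init__(self, capacity: int, main_memory: List[Event]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.local: List[Event] = []    # stands in for space in the local store
        self.main_memory = main_memory  # stands in for the trace in main memory
        self.flushes = 0                # each flush costs transfer time

    def record(self, timestamp: int, what: str) -> None:
        # Make room first so the buffer never exceeds its capacity.
        if len(self.local) == self.capacity:
            self.flush()
        self.local.append((timestamp, what))

    def flush(self) -> None:
        if not self.local:
            return
        self.main_memory.extend(self.local)  # a DMA put on real hardware
        self.local.clear()
        self.flushes += 1
```

The flush counter makes the trade-off visible: a smaller buffer leaves more local store to the application but causes more transfers of trace data.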

Trace Concept for the Cell/B.E. Architecture

Trace Visualization for Cell (1)

[Figure: timeline display. Vertical axis: Location (Process 1 to Process 4, each showing Region 1 and Region 2). Horizontal axis: Time.]

- Illustration of parallel processes in a classic timeline display

Trace Visualization for Cell (2)

[Figure: timeline display. Vertical axis: Location (PPE Process 1; SPE Threads 1 to 3, each showing Region 1 and Region 2). Horizontal axis: Time.]

- Illustration of SPE threads as children of the PPE process

Trace Visualization for Cell (3)

[Figure: timeline display. Vertical axis: Location (PPE Process 1; SPE Threads 1 to 3, each showing Region 1). Horizontal axis: Time.]

- Illustration of mailbox messages (send/receive)
  - Classic two-sided communication
  - Illustrated by lines similar to MPI messages

Trace Visualization for Cell (4)

[Figure: timeline display. Vertical axis: Location (PPE Process 1; Main Memory with read and write accesses; SPE Threads 1 and 2, each showing Region 1). Horizontal axis: Time.]

- Illustration of DMA transfers between SPEs and main memory
  - Virtual process bar represents the main memory (read/write)
  - Illustration of main memory state possible

Trace Visualization for Cell (5)

[Figure: timeline display. Vertical axis: Location (PPE Process 1; Main Memory; SPE Threads 1 and 2). Lines labeled DMA get and DMA put. Horizontal axis: Time.]

- DMA transfers between SPEs
  - Classic send/receive representation unsuitable
  - Additional line allows distinction of active and passive partner

Trace Visualization for Cell (6)

[Figure: timeline display. Vertical axis: Location (PPE Process 1; Main Memory; SPE Threads 1 and 2). A DMA wait region is marked with the timestamps t_0, t_1 and t_2. Horizontal axis: Time.]

    t_0 = get_timestamp();
    mfc_get();
    [...]
    t_1 = get_timestamp();
    wait_for_dma_tag();
    t_2 = get_timestamp();

- DMA wait operation creates two events (at t_1 and t_2)
  - Allows illustration of DMA wait time
  - Similar for mailbox messages
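The timestamp pattern on this slide can be mimicked in plain Python to show how the wait time falls out of the two wait events. This is a sketch, not Cell SDK code: the clock, the three callables and the event names (`dma_get`, `dma_wait_begin`, `dma_wait_end`) are hypothetical stand-ins for `get_timestamp()`, `mfc_get()`, the elided work and `wait_for_dma_tag()`.

```python
from typing import Callable, List, Tuple

Event = Tuple[int, str]  # (timestamp, event name); names are hypothetical


def traced_dma_get(clock: Callable[[], int],
                   issue_get: Callable[[], None],
                   overlap_work: Callable[[], None],
                   wait_for_tag: Callable[[], None]) -> List[Event]:
    """Record one event for the DMA get and two events around the wait."""
    events: List[Event] = []
    t0 = clock()
    issue_get()              # asynchronous: returns before the data arrives
    events.append((t0, "dma_get"))
    overlap_work()           # computation overlapped with the transfer
    t1 = clock()
    wait_for_tag()           # blocks until the transfer has completed
    t2 = clock()
    events.append((t1, "dma_wait_begin"))
    events.append((t2, "dma_wait_end"))
    return events


def dma_wait_time(events: List[Event]) -> int:
    """Time spent blocked in the wait, i.e. t_2 - t_1."""
    times = {name: ts for ts, name in events}
    return times["dma_wait_end"] - times["dma_wait_begin"]
```

A wait time close to zero suggests the transfer was hidden behind computation, while a long wait suggests the SPE stalled on the transfer.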

Implementation

- Beta version of VampirTrace (VT) with Cell support
  - Open source trace monitor
  - Compiler wrappers (vtcc and vtspucc) will do most of the work for you
  - Header files for PPE and SPE programs: instrumentation of inline functions provided by the Cell SDK
  - Manual instrumentation of important SPE code regions for low overhead
  - Tracing of hybrid Cell/MPI parallel applications supported
- Trace analyzer Vampir: technology study with support for Cell traces available

Trace Visualization with Vampir (1)

- Visualization of a Cell trace using Vampir
- Demo program using 4 SPEs

Trace Visualization with Vampir (2)

Trace Visualization with Vampir (3)

Trace Visualization with Vampir (4)

Trace Visualization with Vampir (5)

- Complex DMA transfers of SPE 3

Tracing Complex Cell Applications: FFT (1)

- FFT at synchronization point
- 8 SPEs, 64 KByte page size, 11.9 GFLOPS

Tracing Complex Cell Applications: FFT (2)

- FFT at synchronization point
- 8 SPEs, 64 KByte page size, 11.9 GFLOPS

Tracing Complex Cell Applications: FFT (3)

- FFT at synchronization point
- 8 SPEs, 16 MByte page size, 42.9 GFLOPS

Tracing Hybrid Cell/MPI Applications: PBPI

- PBPI (Parallel Bayesian Phylogenetic Inference) on 3 QS21 blades (6 Cell processors)

Cell Tracing Overhead

- Overhead sources
  - Creating events
  - Transferring trace data from the SPEs to main memory
  - Trace buffer and trace library use space in local store (< 12 KByte)
  - Additional overhead (VampirTrace initialization and processing of SPE event data) occurs outside of SPE runtime; analysis unaffected
- Experimental overhead measurements (QS21, 8 SPEs)

Benchmark          Original (GFLOPS)   Tracing (GFLOPS)   Overhead
SGEMM                         203.25             200.73      1.3 %
FFT                            11.93              11.85      0.7 %
Cholesky, SPOTRF              143.17             139.32      2.8 %
Cholesky, DGEMM                 4.48               4.10      9.2 % (*)
Cholesky, STRSM                 5.73               5.64      1.7 %

(*) Increased overhead due to intense usage of DMA lists. Trace overhead without DMAs: 1.4 %
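The slide does not state how the overhead column was derived. One reading, assumed here, is the relative runtime increase: for a fixed amount of work, runtime is inversely proportional to GFLOPS, so overhead = original / traced - 1. The sketch below checks that reading against the published figures; because the GFLOPS values are themselves rounded, it reproduces the slide's percentages only to within about 0.1 percentage points.

```python
# (original GFLOPS, traced GFLOPS, overhead in percent as printed on the slide)
MEASUREMENTS = {
    "SGEMM": (203.25, 200.73, 1.3),
    "FFT": (11.93, 11.85, 0.7),
    "Cholesky, SPOTRF": (143.17, 139.32, 2.8),
    "Cholesky, DGEMM": (4.48, 4.10, 9.2),
    "Cholesky, STRSM": (5.73, 5.64, 1.7),
}


def runtime_overhead_percent(original_gflops: float, traced_gflops: float) -> float:
    """Relative runtime increase in percent (assumed definition of overhead).

    For a fixed amount of work, runtime ~ 1 / GFLOPS, so
    t_traced / t_original - 1 == original_gflops / traced_gflops - 1.
    """
    return (original_gflops / traced_gflops - 1.0) * 100.0


def compare_with_slide() -> dict:
    """Return {benchmark: (computed overhead, overhead printed on the slide)}."""
    return {
        name: (runtime_overhead_percent(orig, traced), printed)
        for name, (orig, traced, printed) in MEASUREMENTS.items()
    }


if __name__ == "__main__":
    for name, (computed, printed) in compare_with_slide().items():
        print(f"{name:18s} computed {computed:5.2f} %   slide {printed:4.1f} %")
```

The alternative definition, (original - traced) / original, gives slightly smaller values and fits the DGEMM row less well, which is why the runtime-based reading is used in this sketch.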

Summary & Future Work

- Concept for software tracing on Cell systems presented
- VampirTrace Beta with Cell support, typical overhead < 5 percent
- Visualization of traces with Vampir
  - Creates invaluable insight into the runtime behavior of Cell applications
  - Intuitive performance analysis and optimization
  - Support for large, hybrid Cell/MPI applications
- Future work may include:
  - Improved tracing, e.g. by providing additional analysis features such as alignment checks
  - Improved visualization, e.g. by colorizing DMA messages (tag, size or bandwidth), displaying intensity of main memory accesses