On the Portability and Performance of Message-Passing Programs on Embedded Multicore Platforms

On the Portability and Performance of Message-Passing Programs on Embedded Multicore Platforms Shih-Hao Hung, Po-Hsun Chiu, Chia-Heng Tu, Wei-Ting Chou and Wen-Long Yang Graduate Institute of Networking and Multimedia Department of Computer Science and Information Engineering National Taiwan University

Outline: Motivation; The MeSsaGe (MSG) Passing Library; Design for Performance; Experimental Results; Summary

Motivation: Heterogeneous multi-core architectures have gained popularity for embedded systems thanks to their power efficiency and high performance. However, there is no de facto solution for inter-core communication on such multicore chips: each platform provides its own communication mechanisms (Qualcomm Snapdragon, TI DaVinci, IBM Cell, ITRI PACDuo), so developers constantly have to learn new APIs, which hurts the portability of multicore applications.

Message Passing Model: MPI vs. MCAPI. MPI 2.1 targets inter-computer communication. It offers communication (point-to-point, collective operations, one-sided communication), synchronization (implicit synchronization included in synchronous communication), and additional features such as parallel I/O and process and group management, which makes it overkill for embedded platforms. MCAPI was proposed by the Multicore Association in 2008 and designed for embedded multicore platforms with an emphasis on networks-on-chip; its functionality covers message and channel APIs. However, it is not widely adopted, and the reference implementations by the HW vendors target x86/PowerPC systems.

The MSG Library

MeSsaGe Passing (MSG) Library: MSG sits between MPI and MCAPI. It supports blocking & non-blocking message passing, a subset of the collective operations, and one-sided communication, while leaving out the rest of the advanced MPI features (other variants of point-to-point communication, other collective operations, etc.) and MCAPI's channel API. MSG follows MPI's interface and MCAPI's philosophy: MPI is already a mature and popular interface for parallel computing, which gives better compatibility and portability.

Software Architecture: The library is organized in three layers. The application layer provides the management functions (init, exit, comm_rank, comm_size), point-to-point communication, collective communication, and one-sided communication. The core layer handles resource management (managing the internal data structures for data transmission) and the internal protocols for shared-memory and distributed-memory architectures. The porting layer contains the OS-dependent functions (mmap, shmget, DMA, memcpy, etc.) and the platform-dependent configuration functions.

Supported MSG APIs: Three types of communication protocols are supported: point-to-point communication (blocking & non-blocking send/recv), collective communication (barrier, one-to-many & many-to-one), and one-sided communication (put & get).

Design for Performance

Important Considerations 1. Placement of relay buffer 2. Buffer management and lock-free design 3. Data movement mechanism

Part I: Placement of Relay Buffers (1/3): Is an intermediate buffer needed for relaying the data? For blocking communication it is not required: synchronous communication is simple, efficient, and saves memory space. For non-blocking communication, a buffering mechanism becomes necessary to decouple the execution of sender and receiver, reduce the sender's wait time for the start and termination of data transmission, and avoid security problems by isolating sender and receiver; the cost is more complex protocol handling, longer latency, etc.

Part I: Placement of Relay Buffers (2/3): Typically, several memory modules with different capacities and access latencies are available on a system, so characterizing the communication performance of the different buffer placement schemes is necessary.

Part I: Placement of Relay Buffers (3/3): ITRI PAC Duo platform: one ARM core + 2 PAC DSPs. Three memory modules are available to a PAC DSP: DSP-side DRAM (128 MB), SRAM (128 KB), and local data memory (64 KB). Messages are placed in the different locations according to their sizes. (Figure: communication latency (ns) from ARM to DSP with the different types of relay buffers.)

Part II: Buffer Management & Lock-free Design: Scalability is affected by memory usage and by whether the design is lock-free. With 1-to-1 buffering there is one receive buffer (RB) for each pair of core connections, so N*(N-1) RBs are required for an N-core system; this avoids lock handling, but the memory usage must be minimized at the same time. With one receive buffer per core, the lock handling impacts performance. (Figure: the two layouts for four cores, per-pair RBs versus one lock-protected RB per core.)

Part II: Proposed Buffering Schemes (1/2): Buffer Pool scheme*: Request Queues (RQ) store small-size control messages, and the relay buffer pool allows connections to share buffers in the pool. Service Processor scheme: a dedicated core serves as a service processor and forwards requests from the sender to the receiver; this reduces the number of required RQs, but incurs longer communication latency (further optimization can improve the performance). (Figure: a four-core buffer pool layout, each core holding a relay buffer pool plus one RQ per peer, next to a service processor layout with RFQ 0..N-1 in front of cores 0..N-1, each holding one relay buffer pool and one RQ.)

Part II: Proposed Buffering Schemes (2/2): Buffer Pool scheme*: Request Queues (RQ) store small-size control messages, and the relay buffer pool allows connections to share buffers in the pool. Service Processor scheme: a dedicated core serves as a service processor and forwards requests from the sender to the receiver; this reduces the number of required RQs, but incurs longer communication latency (further optimization can improve the performance). The memory footprint (MB) of the 1-to-1 buffering, buffer pool, and service processor approaches:

#cores              2      4      8      16     32     64     128
1-to-1 buffering    0.125  0.75   3.5    15     62     252    1016
Buffer Pool         0.13   0.25   0.52   1.11   2.48   5.97   15.93
Service Processor   0.13   0.25   0.5    1.01   2.03   4.06   8.13

Part II: Reduce Memory Footprint — Buffer Pool: Request Queues (RQ) store small-size control messages, and the relay buffer pool allows connections to share buffers in the pool. Each core has one relay buffer pool and N-1 request queues, so an N-core system needs N*(N-1) request queues + N relay buffer pools. (Figure: four cores, each with its own relay buffer pool and one request queue per peer core.)

Part II: Alternative Solution — Service Processor: The buffer pool approach still needs N-1 request queues per core, which takes a lot of on-chip memory space. The service processor approach delegates one core to serve as the service processor (an HW or SW design). The service processor has N Request Forward Queues (RFQ), while each core has one relay buffer pool and one request queue, i.e., 2*N request queues + N buffer pools in total. (Figure: the service processor with RFQ 0..N-1 in front of cores 0..N-1, each with a relay buffer pool and one RQ.)

Part III: Data Movement Mechanism (1/2): There are several ways to move the data: the processor, via memory copy operations, or a DMA engine, via an I/O processor or a peripheral device. The DMA engine is commonly used to ease the burden on the processor: the processor moves small pieces of data quickly, while the DMA engine is better for large messages. But what is considered to be small or large?

Part III: Data Movement Mechanism (2/2): The threshold between the data transmission schemes must be determined per platform. E.g., on the PAC Duo, the ARM/DSP can move data between the two memory modules, and EMDMA saves ~25% of the time to move 64 B of data between SRAM and DRAM. (Figure: latencies for moving data blocks from 4 bytes to 16 KB using memcpy (via DSP) and EMDMA on the PAC Duo platform, with up to a 31.6x speedup.)

Experimental Results

Data-Parallel Application on IBM Cell for a Scalability Test: RC5, a block cipher (encryption and decryption), with the PPE as master and 8 SPEs as workers in a fork-join pattern. Three implementations on the Cell: (1) IBM Cell intrinsics (baseline), with DMA operations performed by hardware; (2) message passing with the MSG library (send/recv), which requires both sender and receiver to participate; (3) one-sided communication with the MSG library (put/get), introduced to improve performance on shared memory.

Scalability of MSG: The communication scheme affects performance and scalability (programmability vs. performance). Scalability of the three implementations: Cell intrinsics, 100%: good scalability, since the DMA hardware overlaps communication and computation. Send-recv, 85%~57%: acceptable scalability; the PPE becomes the bottleneck. Put-get, 99%~91%: good scalability, taking advantage of DMA; the remaining overhead is caused by the extra effort for handling data size and alignment problems. (Figure: encryption speedup on 1, 2, 4, and 8 SPEs for Cell intrinsics, msg_send/msg_recv, and msg_put/msg_get.)

Pipelined Execution Scheme on ITRI PAC Duo (1/2): A streaming multimedia application. The handheld device receives an encrypted data stream, decrypts the received data, then decodes and displays it.

Pipelined Execution Scheme on ITRI PAC Duo (2/2): Inter-core communications are done with the MSG one-sided functions. A double-buffering scheme allows overlapped communication and computation on the DSP cores. Two configurations were experimented with: solo ARM achieves 2.2 frames/second, while offloading SSL/JPEG for pipelined execution achieves 14.6 frames/second. We are exploring different offloading schemes.

Summary

Summary: We share our experiences in designing the MSG library for embedded multicore systems. We implemented MSG on several platforms: IBM Cell, ITRI PAC Duo, TI DaVinci, and x86/x86-64. The scalability and applicability of MSG are presented with case studies. Currently, the MSG implementation is available on OpenFoundry; any feedback is welcome.