OpenMP for Heterogeneous Multicore Embedded Systems using MCA API standard interface


Sunita Chandrasekaran, Peng Sun, Suyang Zhu, Barbara Chapman
HPCTools Group, University of Houston, USA
In collaboration with Freescale Semiconductor (FSL) and Semiconductor Research Corporation (SRC)
OpenMP SC14

2 Real-world applications using embedded systems
FPGAs used to receive a number of acoustic pings to form an image
DSP, a device to interact with a user through the embedded processor within
A self-piloted car powered by the NVIDIA Jetson TK1
11/18/14


3 Embedded programmers' requirements
Write once, reuse anywhere
o Avoid rewriting from scratch
o Provide incremental migration path essential for application codes
o Exploit multiple levels of parallelism with familiar and/or commodity programming models
o None are perfect, but industry adoption is critical

4 OpenMP widely adopted standard
Industry standard for shared memory parallel programming
o Supported by a large number of vendors (TI, AMD, IBM, Intel, Cray, NVIDIA, HP, ...)
o OpenMP 4.0 provides support for compute devices such as Xeon Phis, GPUs and others
High-level directive-based solution
o Compiler directives, library routines and environment variables
o Programming model: fork-join parallelism
o Threads communicate by sharing variables
o Synchronizations to control race conditions
o Structured programming to reduce likelihood of bugs
Programming made easy!

    void do_huge_comp(double x);   /* defined elsewhere */

    int main(void) {
        double Res[1000];
        /* Loop iterations are divided among the threads of the team. */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            do_huge_comp(Res[i]);
        }
        return 0;
    }

5 OpenMP Parallel Computing Solution Stack
User layer: OpenMP application
Programming layer (OpenMP API): directives, compiler; OpenMP library; environment variables
System layer: runtime library; OS/system support for shared memory
Cores: Core 1, Core 2, ..., Core n


6 OpenMP for Embedded Systems
Programming embedded systems is a challenge
Need high-level standards such as OpenMP
Runtime relies on lower-level components
o OS, threading/hardware libraries, memory allocation, synchronization, e.g. Linux, Pthreads
o But embedded systems typically lack some of these features
OpenMP has a shared-memory, cache-coherent memory model
o But embedded platforms feature distributed, nonuniform memory, with no cache coherency
Vocabulary for heterogeneity is required in the embedded space
o OpenMP 4.0 is there!


7 Multicore Association (MCA) industry standard APIs
MCA Foundation APIs
Communications (MCAPI)
- Lightweight messaging
Resource Management (MRAPI)
- Basic synchronization
- Shared/distributed memory
- System metadata
Task Management (MTAPI)
- Task lifecycle
- Task placement
- Task priority
SW/HW Interface for Multicore/Manycore (SHIM)
- XML hardware description from a software perspective

8 Tasks in Heterogeneous Systems
[Figure: an MTAPI application spans a domain of four nodes, each running MTAPI tasks on top of the MTAPI runtime system (optionally MCAPI / MRAPI) and a scheduler/library; the nodes sit on OS 1 and OS 2 and map to CPU cores, a DSP and a GPU with their memories]
Tasks execute a job, implemented by an action function, on a local or remote node
Tasks can be started individually or via queues (to influence the scheduling behavior)
Arguments are copied to the remote node
Results are copied back to the node that started the task
Sven Brehmer, MCA Talk at the OpenMP Booth, SC13

9 Tasks, Jobs, Actions
[Figure: a task-based application on Node 1 with tasks a to d, jobs a to d, actions a and b on Node 2, and actions a, c and d on Node 3]


10 Runtime Design and Optimizations
Optimized Thread Creation, Waiting and Awakening
o All threads in a team cannot be identical
o Uses MRAPI meta data primitives
o Avoid over-subscription
o Distributed spin-waiting
Synchronization Construct
Memory Model
o Uses MRAPI shared/remote memory primitives

11 Freescale's communication processor with data path
QorIQ P4080 processor
o 4-8 Power Architecture e500mc cores
o Accelerators: Encryption (SEC), Pattern Matching Engine (PME)
Target applications:
o Aerospace and defense
o Ethernet switch, router
o Pre-crash detection
o Forward collision warning
http:// code=p4080&tid=redp


12 Portable OpenMP Implementation
Translated OpenMP for MPSoCs
Used Multicore Association (MCA) APIs as target for our OpenMP translation
Developed MCA-based runtime:
o Portable across MPSoCs
o Light-weight
o Supports non-cache-coherent systems
o Performance comparable to customized vendor-specific implementations
[Figure: software stack]
Application Layer: OpenMP Applications
OpenMP Programming Layer: Directives, Runtime Library Routines, Environment Variables; OpenMP Runtime Library
MCA APIs Layer: MRAPI, MCAPI, MTAPI
System Layer: Operating Systems (or Virtualization)
Hardware Layer: Multicore Embedded Systems


13 Compilation Process
OpenUH as our frontend: open source, optimizing compiler suite for C, C++ and Fortran, based on Open64
o Translates C+OpenMP source into C with OpenMP runtime function calls
PowerPC-GCC as our backend to generate the object file and libraries
Final executable file is generated by linking the object file, our OpenMP runtime library and the MCA runtime library
[Figure: compilation flow]
OpenMP source code (app.c) -> OpenUH compiler, frontend source-to-source translation -> bare C code with OpenMP runtime library calls (app.w2c.c) -> Power Architecture GCC compiler -> object code (app.w2c.o)
OpenMP runtime library -> Power Architecture GCC compiler -> libeomp
MCA libraries -> Power Architecture GCC compiler -> libmca
app.w2c.o + libeomp + libmca -> Power Architecture GCC linker -> app.out, the executable image running on the board (dual-core Power processor from Freescale Semiconductor)


14 Enhanced OpenMP runtime vs proprietary runtime
[Figure: four plots (Dijkstra, Jacobi, FFT, LU Decomposition) of normalized execution time versus number of threads, comparing libgomp and libeomp]


15 OpenMP + MCA for heterogeneous systems
QorIQ P4080 processor
o 4-8 Power Architecture e500mc cores
o Accelerators: Encryption (SEC), Pattern Matching Engine (PME)
Target applications:
o Aerospace and defense
o Ethernet switch, router
o Pre-crash detection
o Forward collision warning
http:// code=p4080&tid=redp

16 Creating an MCA wrapper
For communication between the Power processor and the PME
[Figure: MCA wrapper, DMA]

17 MCAPI Connectionless technique

18 MxAPI for accelerators: lessons learned
PME does not share memory with the main processor
Data movement via a DMA channel
Required thorough knowledge of a low-level API that is very tightly tied to the hardware
o Time consuming
o Requires constant support from the vendor to understand the low-level API
Created an MCA wrapper to abstract all low-level details
o However, the wrapper can only be used for devices that rely on that same low-level API


19 MTAPI Design and Implementation: Current Status
On-going work:
o Implementing MTAPI features
o Writing small test cases to validate the MTAPI implementation
Need to evaluate the MTAPI implementation on a heterogeneous multicore platform
Preliminary results demonstrated overhead while communicating with a remote node through MCAPI


20 OpenMP and MCA API
Application Layer: OpenMP Applications
OpenMP Programming Layer: Directives, Runtime Library Routines, Environment Variables; OpenMP Runtime Library
MCA APIs Layer: MRAPI, MCAPI, MTAPI
System Layer: Operating Systems (or Virtualization). No OS??
Hardware Layer: Multicore Embedded Systems


21 Summary
Extend OpenMP runtime library with MxAPI as the translation layer to target heterogeneous multicore SoCs
o MTAPI prototype implementation on-going @ UH
SIEMENS : /10/31/siemens-produces-opensource-code-multicore-acceleration/
o Targeted specialized accelerators

22 Accelerators are more than just GPUs
[Figure: timeline running from before 2000 through 2010 to 2013 onwards. Devices shown include CPUs, DSPs, FPGAs (Virtex 5, Virtex 7), ARM, Intel Sandy Bridge, Intel Haswell, Intel Xeon Phi, AMD Berlin, AMD Warsaw, IBM Power 7, IBM Power 8, IBM Cyclops64, Cell BE, Blue Gene/Q, SGI RASC, Xtreme DATA, Convey, Tilera, and Nvidia Kepler, Maxwell, Pascal and Volta]

23 MCAPI Communication API
Messages: connectionless
- More flexible, less configuration
- Blocking and non-blocking
- Prioritized messages
Packets and scalars: connected
- More efficient, more configuration
- Blocking and non-blocking packets
- Blocking scalars
Connectionless has more flexibility; connected is more efficient
[Figure: connectionless message. In MCAPI Domain 1, MCAPI Node 1 has endpoints <1,1,1> (port 1) and <1,1,2> (port 2), and sends a connectionless message to endpoint <1,2,1> (port 1) on MCAPI Node 2. Each endpoint has attributes.]
[Figure: connection-oriented channels. In MCAPI Domain 1, a packet channel links endpoint <1,1,2> with endpoint <1,3,1> on MCAPI Node 3, and a scalar channel links endpoint <1,4,1> on MCAPI Node 4 with endpoint <1,3,2>.]

24 MRAPI Memory Concept
[Figure: three nodes (node0 on core0, node1 on core1, node2 on core2), each with local memory. A remote memory (rmem) buffer is reached through mrapi_rmem_handle with MRAPI API reads/writes, backed by local buffers, a software cache or a DMA engine; local data is reached through pointers with native reads/writes. Legend: program data accesses, data movement, MRAPI API calls, hardware/implementation-specific MRAPI implementation activity.]


EuroPAR 2016 ROME Workshop Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API Suyang Zhu 1, Sunita Chandrasekaran 2, Peng Sun 1, Barbara Chapman 1, Marcus Winter 3,