Integrating DMA capabilities into BLIS for on-chip data movement

Devangi Parikh, Ilya Polkovnichenko, Francisco Igual Peña, Murtaza Ali

5 Generations of TI Multicore Processors

Keystone architecture
- Lowers development effort
- Speeds time to market
- Leverages TI's investment
- Optimal software reuse

TI 66AK2H12 SoC (Keystone II architecture)

Cores
- 4 ARM A15s at 1.0 GHz
  - 4 MB shared cache
  - 32 GFLOPS single precision and 8 GFLOPS double precision
- 8 C66x DSPs at 1.0 GHz
  - 64 KB L1 scratch/cache and 1 MB L2 scratch/cache each
  - 128 GFLOPS single precision and 32 GFLOPS double precision

Memory
- 8 GB DDR3 DRAM (external)
- 6 MB SRAM shared

Interfaces
- 2x Gigabit Ethernet (~100 MB/s)
- 4x SRIO (~400 MB/s)
- 2x Hyperlink (~1 GB/s)

Development Philosophy

User view
- Embedded Linux running on the ARM
- Standard GCC tool chain
- Simply link to a TI-provided library with an ARM-callable API to accelerate applications using multiple ARM cores, DSP cores and processors as appropriate
- Use TI-provided tools and examples to write new applications and libraries which use multiple ARM cores, DSP cores and processors to accelerate performance

Using multiple cores on a single processor
- OpenMP for shared-memory parallelization across ARM cores
- OpenCL or OpenMP Accelerator for heterogeneous acceleration with multiple DSP cores

Using multiple processors
- Open MPI over Ethernet, SRIO or Hyperlink

[Diagram: user view of a Library API over OpenMP, OpenCL and Open MPI, spanning the ARM and DSP cores of multiple processors, with TI- or user-provided acceleration]
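
The combination above can be pictured with a minimal hybrid Open MPI + OpenMP program: one MPI rank per processor, one OpenMP thread per ARM core. The transport between processors (Ethernet, SRIO or Hyperlink) is selected by the Open MPI runtime, not in the source; the build command in the comment is an assumption about the tool chain.

    /* Minimal hybrid sketch: one MPI rank per processor, OpenMP threads across ARM cores.
       Assumed build: mpicc -fopenmp hello_hybrid.c -o hello_hybrid */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);                 /* one rank per 66AK2H12 processor */

        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        #pragma omp parallel                    /* one thread per ARM core */
        printf("rank %d of %d, ARM thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }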

ARM + OpenCL Acceleration

[Diagram: two views of the TI 66AK2H12 — the ARM subsystem (ARM 0-3 under OpenMP) dispatching work via OpenCL to the DSP subsystem (DSP 0-7), with or without OpenMP inside the DSP subsystem]

Data parallel
- A kernel is enqueued
- OpenCL divides it into N workgroups
- Each workgroup is assigned to a DSP core
- After all workgroups finish, a new kernel can be dispatched

Task parallel
- A task is enqueued
- OpenCL dispatches tasks to DSP cores
- OpenCL can accept and dispatch more tasks asynchronously

OpenCL + OpenMP regions
- A task is enqueued
- OpenCL dispatches the task to DSP 0
- Tasks can use additional DSP cores by entering OpenMP regions
- A task completes before another task is dispatched
- Note: this is a TI extension
- Example use: calling existing OpenMP-based DSP code from the ARM
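
A minimal host-side sketch of the data-parallel mode, assuming the DSP is exposed as an OpenCL accelerator device and a hypothetical "vadd" kernel; error checking is omitted and n is assumed to be a multiple of 8 so one work-group lands on each C66x core.

    /* Sketch only: enqueue a kernel so the runtime splits it into 8 work-groups,
       one per DSP core. Kernel name "vadd" and the device type are assumptions. */
    #include <CL/cl.h>

    void run_vadd_on_dsp(const char *src, size_t n, cl_mem a, cl_mem b, cl_mem c)
    {
        cl_platform_id platform;
        cl_device_id   dsp;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &dsp, NULL);

        cl_context       ctx = clCreateContext(NULL, 1, &dsp, NULL, NULL, NULL);
        cl_command_queue q   = clCreateCommandQueue(ctx, dsp, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dsp, NULL, NULL, NULL);
        cl_kernel  k    = clCreateKernel(prog, "vadd", NULL);

        clSetKernelArg(k, 0, sizeof(cl_mem), &a);
        clSetKernelArg(k, 1, sizeof(cl_mem), &b);
        clSetKernelArg(k, 2, sizeof(cl_mem), &c);

        size_t global = n;                 /* n work-items...                       */
        size_t local  = n / 8;             /* ...in 8 work-groups, one per DSP core */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
        clFinish(q);                       /* all work-groups finish before the next kernel */
    }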

ARM + OpenMP Accelerator Acceleration

    // OpenMP Accelerator vector add
    // OpenMP for-loop parallelization
    void ompvectoradd(int N, float *a, float *b, float *c)
    {
        #pragma omp target map(to: N, a[0:N], b[0:N]) map(from: c[0:N])
        {
            int i;
            #pragma omp parallel for
            for (i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        }
    }

[Diagram: TI 66AK2H12 — the ARM subsystem (OpenMP across ARM 0-3) offloading via the OpenMP Accelerator model to the DSP subsystem (OpenMP across DSP 0-7)]

Data movement
- to: copies variables from the ARM memory to the DSP memory
- from: copies variables from the DSP memory to the ARM memory
- TI provides special alloc and free functions to allocate memory such that copies are not needed

Calling existing DSP code from the ARM
- Wrapping existing DSP functions with OpenMP Accelerator code is straightforward
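
A short usage sketch of the vector add above, as called from the ARM side. Plain malloc is used here, so the map clauses copy a and b to DSP memory and c back; the TI-provided allocation functions mentioned on the slide (not shown here) would avoid those copies.

    #include <stdio.h>
    #include <stdlib.h>

    void ompvectoradd(int N, float *a, float *b, float *c);

    int main(void)
    {
        const int N = 1 << 20;
        float *a = malloc(N * sizeof *a);
        float *b = malloc(N * sizeof *b);
        float *c = malloc(N * sizeof *c);

        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        ompvectoradd(N, a, b, c);          /* offloaded to the DSP via omp target */

        printf("c[10] = %f (expected %f)\n", c[10], a[10] + b[10]);
        free(a); free(b); free(c);
        return 0;
    }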

Memory

Shared memory visible by both the ARM and DSP
- A portion of the 8 GB DDR3 DRAM (external)
- The 6 MB SRAM shared memory

Performance keys
- Allocate data in the shared memory for ARM setup and DSP acceleration
- Use clmalloc() to allocate contiguous blocks that can be efficiently transferred using DMA

Options
- Let the tools take care of the data movement using assign-workgroup and strided-copy functions
- Manually manage the data movement using DMA (e.g., define buffers available for the DSP in OpenCL and manage the actual data movement on the DSP)

[Diagram: TI 66AK2H12 memory map — 8 GB external DRAM, 4 MB ARM shared memory, and 6 MB ARM and DSP shared memory, accessible to ARM 0-3 and DSP 0-7]

Dense Linear Algebra Philosophy

BLIS Cortex-A15 DGEMM Multicore Performance
- Peak performance: 9.6 GFLOPS
- DGEMM performance: ~8.4 GFLOPS (83% of peak)

Recall - Memory

How can we improve this performance?
- The BLIS implementation on the DSP does not utilize the different levels of memory efficiently
- Utilize the DMA (Direct Memory Access) capabilities of the DSP to move data in parallel with the computations

[Diagram: TI 66AK2H12 memory map — 8 GB external DRAM, 4 MB ARM shared memory, and 6 MB ARM and DSP shared memory, accessible to ARM 0-3 and DSP 0-7]
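
Moving data in parallel with the computations amounts to classic double buffering: while the DSP computes on one on-chip buffer, the next block is streamed into the other. The sketch below is generic; dma_start/dma_wait are placeholders (backed here by memcpy, much like the prototype described later), not TI's actual DMA interface.

    #include <string.h>

    /* Placeholders: a real implementation would issue an asynchronous EDMA transfer
       and block on its completion; here memcpy stands in for the transfer. */
    static void dma_start(float *dst, const float *src, int elems)
    { memcpy(dst, src, (size_t)elems * sizeof *dst); }
    static void dma_wait(float *dst) { (void)dst; }

    /* Stream nblocks blocks from external DDR through two L2 buffers (ping-pong):
       while compute() runs on buffer `cur`, the next block is (conceptually)
       being transferred into buffer `nxt`. */
    void process_blocks(const float *ddr, float *l2_buf[2],
                        int nblocks, int block_elems,
                        void (*compute)(float *, int))
    {
        int cur = 0;
        dma_start(l2_buf[cur], &ddr[0], block_elems);            /* prefetch block 0 */

        for (int b = 0; b < nblocks; b++) {
            int nxt = 1 - cur;
            if (b + 1 < nblocks)                                 /* kick off the next transfer */
                dma_start(l2_buf[nxt], &ddr[(size_t)(b + 1) * block_elems], block_elems);

            dma_wait(l2_buf[cur]);                               /* current block is resident  */
            compute(l2_buf[cur], block_elems);                   /* overlaps with the transfer */
            cur = nxt;
        }
    }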

Cache Exploitation and DMA

Cache Exploitation and DMA Details

DMA Integration Goals
- Flexible: the user or library developer must be able to select when and where to transfer data for an operation
- Transparent: the user must not need to be aware of the usage of the DMA, but can manage the DMA if desired
- Integrated into the control tree mechanism
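
One way to picture "integrated into the control tree mechanism" is a DMA node attached to each blocking level. The sketch below is hypothetical (the field names and types are not the actual BLIS cntl_t layout) and only illustrates the information such a node would carry.

    /* Hypothetical DMA control-tree node; names do not reflect the real BLIS API. */
    typedef enum { DMA_OFF, DMA_PACK_A, DMA_PACK_B, DMA_PACK_AB } dma_policy_t;

    typedef struct dma_cntl_s {
        dma_policy_t       policy;      /* which operands are staged by DMA at this level   */
        int                mem_space;   /* target buffer: L2 SRAM, MSMC SRAM or DDR (coded) */
        int                double_buf;  /* nonzero: ping-pong buffers to overlap transfers  */
        struct dma_cntl_s *sub;         /* node for the next level of blocking              */
    } dma_cntl_t;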

Algorithmic Variants for GEMM
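
The figure for this slide is not part of the transcription. For context, the blocked loop structure that the variants partition is sketched below in simplified, compilable form (column-major storage, illustrative block sizes, and a plain triple loop in place of the packed micro-kernel); the comments mark the packing steps that the DMA work targets.

    /* Simplified blocked GEMM loop nest (Goto/BLIS ordering), C += A*B,
       column-major with leading dimensions m, k and m. */
    #define NC 512
    #define KC 256
    #define MC 128

    static int imin(int a, int b) { return a < b ? a : b; }

    void gemm_blocked(int m, int n, int k,
                      const double *A, const double *B, double *C)
    {
        for (int jc = 0; jc < n; jc += NC)         /* loop 5: panel of B columns      */
          for (int pc = 0; pc < k; pc += KC)       /* loop 4: BLIS packs B here (DMA) */
            for (int ic = 0; ic < m; ic += MC) {   /* loop 3: BLIS packs A here (DMA) */
                int nc = imin(NC, n - jc), kc = imin(KC, k - pc), mc = imin(MC, m - ic);
                /* loops 2 and 1 plus the micro-kernel, collapsed into a triple loop: */
                for (int j = 0; j < nc; j++)
                  for (int p = 0; p < kc; p++)
                    for (int i = 0; i < mc; i++)
                      C[(size_t)(jc + j) * m + (ic + i)] +=
                          A[(size_t)(pc + p) * m + (ic + i)] * B[(size_t)(jc + j) * k + (pc + p)];
            }
    }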

GEMM Control Tree Definitions

Algorithmic Variants for GEMM with DMA Integration

GEMM Control Tree Definitions with DMA Integration

Memory Buffers

Current Status of DMA Integration in GEMM

Implemented
- Multithreaded prototype of the DMA control tree, with decoding in Block Variant 1, using memcpy instead of DMA

Pending
- Decoding of the DMA control tree in the other variants
- Invoking the DMA routines
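
A hedged sketch of the decode-and-stage step described above, reusing the hypothetical dma_cntl_t node from the earlier sketch: in the current prototype a memcpy stands in for the transfer, and switching to the real (asynchronous) DMA routines is the pending part.

    #include <string.h>

    /* Returns the buffer the computation should read from. Prototype behaviour:
       memcpy instead of DMA; a real version would submit a non-blocking DMA
       transfer and wait for it, so it overlaps work on the previous panel. */
    const double *stage_panel(const dma_cntl_t *node, double *dst,
                              const double *src, size_t bytes)
    {
        if (node == NULL || node->policy == DMA_OFF)
            return src;                    /* no staging at this level: read the source in place */

        memcpy(dst, src, bytes);           /* prototype: memcpy stands in for the DMA transfer   */
        return dst;
    }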

Thank you!

A special thanks to Tyler M. Smith, Field G. Van Zee and Robert van de Geijn.