From Application to Technology: OpenCL Application Processors. Chung-Ho Chen, Computer Architecture and System Laboratory (CASLab), Department of Electrical Engineering and Institute of Computer and Communication Engineering, National Cheng-Kung University, Tainan, Taiwan.

Welcome to this talk. From application: the OpenCL platform model. To technology: micro-architecture implications, and an OpenCL runtime on a many-core system. Summary.

Agenda: OpenCL platform model; micro-architecture implications; an OpenCL runtime on a many-core system; summary.

Introduction of OpenCL. Parallel processing platforms: multi-core CPUs, DSPs, GPUs. The goals are data parallelism and portability: OpenCL is a framework for building parallel applications that are portable across heterogeneous platforms.

Khronos standards ecosystem (figure: where OpenCL sits among the Khronos standards).

Key players: NVIDIA, Apple, ARM, IBM, Intel, AMD, Broadcom, Samsung — chip makers, FPGA vendors, system OEMs, and SoC designers.

OpenCL Specification. Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), and other processors. Language: OpenCL defines an OpenCL C language for writing kernels (functions that execute on OpenCL devices) and also defines many built-in functions. OpenCL also defines application programming interfaces (APIs). Platform layer API: a hardware abstraction layer to query, select, and initialize compute devices, and to create compute contexts and work queues. Runtime API: execute kernels, manage thread scheduling and memory resources.

Common hardware abstraction: abstraction makes life easier for those above. An OpenCL device is viewed by the OpenCL programmer as a single virtual processor. Just as the instruction set architecture lets software (OSes) ignore whether the processor underneath is non-pipelined, pipelined, superscalar, or out-of-order, and whether it runs at 600 MHz, 1 GHz, or 2 GHz, the OpenCL platform API hides whether the hardware is a multi-core CPU or a GPU.

From sequential to parallel: define an N-dimensional computation domain (N = 1, 2, or 3) and execute the kernel code once for each point in the domain.

OpenCL Platform Model: a hierarchical system. (1) A host with one or more compute devices; (2) a compute device contains one or more compute units; (3) a compute unit contains one or more processing elements.

Introduction of OpenCL: the execution model. An application runs on a host; the host submits work to the compute devices through enqueue operations. Work-item: the basic unit of work (the work or data one thread operates on). Kernel: the code for a work-item; an invocation of the kernel is a thread. Program: a collection of kernels plus library functions. Context: the environment in which work-items execute, including command queues, the memories used, and the devices.

Work-items are specified in N dimensions (here N = 2) and are grouped into workgroups, e.g., 64x64. Kernels execute across a global domain of work-items that covers the whole problem space. Synchronization between work-items is possible within a workgroup (barrier); work-items cannot synchronize outside a workgroup.
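To illustrate workgroup-scoped synchronization, here is a minimal sketch (not the talk's code; the kernel name and the power-of-two workgroup size are my assumptions) in which each workgroup reduces its slice of the input with barrier():

    __kernel void sum_group(__global const float *in,
                            __global float *out,
                            __local  float *scratch)
    {
        size_t lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);            /* syncs this workgroup only */
        for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);        /* every work-item must reach it */
        }
        if (lid == 0)
            out[get_group_id(0)] = scratch[0];   /* one partial sum per workgroup */
    }

Combining the per-workgroup results takes a second kernel launch (or host code), precisely because there is no barrier across workgroups.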

OpenCL Memory Model. Private memory: per work-item. Local memory: shared within a workgroup (one per workgroup, inside the SIMT engine). Global/constant memory: visible to all workgroups. Host memory: on the host CPU. In OpenCL 1.2, data must be moved explicitly from host memory to global memory and back; OpenCL 2.0 adds shared virtual memory between host and device.
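In OpenCL C these regions correspond to address-space qualifiers; a small hedged sketch (kernel name illustrative):

    __kernel void scopes(__global   float *g,   /* global: visible to all workgroups */
                         __constant float *c,   /* constant: read-only, visible to all */
                         __local    float *l)   /* local: shared within one workgroup */
    {
        float p = g[0] + c[0];       /* p lives in private memory, per work-item */
        l[get_local_id(0)] = p;
    }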

Compilation: a dynamic runtime-compilation model. The source is first compiled to an Intermediate Representation (IR), an assembly for a virtual machine; this is done once. The application then loads the IR and compiles it into machine code for the target device.

LLVM example (Low Level Virtual Machine). Front-end compiler: source code to IR. Back-end compiler: IR to target code — IR to a CPU ISA, or IR (PTX) to a GPU ISA. Flow: OpenCL source code (test.cl) → front-end compiler → IR → back-end compiler → hardware executable binary (test.bin), delivered to the hardware device through the device driver and source scheduler; e.g., the LLVM back-end emits ARM assembly, which ARM-GCC turns into an ARM binary.

A Simple Example. The built-in function get_global_id(0) returns the thread id. Each PE executes the same kernel code and uses its thread id to access its own data — hence, SIMT. The slide shows a one-dimensional CL kernel that gets its processor (thread) id, alongside the host C code: a run-control thread that creates the memory objects (clCreateBuffer), the number of threads to create, and the workgroup size.
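A minimal sketch of such a one-dimensional kernel (illustrative; not the slide's exact code):

    /* Every PE runs this same code (SIMT); get_global_id(0) is the
       thread id each work-item uses to index its own data. */
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        size_t i = get_global_id(0);
        c[i] = a[i] + b[i];
    }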

Host control thread example, in four steps: (1) set up the execution environment; (2) create memory/program objects; (3) prepare parameters and copy data for execution; (4) enqueue for execution and read back.

Setup execution environment. Get the available devices: clGetPlatformIDs returns the number of available platforms and pointers to them; clGetDeviceIDs returns the list of devices available on a platform. Create a context: clCreateContext returns a pointer to the context structure the runtime uses to manage objects such as memory, command queues, programs, and kernels. Create a command queue: clCreateCommandQueue returns a pointer to the command queue through which the programmer enqueues a set of operations.

Create memory/program objects. Create memory objects: clCreateBuffer returns a pointer to a memory object that records the relationship between the host memory and device memory regions. Create program and kernel objects: clCreateProgramWithSource uses the CL source code to generate a program object; clBuildProgram compiles the source code of the target program object (a program may contain more than one kernel source); clCreateKernel designates the kernel that is going to run.

Prepare parameters and copy data for execution. Set up kernel parameters: clSetKernelArg sets the parameters of the kernel function. Copy data from main memory to the target device: clEnqueueWriteBuffer writes data from main memory to the target device.

Enqueue for execution and read back. Execute the kernel in N dimensions: clEnqueueNDRangeKernel declares an N-dimensional work-space (global_size) for execution and subdivides it into workgroups by setting local_size. Read back the results: clEnqueueReadBuffer reads data from the target device back into main memory.
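Putting the four steps together, a minimal hedged sketch of a complete host program (error handling omitted; vec_add is the kernel sketched earlier, and N and the workgroup size of 64 are illustrative choices, not the talk's code):

    #include <stdio.h>
    #include <CL/cl.h>

    #define N 1024

    static const char *src =
        "__kernel void vec_add(__global const float *a,\n"
        "                      __global const float *b,\n"
        "                      __global float *c) {\n"
        "    size_t i = get_global_id(0);\n"
        "    c[i] = a[i] + b[i];\n"
        "}\n";

    int main(void)
    {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

        /* 1. Set up the execution environment. */
        cl_platform_id plat;  cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* 2. Create memory and program objects (on-the-fly compile). */
        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof a, NULL, NULL);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof b, NULL, NULL);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vec_add", NULL);

        /* 3. Prepare parameters and copy data to the device. */
        clSetKernelArg(k, 0, sizeof da, &da);
        clSetKernelArg(k, 1, sizeof db, &db);
        clSetKernelArg(k, 2, sizeof dc, &dc);
        clEnqueueWriteBuffer(q, da, CL_TRUE, 0, sizeof a, a, 0, NULL, NULL);
        clEnqueueWriteBuffer(q, db, CL_TRUE, 0, sizeof b, b, 0, NULL, NULL);

        /* 4. Enqueue for execution and read back. */
        size_t global = N, local = 64;   /* N work-items, workgroups of 64 */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

        printf("c[10] = %f (expect 30.0)\n", c[10]);
        return 0;
    }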

Executed in-order or out-of-order: the host-side steps (set up the execution environment, create memory/program objects, prepare parameters and copy data, enqueue for execution) feed the devices through command queues, which may be in-order (IOQ) or out-of-order (OOOQ); whether the runtime supports OOOQ is an open question for the implementation.

Agenda: OpenCL platform model; micro-architecture implications; an OpenCL runtime on a many-core system; summary.

Now: device-architecture implications of the OpenCL programming model — SIMT, the SIMT ISA, and SIMT instruction scheduling.

SIMT machine. Which architecture features are useful or critical for OpenCL-based computation? SIMT — single-instruction, multiple-threading: the same instruction stream works on different data.

Single-Instruction Multiple-Threading. A thread == a work-item. Fetch one instruction and dispatch it to every processing unit: fetching one stream puts N threads (of the same code) into execution. Each thread is independent of the others, and all cores execute instructions in lockstep. (Diagram: a single stream I1..I5 broadcast to N cores, each working on its own data, Data 1..Data N.)

SIMT: Single-Instruction Multiple-Threading — clarifying what is what. What is the "single instruction"? The one stream broadcast to the N cores. What are the threads (work-items)? Each thread is an instruction stream in execution: the same stream I1, I2, I3, I4 paired with its own data (thread 2 is the stream executing on Data 2, and so on).

Fetching mechanism for SIMT. Instruction fetching: an instruction fetcher is needed so each core or PE gets its instruction (a PE may also free-run). Data fetching: an efficient way is needed to fetch each PE's data from global memory (DRAM).

ISA issues for a SIMT PE: the branch problem. Regular branches cannot be used in SIMT because if some PEs continue at I3 while others branch to I5, there is no single instruction stream anymore. (Diagram: a BEQ inside the stream I1, I2, ..., I5 splits the N cores onto different paths.)

Conditional execution for SIMT: if-conversion uses predicates to transform a conditional branch into single-control-stream code.

Source:

    if (r1 == 0)  add r2, r3, r4
    else          sub r2, r7, r4
    mov r5, r2

Code using branches:

    f0:  cmp r1, #0
    f4:  beq 0x100
    f8:  sub r2, r7, r4
    fc:  bne 0x104
    100: add r2, r3, r4
    104: mov r5, r2

If-converted code:

    cmp   r1, #0
    addeq r2, r3, r4
    subne r2, r7, r4
    mov   r5, r2

ISA issues for SIMT: no branches. Each PE simply executes the same instruction stream; if its condition is met it commits the result, otherwise the instruction becomes a nop. Problem: low PE utilization — poor performance for branch-rich applications, and poor performance in the SISD case (clEnqueueTask, where the kernel is executed by a single work-item). (Diagram: the predicated stream cmp / addeq / subne / mov broadcast to N cores.)

Now: device-architecture implications of the OpenCL programming model — SIMT, the SIMT ISA, and SIMT instruction scheduling.

SIMT: a SIMD streaming machine with pipelined PEs/cores. How can long-latency instructions be tolerated — cache misses, complex integer instructions, expensive floating-point operations? (Diagram: the SIMD streaming unit issues cmp r1, #0 / addeq r2, r3, r4 / subne r2, r7, r4 / mov r5, r2 / ldr r4, [r2, #32] to N cores.)

Multithreaded categories (figure: issue slots over time, in processor cycles): superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading, shown with five threads and idle slots.

Warp (figure slide).

Terminology: barrel threading / interleaved multithreading. Cycle i+1: an instruction from instruction stream (warp) A is issued; cycle i+2: an instruction from instruction stream (warp) B is issued. The purpose of this style of multithreading is to remove all data-dependency stalls from the execution pipeline, since each warp is independent of the other warps. It was first called barrel processing, in which the staves of a barrel represent the pipeline stages and their executing threads; interleaved, pre-emptive, fine-grained, or time-sliced multithreading are the more modern terms.

Warp scheduler. The SIMT machine's fetcher fetches warps of instructions and stores them in a warp queue; the warp scheduler issues (broadcasts) one instruction from a ready warp to the PEs of the SIMT machine. (Diagram: with warp1 = I1 I2 I3 I4 I5, warp2 = J1 J2 J3 J4 J5, ..., warpN = z1 z2 z3 ..., and an N-stage functional unit, the scheduler issues I1, J1, ..., z1 in successive cycles. After N cycles I1 completes and the scheduler wraps back to issue I2 of warp1 — so even if I2 depends on I1, there is a window of N cycles to hide the execution latency.)
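A hedged sketch of this round-robin issue logic (the types and the issue_to_pes helper are illustrative assumptions, not any real GPU's implementation):

    typedef struct { int pc; int ready; } warp_t;

    void issue_to_pes(int pc);   /* assumed helper: broadcast one instruction */

    /* Issue one instruction per cycle from successive warps: with num_warps
       warps in rotation, a warp's next instruction issues num_warps cycles
       after its previous one, hiding up to that many cycles of latency. */
    void warp_schedule(warp_t *warps, int num_warps, unsigned max_cycles)
    {
        for (unsigned cycle = 0; cycle < max_cycles; ++cycle) {
            warp_t *w = &warps[cycle % num_warps];
            if (w->ready)
                issue_to_pes(w->pc++);
        }
    }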

Example: Fermi GPU architecture. SMEM is "shared memory" in Fermi terms, but it is actually a private local scratchpad for communication within a thread block (workgroup). Data memory hierarchy: registers, L1, L2, global memory; L1 plus the local scratchpad form a configurable 64 KB block.

Example: Fermi floor plan. 40-bit address space; 40 nm TSMC process; 3x10^9 transistors (vs. 2.3x10^9 for Nehalem-EX); roughly 1.x GHz; 16 multithreaded SIMD processors (SMs), each with 32 cores + 16 load/store units + 4 SFUs; PCI Express interface; 64-bit channels; per-SM L1/SMEM and scheduler. The thread-block distributor is the workgroup distributor, and an SM holds up to 8 thread blocks. This GPU is a multiprocessor composed of multithreaded SIMD processors.

A streaming multiprocessor, i.e., a multithreaded SIMD processor. An SM consists of 32 CUDA cores, some 16 load/store units, and 4 special function units. Registers: 32K x 32-bit words. An L1 data cache private to each SM; an L1 instruction cache; a unified L2 for data and texture (instructions?), shared globally and coherent across all SMs. (Diagram: dual instruction dispatch pairs instructions from different warps — (A,B), (A,C), (B,C), (A,D), (B,D), (C,D), etc.)

Warp scheduler in Fermi. A warp is simply an instruction stream of SIMD instructions; a workgroup consists of several warps. (Diagram: a queue of warps from which the scheduler picks instructions from ready warps — one instruction issues at time t1, another from warp 8 at t1 + xx cycles.)

Agenda: OpenCL platform model; micro-architecture implications; an OpenCL runtime on a many-core system; summary.

Runtime implementation example. On a 1-to-32 ARM-core system, build an OpenCL runtime system — resource management plus on-the-fly compiling — to evaluate work-item execution methods and memory management for the OpenCL memory models.

Target platform: an ARM multi-core virtual platform — a homogeneous many-core with shared main memory, modeled as an on-chip SystemC simulation platform. (Diagram: each compute unit is an ARMv7a ISS with L1 I$/D$, an MMU, and a unified L2 cache; the compute units sit on a transaction-level-model bus together with N IRQ signals, a compute-unit scheduler, an interrupt handler, an IPC channel handler, and DRAM controllers and channels.)

OpenCL runtime system software stack. The OpenCL application programming interface layer gives the programmer a unified programming interface across platforms. The OpenCL runtime layer holds the platform resource manager (devices, memory, execution kernels, etc.) and the OpenCL source-code compiler, which compiles CL source for the target device. The OpenCL device-driver layer provides the driver API the runtime uses to reach the platform.

OpenCL source-code compilation: translate the kernel code to binary code. The CL kernel source goes through the LLVM compiler framework — the front-end compiler emits the OpenCL intermediate representation (IR), and the back-end compiler emits assembly code for the target device — after which the ARM cross compiler (GNU C/C++) produces the target binary code.

Runtime: program and kernel management. A program contains more than one kernel. clCreateProgramWithSource/clBuildProgram use the LLVM compiler and the ARM cross compiler to build object code from the program source. clEnqueueNDRangeKernel decides which kernel is going to run, and the object code is linked. (Diagram: program source, comprising several kernel sources → clCreateProgramWithSource()/clBuildProgram(), via LLVM and ARM GCC → program object code with per-kernel binaries → clSetKernelArg()/clEnqueueNDRangeKernel(), via ARM LD → execution code: a main() that sets the parameters and branches to the kernel binary.)

Runtime: memory mapping — mapping OpenCL program memory onto physical memories. Buffers created by clCreateBuffer (global, constant, local): the runtime system creates a memory object through the memory-allocation function provided by the device driver (mapping physical memory to OpenCL memories); the API returns a pointer to the buffer descriptor for the mapping table, which the runtime keeps. Local memory can also be declared in the kernel source: the LLVM compiler places variables declared with the local keyword in the .bss section, and the MMU memory mapping is set by the per-CPU-core work-item management thread. A kernel's private memory uses stack memory, with the stack set up by the work-item management thread.

Runtime: data transfer. Data is transferred between the host and the target device by clEnqueueWriteBuffer and clEnqueueReadBuffer. For these API calls the runtime system copies data between host memory and target-device memory through the mapping table it keeps: each memory object's buffer descriptor pairs a host-side pointer with a device-side pointer in the memory mapping table, spanning the API, runtime, and device-driver layers.

Runtime: compute-unit management. Each ARM core is mapped to a compute unit (CU), and a CU executes one workgroup at a time. (Diagram: the runtime scheduler takes commands from the device command queues filled by the application, tracks the CU status registers and the CU ready queue through the device driver, and issues/assigns workgroups to the CUs.)

Device and memory mapping — the OpenCL program model mapped onto the CASLAB multi-core platform:

    Host processor         → host CPU (Intel i7)
    Host memory            → host main memory
    Compute device         → SystemC ARM multicore (1 to 32 cores)
    Compute unit           → SystemC ARMv7a ISS
    Processing element     → work-items coalesced into a thread on an ARMv7a core
    Global/constant memory → multi-core shared memory
    Local memory           → per-ARM-core memory (in shared memory)
    Private memory         → each work-item's stack memory

Simulation platform for OpenCL runtime development. The software stack — the OpenCL application layer (a host C/C++ program plus a CL program of kernels), the OpenCL application programming interface layer, the OpenCL runtime layer (platform resource manager and on-the-fly OpenCL source-code compiler), and the OpenCL device-driver layer — runs against the on-chip SystemC simulation platform: compute units (ARMv7a ISS with L1 I$/D$, MMU, and unified L2) on a transaction-level-model bus, with N IRQ signals, a compute-unit scheduler, an interrupt handler, DRAM controllers and channels, and an IPC channel handler (shared memory) linking the host application to the platform.

Work-item coalescing. The work-items of a workgroup are emulated on one CPU core, so context-switching overhead occurs whenever execution switches between work-items. Instead, combine the work-items of a workgroup into a single execution thread; this requires translating the original CL code, as sketched below.
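A hedged sketch of that translation in one dimension (function name illustrative): the original kernel body is wrapped in a loop over the workgroup's local ids, so one ARM-core thread executes the whole workgroup with no context switches.

    /* Coalesced form of the vec_add kernel body (c[i] = a[i] + b[i]):
       one call runs every work-item of one workgroup. */
    void vec_add_coalesced(const float *a, const float *b, float *c,
                           size_t group_id, size_t local_size)
    {
        for (size_t lid = 0; lid < local_size; ++lid) {
            size_t gid = group_id * local_size + lid;  /* emulates get_global_id(0) */
            c[gid] = a[gid] + b[gid];                  /* original kernel body */
        }
    }

Kernels that call barrier() need a more involved translation (splitting the loop at each barrier), since every work-item must reach the barrier before any may proceed.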

New features in OpenCL 2.0. After OpenCL 1.0, 1.1, and 1.2 comes OpenCL 2.0 (July 2013): extended image support (2D/3D, depth, read/write on the same image, OpenGL sharing), shared virtual memory, pipes (transferring data between multiple invocations of kernels, enabling dataflow operations), and an Android driver.

Summary. From application: OpenCL. To technology: architectural support and a runtime implementation.