Multi-Device Basic. Sample User's Guide. Intel SDK for OpenCL* Applications - Samples. Document Number: US


Sample User's Guide
Intel SDK for OpenCL* Applications - Samples
Document Number: 329763-004US

Contents

Legal Information
About Multi-Device Basic Sample
    Algorithm
OpenCL* Implementation
    System-Level Scenario
    Multi-Context Scenario
    Shared-Context Scenario
    Choosing an Appropriate Scenario
Understanding the OpenCL* Performance Characteristics
    Saturating Device Capabilities
    Work-group Size Considerations
Project Structure
APIs Used
Controlling the Sample
Understanding the Sample Output
References

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number/.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel, Intel logo, Intel Core, VTune, Xeon are trademarks of Intel Corporation in the U.S. and other countries.

* Other names and brands may be claimed as the property of others.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission from Khronos.

Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation.

Copyright 2010-2013 Intel Corporation. All rights reserved.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

About Multi-Device Basic Sample

The Multi-Device Basic sample is an example of utilizing the capabilities of a multi-device system. Such systems might have different hardware setups, for example:
- Systems based on CPU and GPU devices, where one OpenCL device is a regular CPU and another is an on-chip or a discrete GPU card.
- HPC systems based on CPU and discrete GPUs or accelerator devices like Intel Xeon Phi coprocessors.

This sample targets systems with multiple Intel Xeon Phi coprocessor devices, but its guidelines and methods are also applicable to multi-device systems with CPU and GPU devices, or a CPU and one Intel Xeon Phi coprocessor device.

For optimal utilization of devices, you need to reduce idle time by loading multiple devices simultaneously. The Multi-Device Basic sample exemplifies three basic scenarios of simultaneous utilization of multiple devices under the same system:
- System-level
- Multi-context
- Shared context

This sample demonstrates a minimal sequence of steps to keep all devices busy simultaneously. It consists of:
- A simple synthetic kernel, operating in 1-dimensional iteration space.
- A simple work partitioning strategy, which comprises dividing all work among devices equally, regardless of their compute capabilities.

The sample utilizes no data sharing and therefore no synchronization between the devices. This is a purely functional sample with no performance instrumentation and no performance reported as sample output.

Algorithm

The sample calculates a synthetic function, which is implemented in the kernel, on a pair of input buffers a and b, and puts the resulting values to the output buffer c. Each buffer consists of work_size elements. For each index i the sample calculates

    c[i] = f(a[i], b[i]), where i = 0..work_size-1

f is a function, implemented in the kernel. Initial values for buffer elements are also synthetic: a[i] = i, b[i] = 2*i.

The aim of the sample is to demonstrate how to divide, allocate, share resources, and load several devices in the system simultaneously. The problem is simplified by excluding borders, halos, or other data shared between adjacent devices, which helps to omit overlapping between devices during resource partitioning. The sample demonstrates basic steps to saturate multiple devices without complex explanations of access patterns and other issues.

The Multi-Device Basic sample utilizes a static work partitioning approach instead of dynamic load balancing among devices. See the other SDK samples at software.intel.com/en-us/vcsource/tools/opencl.

In the considered scenarios the sample uses common math for work partitioning, which is suitable for the case where the number of data items is much larger than the number of devices. According to this strategy, work is divided among devices evenly, and the non-dividable piece is assigned to the last device. For a case where you cannot divide data with small granularity, you need to utilize a different math to distribute the last piece of work among several devices for better load balance.
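The two partitioning strategies described above can be sketched in a few lines of host-side C++. This is a minimal illustration, not the sample's actual source: the names WorkRange, split_remainder_last, and split_remainder_spread are hypothetical, and the second function is only one possible form of the "different math" that the text mentions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of work-item indices for one device.
// (Hypothetical helper type, not taken from the sample.)
struct WorkRange {
    std::size_t begin;
    std::size_t end;
};

// Even split; the non-dividable remainder goes to the last device.
std::vector<WorkRange> split_remainder_last(std::size_t work_size,
                                            std::size_t num_devices) {
    assert(num_devices > 0);
    const std::size_t chunk = work_size / num_devices;
    std::vector<WorkRange> ranges;
    for (std::size_t d = 0; d < num_devices; ++d) {
        const std::size_t begin = d * chunk;
        const std::size_t end =
            (d + 1 == num_devices) ? work_size : begin + chunk;
        ranges.push_back({begin, end});
    }
    return ranges;
}

// Alternative: spread the remainder over the first devices, one item each,
// so that no device gets more than one extra item.
std::vector<WorkRange> split_remainder_spread(std::size_t work_size,
                                              std::size_t num_devices) {
    assert(num_devices > 0);
    const std::size_t chunk = work_size / num_devices;
    const std::size_t extra = work_size % num_devices;
    std::vector<WorkRange> ranges;
    std::size_t begin = 0;
    for (std::size_t d = 0; d < num_devices; ++d) {
        const std::size_t size = chunk + (d < extra ? 1 : 0);
        ranges.push_back({begin, begin + size});
        begin += size;
    }
    return ranges;
}
```

With 10 work items and 3 devices, the first function yields ranges of 3, 3, and 4 items, while the second yields 4, 3, and 3. The difference is negligible when the number of items is much larger than the number of devices, which is the case the sample assumes.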

OpenCL* Implementation

The following scenarios are considered in this sample. Scenario names are not conventional. Each name is an alias of the respective scenario in the document and sample code.
- System-level scenario, where separate devices are picked up by different instances of the same application. Each instance gets its index and recognizes how many application instances run simultaneously to correctly divide work and process a specific portion.
- Multi-context scenario, where one application instance uses all devices, and each device has its own OpenCL* context.
- Shared-context scenario, where all devices are placed in the same shared context and share input and output buffers using sub-buffers.

Refer to the scenario-dedicated sections in this document to understand which scenario best suits your needs.

System-Level Scenario

In the system-level scenario, multi-device parallelism is implemented outside of the host application. Multiple instances of the sample application run simultaneously under the same system. The following illustrates the system-level scenario:

[Fig. 1: System-level scenario. The original data items (work space) are divided among application instances 1..N; each instance has its own host array, OpenCL context, buffer, command queue, device, and program and kernel.]

If you already have an OpenCL-enabled application with the ability to partition work between multiple instances, for example, through MPI, then you do not need to modify this application to use multiple devices. Just run one instance per device.

In case you have an MPI-enabled and OpenCL-enabled application (for a cluster) that is capable of utilizing one device, and you want to utilize a machine with multiple Intel Xeon Phi coprocessors, you do not need to make any adjustments in work partitioning. Run one application instance per Intel Xeon Phi coprocessor under the same system. Coprocessors have equal compute power, which means that the approach of dividing work between devices evenly provides the desired performance scalability, assuming that the execution time is distributed among work items uniformly.

Using the system-level scenario, you should limit the number of devices for each application instance externally, using one of the following methods:
- Setting the OpenCL device type with different values depending on application instance. In such a case, different application instances use different types of devices. For example, one instance uses the CPU device, while another uses the coprocessor device. To set the device type, use the -t command-line option.
- Enabling the OFFLOAD_DEVICES environment variable on the systems with multiple Intel Xeon Phi coprocessors. OFFLOAD_DEVICES does not require any special processing by applications, as the environment variable is supported at OpenCL implementation level. OFFLOAD_DEVICES limits the Intel Xeon Phi coprocessor device visibility to a particular process in the system.
- Combination: set the device type and enable the OFFLOAD_DEVICES environment variable. In such case you can use a combination of the CPU device and multiple coprocessors.

This sample implements the synthetic algorithm that does not involve any inter-device communication, so the code does not organize inter-instance interaction. Each application instance works individually and independently from others. Each instance calculates which work items to process so that all instances calculate the complete result but do not collect the resulting values into one place.

Multi-Context Scenario

In the multi-context scenario, one application instance uses all devices. Yet each device has its own context, so programs, kernels, buffers, and other resources are not shared. You should create the resources individually for each device. You can share only the memory allocation on the host. Each device exploits a separate piece of the allocated host memory by using its own buffer, created with CL_MEM_USE_HOST_PTR.

Individual paths for each of the devices start from almost the very beginning, from the context creation. The same code is executed multiple times for different devices. So syntactically, the code consists of a number of loops over individual devices and queues.

Absence of tight synchronization between devices is a consequence of context separation. You cannot use events from one context in other contexts. So the synchronization that you can organize should involve host-side API calls. This sample explicitly waits for completion in all queues in a loop over all devices with the clFinish call.

The following figure illustrates the multi-context scenario.

[Fig. 2: Multi-context scenario. One application instance holds the host array with the original data items (work space) and OpenCL contexts 1..N; each context has its own buffer, command queue, device, and program and kernel.]
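The way each device in the multi-context scenario takes its own piece of a single host allocation can be sketched as plain pointer arithmetic. This is an illustrative sketch under stated assumptions: the element type float and the names HostSlice and make_host_slices are hypothetical and do not come from the sample. In a real host program, each slice's pointer and byte size would be what is passed to the per-context clCreateBuffer call with CL_MEM_USE_HOST_PTR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One device's piece of the shared host allocation.
// (Hypothetical helper type; float elements are an assumption.)
struct HostSlice {
    float* ptr;         // start of the piece inside the host array
    std::size_t bytes;  // size of the piece in bytes
};

// Cut the host array into non-overlapping pieces, one per device.
// The last device also takes the non-dividable remainder.
std::vector<HostSlice> make_host_slices(std::vector<float>& host,
                                        std::size_t num_devices) {
    assert(num_devices > 0);
    const std::size_t chunk = host.size() / num_devices;
    std::vector<HostSlice> slices;
    for (std::size_t d = 0; d < num_devices; ++d) {
        const std::size_t begin = d * chunk;
        const std::size_t count =
            (d + 1 == num_devices) ? host.size() - begin : chunk;
        slices.push_back({host.data() + begin, count * sizeof(float)});
    }
    return slices;
}
```

Because the pieces never overlap, the devices can write their results without any inter-device synchronization; the host only has to wait until every queue has finished before it reads the array.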

Shared-Context Scenario

In the shared-context scenario, program, kernel and all buffers are shared between all devices and exist in a single OpenCL* context. However, for multiple devices to write to the same output buffer simultaneously, you need to create a non-overlapping sub-buffer for each device. See the OpenCL specification for more information. The following figure illustrates the shared-context scenario:

[Fig. 3: Shared-context scenario. One application instance with a single OpenCL context holds the host array with the original data items (work space), one buffer divided into sub-buffers 1..N, command queues 1..N, devices 1..N, and a single program and kernel.]

You can also use OpenCL events to synchronize multiple devices without host-side participation, which is possible when all command queues coexist in a single context. Use this possibility to wait for the resulting buffer to become ready, which is the moment when all devices finish their NDRange commands. Specifically, use an array of events in the dependence list for the clEnqueueMapBuffer call. See more details in the source code.

To collect all devices of a specified type inside a single context, consider the following methods:
- Call clGetDeviceIDs, which lists the available devices. Call clCreateContext to create a context for the available devices.
- Call clCreateContextFromType directly for platform and device type. Call clGetContextInfo to query the available devices.

Querying the list of devices is necessary in all methods as you need to create a separate command queue for each device. OpenCL has no API for creating an array of command queues to simplify the process. This sample utilizes the method with calling clCreateContextFromType.

Choosing an Appropriate Scenario

You need to consider which scenario best suits your needs. Prefer the system-level scenario if the inter-instance communication is not a bottleneck in your application. Otherwise, prefer multi- or shared-context scenarios, which provide tighter synchronization capabilities between devices and hence better utilization of devices, particularly for application types that spend a lot of time on data transfers.

In the multi-context scenario, individual paths for each of the devices start from almost the very beginning, from the context creation, in comparison to the shared-context scenario, which provides more sharing between devices (compare Fig. 2 with Fig. 3). Due to early separation, the multi-context scenario has less flexibility, particularly in load balancing, and lacks tight synchronization between command queues, which requires more host participation in inter-device scheduling.

Using the multi-context scenario you can create a buffer for a dedicated device and avoid extra cycles for allocating (and potentially duplicating) a buffer in a context with multiple devices. This is relevant for the case when several Intel Xeon Phi coprocessor devices are present in the OpenCL context, which requires some extra time to allocate the entire buffer and might not be suitable for the shared-context scenario.

Using the shared-context scenario you can organize efficient load balancing among devices by dynamically choosing sub-buffer sizes without recreating original buffers. Depending on the device type and OpenCL implementation, the cost of buffer creation might be higher than the cost of sub-buffer creation, so the dynamic load balancing approach can be efficiently implemented with shared-context.

Understanding the OpenCL* Performance Characteristics

Saturating Device Capabilities

You need to choose an appropriate work partitioning scenario to assign enough work to each device. Some types of devices, like Intel Xeon Phi coprocessors, require a large number of work-groups scheduled in one NDRange. If the number of work-groups is insufficient, devices may starve, which lowers performance on multi-device systems. In a multi-device scenario, the given amount of work might be enough to utilize capabilities of one device, but not enough for several devices. Considering the overhead required for multi-device partitioning, dividing the work of one device among several devices can be slower than performing all work on one device.

Work-group Size Considerations

To provide each device with an appropriate global size while dividing work between devices, you should ensure enough granularity. The global size of NDRange enqueued for the device should be a multiple of a predefined value. You can query this value using clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE for a particular kernel. The fixed granularity value satisfies minimal requirements for each pair of a device and a kernel in a multi-device environment, in case the value is used for all devices. This implies an additional requirement for the data partitioning scheme used in the application. While in the sample the additional requirement is not enforced (to keep the source code shorter), in a real application the host logic should follow the requirement to achieve better performance results. Generally this recommendation is implied by the auto-vectorization module of the compiler. See the Intel SDK for OpenCL Applications - Optimization Guide for more information.

Project Structure

All files necessary for sample build and execution reside in the sample directory (MultiDeviceBasic) and in the common directory of the root directory, to which you extract samples. The MultiDeviceBasic directory contains the following files:

Source files:

- multidevice.hpp: declaration of main sample functions, which includes the sample scenarios and kernel creation function.
- multidevice.cpp: entry point, command-line parameters definition and parsing, selecting among scenarios and calling one of them.
- kernel.cpp: creation of an OpenCL* program from a string; kernel code is inlined in this file.
- system.cpp: implementation of the system-level scenario.
- multi.cpp: implementation of the multi-context scenario.
- shared.cpp: implementation of the shared-context scenario.

Scripts to run the system-level scenario with different hardware setups:
- cpu+mic.system-level.sh: runs the system-level scenario with two application instances: one instance is for the CPU OpenCL device, and another for the Intel Xeon Phi coprocessor OpenCL device.
- multimic.system-level.sh: runs the system-level scenario with several application instances; each instance is mapped to a dedicated Intel Xeon Phi coprocessor OpenCL device.
- cpu+multimic.system-level.sh: runs the system-level scenario with the CPU OpenCL device and with several application instances; each instance is mapped to a dedicated Intel Xeon Phi coprocessor OpenCL device.

NOTE: Multi-context and shared-context scenarios are executed directly by running the binary file with a specific command-line option without using any script files. Refer to the Controlling the Sample section for more information.

Other files:
- Makefile: builds the sample binary.
- README.TXT: instructions on building and running the sample. Also provides information on understanding the sample output.

APIs Used

This sample uses the following OpenCL host functions:
- clBuildProgram
- clCreateBuffer
- clCreateCommandQueue
- clCreateContext
- clCreateContextFromType
- clCreateKernel
- clCreateProgramWithSource
- clCreateSubBuffer
- clEnqueueMapBuffer
- clEnqueueNDRangeKernel
- clEnqueueUnmapMemObject
- clFinish
- clFlush
- clGetDeviceIDs
- clGetDeviceInfo
- clGetPlatformIDs
- clGetPlatformInfo
- clReleaseCommandQueue
- clReleaseContext
- clReleaseKernel
- clReleaseMemObject
- clReleaseProgram
- clSetKernelArg
- clWaitForEvents

Controlling the Sample

You can run the following files in the command line:

- multidevice, the sample binary file, which is a console application.
- <hardware_setup>.system-level.sh, which is a script for running the system-level scenario in various hardware configurations, where <hardware_setup> is the placeholder for a hardware setup name.

The multi-context and the shared-context scenarios are executed directly by calling the sample binary with a particular --context command-line option.

You can choose platform, devices, and other parameters through the command line when calling the executable. To view all parameters, run the help command:

    ./multidevice -h

The help command shows the following help text:

Option and description

-h, --help
    Show this help text and exit.

-p, --platform number-or-string
    Select platform, devices of which are used.

-t, --type all | cpu | gpu | acc | default | <OpenCL constant for device type>
    Select the device by type on which the OpenCL kernel is executed.

-c, --context system | multi | shared
    Type of the multi-device scenario used: with system-level partitioning, with multiple devices and multiple contexts for each device or one shared context for all devices. For one device in the system, system = multiple = shared.

-s, --size <integer>
    Set input/output array size.

--instance-count <integer>
    Applicable for system-level scenario only. Number of application instances which will participate in system-level scenario. To identify particular instance, use --instance-index key.

--instance-index <integer>
    Applicable for system-level scenario only. Index of instance among all participating application instances which is set by --instance-count key.

Understanding the Sample Output

The following is an example of possible multidevice binary output for the shared context with CPU and Intel Xeon Phi coprocessor devices:

    $ ./multidevice
    Platforms (1):
        [0] Intel(R) OpenCL [Selected]
    Executing shared-context scenario.
    Context was created successfully.
    Program was created successfully.
    Program was built successfully.
    Number of devices in the context: 2.
    Successfully created command queue for device 0.
    Successfully created command queue for device 1.
    Detected minimal alignment requirement suitable for all devices: 128 bytes.
    Required memory amount for each buffer: 67108864 bytes.
    Buffers were created successfully.
    Sub-buffers for device 0 were created successfully.
    Sub-buffers for device 1 were created successfully.
    Kernel for device 0 was enqueued successfully.
    Kernel for device 1 was enqueued successfully.
    All devices finished execution.
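The alignment line in this output relates to sub-buffer placement: in OpenCL, the origin of a sub-buffer has to respect the base address alignment of the devices in the context. The following is a minimal sketch of how a buffer could be cut into aligned, non-overlapping regions. It is an assumption about the approach rather than the sample's actual code; Region and aligned_regions are hypothetical names, with the Region fields chosen to mirror the origin and size fields of cl_buffer_region.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Byte range of one sub-buffer inside the parent buffer.
// (Hypothetical helper type mirroring cl_buffer_region's fields.)
struct Region {
    std::size_t origin;
    std::size_t size;
};

// Split total_bytes among devices so that every origin is a multiple of
// align_bytes. The last device takes whatever is left.
std::vector<Region> aligned_regions(std::size_t total_bytes,
                                    std::size_t num_devices,
                                    std::size_t align_bytes) {
    assert(num_devices > 0 && align_bytes > 0);
    // Even share per device, rounded down to the alignment.
    // If this is 0, the buffer is too small for this many devices.
    const std::size_t chunk =
        (total_bytes / num_devices) / align_bytes * align_bytes;
    std::vector<Region> regions;
    for (std::size_t d = 0; d < num_devices; ++d) {
        const std::size_t origin = d * chunk;
        const std::size_t size =
            (d + 1 == num_devices) ? total_bytes - origin : chunk;
        regions.push_back({origin, size});
    }
    return regions;
}
```

With the values from the output above (67108864 bytes, 2 devices, 128-byte alignment), each device gets a region of 33554432 bytes, and the second region starts at offset 33554432.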

First, the sample outputs all available platforms and picks one of them (line with [Selected]). Then it reports which scenario is running. In the example, the multidevice binary runs with no command-line parameters, so it executes according to the shared-context scenario by default. Then the sample reports each significant step of the OpenCL code execution and ends when all devices finish working. Note that the sample reports no performance measures.

When running the system-level scripts, several multidevice binaries run at the same time. To avoid mixed and corrupted output, the output from each individual run is forwarded to a file. Output file names are formed from the name of the running script, the device type, and the device number, particularly:

For the cpu+mic.system-level.sh script:
- cpu+mic.system-level.cpu.out: for the CPU device
- cpu+mic.system-level.acc.out: for the Intel Xeon Phi coprocessor device

For the multimic.system-level.sh script:
- multimic.system-level.acc-I.out: for the I-th Intel Xeon Phi device, where I in {0..number of Intel Xeon Phi coprocessor devices minus one}

For the cpu+multimic.system-level.sh script:
- cpu+multimic.system-level.cpu.out: for the CPU device
- cpu+multimic.system-level.acc-I.out: for the I-th Intel Xeon Phi coprocessor device

References

Intel SDK for OpenCL Applications - Optimization Guide