Single-source SYCL C++ on Xilinx FPGA. Xilinx Research Labs Khronos 2017/11/12 19


1 Single-source SYCL C++ on Xilinx FPGA Xilinx Research Labs Khronos 2017/11/12 19

2 Khronos standards for heterogeneous systems
Connecting Software to Silicon
3D for the Web
- Real-time apps and games in-browser
- Efficiently delivering runtime 3D assets
Vision and Neural Networks
- Tracking and odometry
- Scene analysis/understanding
- Neural Network inferencing
Real-time 2D/3D
- Virtual and Augmented Reality
- Cross-platform gaming and UI
- CG Visual Effects
- CAD and Product Design
- Safety-critical displays
Parallel Computation
- Machine Learning acceleration
- Embedded vision processing
- High Performance Computing (HPC)

3 Promoter members
Over 100 members worldwide
Any company is welcome to join
Copyright Khronos Group 2017

4 Complete example of matrix addition in OpenCL SYCL

    #include <CL/sycl.hpp>
    #include <iostream>

    using namespace cl::sycl;

    constexpr size_t N = 2;
    constexpr size_t M = 3;
    using Matrix = float[N][M];

    // Compute sum of matrices a and b into c
    int main() {
      Matrix a = { { 1, 2, 3 }, { 4, 5, 6 } };
      Matrix b = { { 2, 3, 4 }, { 5, 6, 7 } };

      Matrix c;

      { // Create a queue to work on default device
        queue q;
        // Wrap some buffers around our data
        buffer A { &a[0][0], range { N, M } };
        buffer B { &b[0][0], range { N, M } };
        buffer C { &c[0][0], range { N, M } };
        // Enqueue some computation kernel task
        q.submit([&](handler& cgh) {
          // Define the data used/produced
          auto ka = A.get_access<access::mode::read>(cgh);
          auto kb = B.get_access<access::mode::read>(cgh);
          auto kc = C.get_access<access::mode::write>(cgh);
          // Create & call kernel named "mat_add"
          cgh.parallel_for<class mat_add>(range { N, M },
            [=](id<2> i) { kc[i] = ka[i] + kb[i]; }
          );
        }); // End of our commands for this queue
      } // End scope, so wait for the buffers to be released
      // Copy back the buffer data with RAII behaviour
      std::cout << "c[0][2] = " << c[0][2] << std::endl;
      return 0;
    }
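The comments "End scope, so wait for the buffers to be released" and "Copy back the buffer data with RAII behaviour" carry the key idea of the slide: the result is written back to the host array when the buffers go out of scope. The sketch below imitates that behaviour in plain C++, without any SYCL runtime. The names ScopedBuffer and mat_add_with_scope are hypothetical and not part of SYCL or triSYCL, and the destructor copy is only an assumption about how such a write-back could look.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a SYCL buffer (not the real API): it works on a
// private copy of the host data and writes it back when it is destroyed.
class ScopedBuffer {
 public:
  ScopedBuffer(float* host, std::size_t count)
      : host_(host), data_(host, host + count) {}

  // Copying is disabled so the write-back happens exactly once.
  ScopedBuffer(const ScopedBuffer&) = delete;
  ScopedBuffer& operator=(const ScopedBuffer&) = delete;

  // The destructor plays the role of "end of scope": copy back to the host.
  ~ScopedBuffer() {
    for (std::size_t i = 0; i < data_.size(); ++i) host_[i] = data_[i];
  }

  float& operator[](std::size_t i) {
    assert(i < data_.size());
    return data_[i];
  }

 private:
  float* host_;
  std::vector<float> data_;
};

// Adds a and b element-wise into c, mirroring the scoping used on the slide.
// All three arrays must hold `count` initialized floats.
inline void mat_add_with_scope(float* a, float* b, float* c,
                               std::size_t count) {
  {
    ScopedBuffer A(a, count), B(b, count), C(c, count);
    for (std::size_t i = 0; i < count; ++i) C[i] = A[i] + B[i];
  }  // c is only guaranteed to hold the sums after this closing brace
}
```

Calling mat_add_with_scope on the flattened matrices a and b from the slide, with c zero-initialized, leaves 7 in the element that corresponds to c[0][2].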

5 triSYCL
Open Source SYCL 1.2/2.2
- Uses C++17 templated classes
- Used by Khronos to define the SYCL and OpenCL C++ standard
  - Languages are now too complex to be defined without implementing
- On-going implementation started at AMD and now led by Xilinx
- https://github.com/trisycl/trisycl
- OpenMP for host parallelism
- Boost.Compute for OpenCL interaction
- Prototype of device compiler for Xilinx FPGA
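The matrix addition example writes buffer and range without template arguments, which presumably relies on C++17 class template argument deduction, in line with the "C++17 templated classes" point. The following is a minimal sketch of that language feature only. Range and Buffer here are hypothetical miniature types, not the triSYCL classes, and the real implementation may well be organised differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical miniature of a SYCL-like range: the number of dimensions is a
// template parameter deduced from the number of constructor arguments.
template <std::size_t Dims>
struct Range {
  std::array<std::size_t, Dims> extent;

  template <typename... Sizes>
  Range(Sizes... sizes) : extent{static_cast<std::size_t>(sizes)...} {}

  // Total number of points in the iteration space.
  std::size_t size() const {
    std::size_t total = 1;
    for (std::size_t e : extent) total *= e;
    return total;
  }
};

// Deduction guide: Range { 2, 3 } becomes Range<2>.
template <typename... Sizes>
Range(Sizes...) -> Range<sizeof...(Sizes)>;

// Hypothetical miniature of a buffer: element type and dimensions are both
// deduced from the constructor arguments.
template <typename T, std::size_t Dims>
struct Buffer {
  T* data;
  Range<Dims> range;
  Buffer(T* d, Range<Dims> r) : data(d), range(r) { assert(d != nullptr); }
};
```

With these definitions, Buffer B { &b[0][0], Range { 2, 3 } } on a float array deduces Buffer<float, 2> without any explicit template argument.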

6 Architecture

    #include <CL/sycl.hpp>
    [...]
    q.submit([&](auto &cgh) {
      // The kernel writes a, so get a write accessor on it
      auto A = a.get_access<access::mode::write>(cgh);
      // Enqueue parallel kernel on a N*M 2D iteration space
      cgh.parallel_for<class init_a>({ N, M },
        [=] (auto index) {
          A[index] = index[0]*2 + index[1];
        });
    });}

Diagram labels:
- C++ SYCL source (#include <CL/sycl.hpp>)
- Unmodified host compiler (gcc/clang/vs/icc)
- C++17 & OpenMP & Boost
- OpenMP CPU executable
- For OpenCL interoperability: libOpenCL.so
- Clang/LLVM host & kernel caller
- OpenCL interoperability (Boost.Compute)
- SYCL runtime
- Clang/LLVM device compiler
- Device Compiler Runtime
- SPIR 2.0 de facto
- Vendor OpenCL device compiler
- kernels.bin
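On the host path the kernel is compiled by an ordinary C++ compiler and run on the CPU. A rough way to picture that is a pair of loops that call the kernel once per point of the N*M iteration space. The sketch below does this in plain C++. Index2, host_parallel_for and init_a are hypothetical names introduced for illustration, the loops are sequential (an OpenMP pragma on the outer loop would be one way to parallelise them), and none of this is taken from the triSYCL sources.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 2D index type standing in for the SYCL index object.
using Index2 = std::array<std::size_t, 2>;

// Hypothetical host-side fallback: run the kernel once per point of an
// n * m iteration space. The loops here are sequential.
template <typename Kernel>
void host_parallel_for(std::size_t n, std::size_t m, Kernel kernel) {
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < m; ++j)
      kernel(Index2{i, j});
}

// Fill an n * m row-major array with the same formula as the init_a kernel.
inline void init_a(int* a, std::size_t n, std::size_t m) {
  assert(a != nullptr);
  host_parallel_for(n, m, [=](Index2 index) {
    a[index[0] * m + index[1]] = static_cast<int>(index[0] * 2 + index[1]);
  });
}
```

For a 2 by 3 array this gives the rows 0 1 2 and 2 3 4.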

7 SPIR 2.0 de facto output with Clang 3.9.1

    ; ModuleID = 'device_compiler/single_task_vector_add_drt.kernel.bc'
    source_filename = "device_compiler/single_task_vector_add_drt.cpp"
    target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-s128"
    target triple = "spir64"

    declare gxx_personality_v0()

    ; Function Attrs: noinline norecurse nounwind uwtable
    define spir_kernel addrspace(1)* %f000val, i32 addrspace(1)* %f010val, i32 addrspace(1)* %f020val) unnamed_addr #0 !kernel_arg_addr_space !3 !kernel_arg_type !4 !kernel_arg_base_type !4 !kernel_arg_type_qual !5 !kernel_arg_access_qual !6 {
    entry:
      br label %forbodyi

    forbodyi: ; preds = %forbodyi, %entry
      %indvarsivi = phi i64 [ 0, %entry ], [ %indvarsivnexti, %forbodyi ]
      %arrayidxii = getelementptr inbounds i32, i32 addrspace(1)* %f010val, i64 %indvarsivi
      %0 = load i32, i32 addrspace(1)* %arrayidxii, align 4, !tbaa !7
      %arrayidxi15i = getelementptr inbounds i32, i32 addrspace(1)* %f020val, i64 %indvarsivi
      %1 = load i32, i32 addrspace(1)* %arrayidxi15i, align 4, !tbaa !7
      %addi = add nsw i32 %1, %0
      %arrayidxi13i = getelementptr inbounds i32, i32 addrspace(1)* %f000val, i64 %indvarsivi
      store i32 %addi, i32 addrspace(1)* %arrayidxi13i, align 4, !tbaa !7
      %indvarsivnexti = add nuw nsw i64 %indvarsivi, 1
      %exitcondi = icmp eq i64 %indvarsivnexti, 300
      br i1 %exitcondi, label %"_ZZZ9test_mainiPPcENK3$_1clERN2cl4sycl7handlerEENKUlvE_clEvexit", label %forbodyi

    "_ZZZ9test_mainiPPcENK3$_1clERN2cl4sycl7handlerEENKUlvE_clEvexit": ; preds = %forbodyi
      ret void
    }

    attributes #0 = { noinline norecurse nounwind uwtable "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

    !llvm.ident = !{!0}
    !opencl.spir.version = !{!1}
    !opencl.ocl.version = !{!2}

    !0 = !{!"clang version 3.9.1 "}
    !1 = !{i32 2, i32 0}
    !2 = !{i32 1, i32 2}
    !3 = !{i32 1, i32 1, i32 1}
    !4 = !{!"int *", !"int *", !"int *"}
    !5 = !{!"", !"", !""}
    !6 = !{!"read_write", !"read_write", !"read_write"}
    !7 = !{!8, !8, i64 0}
    !8 = !{!"int", !9, i64 0}
    !9 = !{!"omnipotent char", !10, i64 0}
    !10 = !{!"Simple C++ TBAA"}
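Read at source level, the loop in this IR loads one int from each of two global pointers, adds them, stores the sum through a third pointer, and exits after 300 iterations. The sketch below is a hand-written C++ equivalent of that loop body, inferred from the IR alone. The function name vector_add and the parameter names are hypothetical, and the original kernel source is not shown on the slide.

```cpp
#include <cassert>
#include <cstddef>

// Trip count taken from the `icmp eq i64 ..., 300` exit test in the IR.
constexpr std::size_t kTripCount = 300;

// Hypothetical source-level view of the kernel: one output pointer and two
// input pointers to int (i32 in address space 1 in the IR). Each array must
// hold at least kTripCount elements.
inline void vector_add(int* out, const int* in1, const int* in2) {
  assert(out != nullptr && in1 != nullptr && in2 != nullptr);
  for (std::size_t i = 0; i != kTripCount; ++i)
    out[i] = in2[i] + in1[i];  // corresponds to `add nsw i32 %1, %0`
}
```

The three pointers map to the three kernel arguments listed as "int *" in the !kernel_arg_type metadata.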

8 After Xilinx SDx xocc ingestion

9 After Xilinx SDx xocc ingestion
FPGA layout!

10 Code execution on real FPGA

    (device)$ device_compiler/single_task_vector_add_drt.kernel_caller
    binary_size =
    task::add_prelude
    task::add_prelude
    task::add_prelude
    accessor(accessor &a) : &a = 0x7ffd39395f40 &buffer =0x7ffd39395f50
    accessor(accessor &a) : &a = 0x7ffd39395f30 &buffer =0x7ffd39395f60
    accessor(accessor &a) : &a = 0x7ffd39395f20 &buffer =0x7ffd39395f70
    single_task &f = 0x7ffd39395f50
    task::prelude
    schedule_kernel &k = 0x
    Setting up _ZN2cl4sycl6detail18instantiate_kernelIZZ9test_mainiPPcENK3$_1clERNS0_7handlerEE3addZZ9test_mainiS4_ENKS5_clES7_EUlvE_EEvT0_ aka TRISYCL_kernel_0
    Name device xilinx_adm-pcie-7v3_1ddr_3_0
    serialize_accessor_arg index =0, size = 4, arg = 0
    serialize_accessor_arg index =1, size = 4, arg = 0x1
    serialize_accessor_arg index =2, size = 4, arg = 0x2
    **** no errors detected


SYCL and OpenCL State of the Nation
Michael Wong, ISOCPP VP, Codeplay Vice President of R & D, SYCL Working Group Chair, Chair C++ Standard SG5, SG14, michael@codeplay.com, wongmichael.com
Ronan Keryell, Xilinx