Intel Array Building Blocks (Intel ArBB) Technical Presentation


1 Intel Array Building Blocks (Intel ArBB) Technical Presentation
Noah Clemons, Software and Services Group, Developer Products Division, Performance and Productivity Libraries
Copyright 2010, Intel Corporation. All rights reserved.

2 Agenda
- Understand the key ideas behind Intel ArBB
- Understand the syntax
- Code walkthroughs
- Intel Parallel Building Blocks
- Q & A
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

3 Intel Array Building Blocks - Benefits
- Generalized data-parallel programming model
- Supports a wide variety of patterns and collections
- Supports explicit dynamic generation and management of code
- Implementation targets both threads and vector code
- Machine-independent optimization
- Offload management
- Machine-specific code generation and optimizations
- Scalable threading runtime
[Figure: architecture stack. Labels: Application, C++ API calling ArBB APIs, Other Language Bindings, Virtual Machine, Virtual ISA, Debug/Svcs, Memory Manager, Backend JIT Compiler, Threading Runtime, CPU, Accelerator, Future]

4 How does it work?
- Sequentially consistent semantics
- Intel ArBB kernels in a serial C++ app
- Standard C++ compiler: templates, overloaded operators
- Links with a dynamic library
- Intel ArBB runtime: dynamic compiler, threading and heterogeneous runtime
[Figure: execution targets labeled CPU, Future, Future]


5 Interface: The API as a Language
- Syntax and semantics that extend C++
  - Adds parallel collection objects and methods to C++
  - Uses standard C++ features (templates and operator overloading) to create new types and operators
  - Sequences of API calls are fused and optimized by a JIT compiler
- Works with standard C++ compilers
  - Intel C++ Compiler
  - Microsoft* Visual* C++ Compiler
  - GNU Compiler Collection*
- Express algorithms using mathematical notation
  - Developers focus on what to do, not how to do it
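To make the "templates and operator overloading" point concrete, here is a minimal sketch in plain C++. `ToyDense` is a hypothetical stand-in written for this note, not the ArBB `dense<T>` type, and it evaluates eagerly with an ordinary loop, whereas the slide says ArBB fuses sequences of calls through a JIT compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy container (NOT arbb::dense): shows how an overloaded
// operator lets a kernel be written in mathematical notation.
template <typename T>
class ToyDense {
public:
    explicit ToyDense(std::vector<T> values) : data_(std::move(values)) {}

    std::size_t size() const { return data_.size(); }
    const T& operator[](std::size_t i) const { return data_[i]; }

    // Element-wise addition; evaluated eagerly in this sketch.
    friend ToyDense operator+(const ToyDense& lhs, const ToyDense& rhs) {
        assert(lhs.size() == rhs.size());
        std::vector<T> out(lhs.size());
        for (std::size_t i = 0; i < out.size(); ++i) {
            out[i] = lhs.data_[i] + rhs.data_[i];
        }
        return ToyDense(std::move(out));
    }

private:
    std::vector<T> data_;
};

// The kernel body states what to compute, not how to loop over it.
template <typename T>
void vecsum(const ToyDense<T>& a, const ToyDense<T>& b, ToyDense<T>& c) {
    c = a + b;
}
```

Calling `vecsum(a, b, c)` on two three-element containers leaves their element-wise sum in `c`; the caller never writes an index loop.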

6 Code Skeleton for Intel ArBB Applications
Use the following code skeleton for Intel ArBB applications:
    int main(int argc, char* argv[]) {
        int ret_code;
        try {
            // call into ArBB code
            ret_code = EXIT_SUCCESS;
        } catch (const std::exception& e) {
            ret_code = EXIT_FAILURE;
        } catch (...) {
            cerr << "Error: Unknown exception caught!" << endl;
            ret_code = EXIT_FAILURE;
        }
        return ret_code;
    }
- ArBB indicates runtime errors through C++ exceptions
- arbb::exception inherits from std::exception
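The skeleton's error handling can be tried without ArBB installed. The sketch below wraps the same try/catch structure in a helper; `run_guarded` and its `work` parameter are names invented for this note, and the callable stands in for "call into ArBB code".

```cpp
#include <cstdlib>
#include <exception>
#include <functional>
#include <iostream>

// Hypothetical helper: runs `work` and maps any exception to an exit code,
// mirroring the try/catch skeleton from the slide.
int run_guarded(const std::function<void()>& work) {
    try {
        work();  // stand-in for the call into ArBB code
        return EXIT_SUCCESS;
    } catch (const std::exception& e) {
        // The slide states that arbb::exception inherits from std::exception,
        // so an ArBB runtime error would be caught by this handler.
        std::cerr << "Error: " << e.what() << std::endl;
        return EXIT_FAILURE;
    } catch (...) {
        std::cerr << "Error: Unknown exception caught!" << std::endl;
        return EXIT_FAILURE;
    }
}
```

A `main` built on this would simply `return run_guarded(...)`. Unlike the slide, this version also prints `e.what()`, which is a small addition of this sketch.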

7 A First Example: Vector Addition
Plain C version. Add two vectors a and b of length SIZE into vector c.
    void vecsum(float* a, float* b, float* c, int size) {
        for (int i = 0; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
    int main(int argc, char** argv) {
        #define SIZE 1024
        float a[SIZE];
        float b[SIZE];
        float c[SIZE];
        vecsum(a, b, c, SIZE);
    }

8 Step 1: Figure out Kernel Signature
The slide shows the plain C version from slide 7 alongside for comparison. The Intel ArBB version so far:
    void vecsum(dense<f32> a, dense<f32> b, dense<f32>& c) {
    }
    int main(int argc, char** argv) {
        #define SIZE 1024
        float a[SIZE];
        float b[SIZE];
        float c[SIZE];
    }

9 Step 2: Prepare Data
The slide shows the plain C version from slide 7 alongside for comparison. The Intel ArBB version so far:
    void vecsum(dense<f32> a, dense<f32> b, dense<f32>& c) {
    }
    int main(int argc, char** argv) {
        #define SIZE 1024
        float a[SIZE];
        float b[SIZE];
        float c[SIZE];
        dense<f32> va; bind(va, a, SIZE);
        dense<f32> vb; bind(vb, b, SIZE);
        dense<f32> vc; bind(vc, c, SIZE);
    }

10 Step 3: Set up Bridge from C/C++ to Intel ArBB
The slide shows the plain C version from slide 7 alongside for comparison. In the Intel ArBB version, call(vecsum)(va, vb, vc) takes the place of the plain C call vecsum(a, b, c, SIZE):
    void vecsum(dense<f32> a, dense<f32> b, dense<f32>& c) {
    }
    int main(int argc, char** argv) {
        #define SIZE 1024
        float a[SIZE];
        float b[SIZE];
        float c[SIZE];
        dense<f32> va; bind(va, a, SIZE);
        dense<f32> vb; bind(vb, b, SIZE);
        dense<f32> vc; bind(vc, c, SIZE);
        call(vecsum)(va, vb, vc);
    }

11 Step 4: Implement Kernel
The slide shows the plain C version from slide 7 alongside for comparison. The complete Intel ArBB version:
    void vecsum(dense<f32> a, dense<f32> b, dense<f32>& c) {
        c = a + b;
    }
    int main(int argc, char** argv) {
        #define SIZE 1024
        float a[SIZE];
        float b[SIZE];
        float c[SIZE];
        dense<f32> va; bind(va, a, SIZE);
        dense<f32> vb; bind(vb, b, SIZE);
        dense<f32> vc; bind(vc, c, SIZE);
        call(vecsum)(va, vb, vc);
    }

12 What We Learned from this Example
- Create Array Building Blocks data structures
- Encapsulate operations on those structures
- Invoke an Array Building Blocks function

13 Intel ArBB Benefits Shown in this Example
- Ease of use
  - Dense containers are used to implicitly express data parallelism
  - Simple syntax invites users to focus on high-level algorithms
- Performance
  - The syntax does not allow aliases, which permits aggressive optimization by the runtime
- Safe by design
  - Separate memory spaces for Intel ArBB and C/C++ objects
  - Intel ArBB objects can only be operated on by Intel ArBB functions
  - The call operator is needed to invoke an Intel ArBB function from inside a C/C++ function (see the next slide for a description of call)

14 Capturing Computation
A call() expression works like this:
- If it has never seen the function passed in before, it captures the function into a closure, then executes it
- Otherwise, it executes the previously captured closure
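The capture-once behaviour described for call() can be mimicked with a small cache keyed by function pointer. This is only an illustration of the caching idea under simplified assumptions: `ToyCallCache`, `Kernel`, `add_one`, and `twice` are hypothetical names, and the "closure" here is just a wrapped function pointer rather than compiled code.

```cpp
#include <cstddef>
#include <functional>
#include <map>

// Hypothetical kernel signature used only in this sketch.
using Kernel = void (*)(int&);

// Toy model of "capture on first call, reuse afterwards".
class ToyCallCache {
public:
    void call(Kernel fn, int& arg) {
        auto it = closures_.find(fn);
        if (it == closures_.end()) {
            // First time this function is seen: capture it into a closure.
            ++captures_;
            it = closures_.emplace(fn, std::function<void(int&)>(fn)).first;
        }
        // Always execute the (new or previously captured) closure.
        it->second(arg);
    }

    std::size_t captures() const { return captures_; }

private:
    std::map<Kernel, std::function<void(int&)>> closures_;
    std::size_t captures_ = 0;
};

// Two example kernels for the cache.
inline void add_one(int& x) { x += 1; }
inline void twice(int& x) { x *= 2; }
```

Calling `add_one` twice through the cache increments the capture count only once; calling `twice` afterwards adds a second capture.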

15 Logically Separated Memory Spaces
[Figure: C/C++ space and Intel ArBB space, with copyin and copyout arrows between C(++) code and ArBB code]
The runtime ensures that data copying happens only when required.
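One way to picture "copying only when required" is a buffer that tracks which side holds the current data. The sketch below is a toy model, not how the ArBB runtime is implemented: `ToyBoundBuffer` and its members are hypothetical, and the policy (copy in when the kernel-side copy is stale, copy out when it is newer) is an assumption chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model of two logically separate memory spaces with lazy copies.
class ToyBoundBuffer {
public:
    ToyBoundBuffer(double* host, std::size_t n) : host_(host), shadow_(n) {}

    // Kernel-side access: copy in from the host array only if stale.
    std::vector<double>& kernel_view() {
        if (!shadow_valid_) {
            shadow_.assign(host_, host_ + shadow_.size());
            ++copy_ins_;
            shadow_valid_ = true;
        }
        shadow_newer_ = true;  // assume the kernel may have written
        return shadow_;
    }

    // Host-side access: copy out only if the kernel side holds newer data.
    double* host_view() {
        if (shadow_newer_) {
            std::copy(shadow_.begin(), shadow_.end(), host_);
            ++copy_outs_;
            shadow_newer_ = false;
        }
        shadow_valid_ = false;  // conservatively assume the host may write
        return host_;
    }

    std::size_t copy_ins() const { return copy_ins_; }
    std::size_t copy_outs() const { return copy_outs_; }

private:
    double* host_;
    std::vector<double> shadow_;
    bool shadow_valid_ = false;
    bool shadow_newer_ = false;
    std::size_t copy_ins_ = 0;
    std::size_t copy_outs_ = 0;
};
```

Two consecutive kernel-side accesses trigger a single copy-in, and the host array only sees kernel writes after `host_view()` is called.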

16 Other Benefits of Intel ArBB
- Sequential semantics
  - Developers do not use threads, locks, or other lower-level constructs and can avoid the associated complexity
  - Programmers can reason and debug as if the program were serial
- The ArBB dynamic execution model provides advantages
  - Performance transparency: translation of seemingly sequential, scalar-based code into highly efficient, SIMD-ized and parallelized code, depending on the low-level architecture
  - Forward scalability: automatic scaling to the increased core counts and wider SIMD widths of future Intel architectures

18 Dot Product of Vectors
Plain C version
    void dot_product(double* a, double* b, double* c, int size) {
        for (int i = 0; i < size; i++) {
            *c += a[i] * b[i];
        }
    }
    int main() {
        #define SIZE 1024
        double a[SIZE], b[SIZE], c = 0.0;
        for (int i = 0; i < SIZE; ++i) {
            a[i] = ...; b[i] = ...;
        }
        dot_product(a, b, &c, SIZE);
        return 0;
    }

19 Dot Product of Vectors
Intel ArBB version: Using arbb::bind
The slide shows the plain C version from slide 18 alongside for comparison. The result is a scalar, so it is declared as f64 and read back with value(), as in the arbb::range version.
    void dot_product(const dense<f64>& a, const dense<f64>& b, f64& c) {
        c = add_reduce(a * b);
    }
    int main() {
        #define SIZE 1024
        ARBB_CPP_ALIGN(double a[SIZE]);
        ARBB_CPP_ALIGN(double b[SIZE]);
        for (int i = 0; i < SIZE; ++i) {
            a[i] = ...; b[i] = ...;
        }
        dense<f64> va, vb;
        f64 c;
        bind(va, a, SIZE);
        bind(vb, b, SIZE);
        call(dot_product)(va, vb, c);
        std::cout << value(c) << std::endl;
        return 0;
    }
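For readers without ArBB, the expression add_reduce(a * b) (an element-wise product followed by a sum) has a close analogue in the C++ standard library. The sketch below uses std::inner_product on std::vector; it is a serial reference for the arithmetic only and makes no claim about how ArBB parallelizes the reduction.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Serial reference for the dot product: multiply element-wise, then sum.
double dot_product(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}
```

For a = {1, 2, 3} and b = {4, 5, 6} this returns 1*4 + 2*5 + 3*6 = 32.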

20 Data Binding
arbb::bind()
- Binds Intel ArBB containers to C/C++ data
- The bound containers can be operated on by Intel ArBB functions
- Use arbb::bind() when you already have data allocated using C/C++ types
- You can bind 1D, 2D, or 3D dense containers
- You can bind containers of user-defined types
- You can bind a portion of a C/C++ array
- You can also bind non-consecutive elements in a C/C++ array
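Binding a portion of an array, or non-consecutive elements of it, amounts to viewing existing storage through an offset and a stride. The sketch below shows that idea in plain C++; `ToyStridedView` is a hypothetical type for this note and does not reproduce the arbb::bind() signature, which the slides do not spell out for these cases.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning view over existing C/C++ storage.
// Element i of the view is base[i * stride].
template <typename T>
class ToyStridedView {
public:
    ToyStridedView(T* base, std::size_t count, std::size_t stride)
        : base_(base), count_(count), stride_(stride) {}

    std::size_t size() const { return count_; }

    T& operator[](std::size_t i) const {
        assert(i < count_);
        return base_[i * stride_];
    }

private:
    T* base_;
    std::size_t count_;
    std::size_t stride_;
};
```

With an eight-element array `data`, `ToyStridedView<int>(data, 4, 2)` exposes every second element, and `ToyStridedView<int>(data + 5, 3, 1)` exposes the last three; writes through either view land in the original array.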

21 Dot Product of Vectors
Intel ArBB version: Using arbb::range
The slide shows the plain C version from slide 18 alongside for comparison. The containers are given an explicit size so that the ranges cover SIZE elements.
    void dot_product(const dense<f64>& a, const dense<f64>& b, f64& c) {
        c = add_reduce(a * b);
    }
    int main() {
        #define SIZE 1024
        dense<f64> a(SIZE), b(SIZE);
        f64 c;
        range<f64> range_a = a.write_only_range();
        range<f64> range_b = b.write_only_range();
        for (int i = 0; i < SIZE; ++i) {
            range_a[i] = ...; range_b[i] = ...;
        }
        call(dot_product)(a, b, c);
        std::cout << value(c) << std::endl;
        return 0;
    }

22 Accessing Containers as a Range
- A range allows accessing containers as if they were plain C/C++ data
- Use a range when your data is allocated in the Intel ArBB space
- A range provides operator[] to index into containers
- A range also provides random access iterators

23 Heat Dissipation (Algorithm)
- Data structure: 2D grid (N x M cells)
- Boundary cells hold the boundary conditions
- Algorithm: sweep over the grid
  - Update non-boundary cells
  - Read the cells N, S, E, and W of the current cell
  - Take the average of those values
[Figure: grid diagram. Labels: Stencil, Solution cells, Boundary cells, Boundary conditions]

24 Heat Dissipation
Plain C version
    // Run ITER sweeps over the 2D grid.
    void apply_stencil(double** grid1, double** grid2) {
        for (int iter = 0; iter < ITER; iter++) {
            stencil(grid1, grid2);
            // After each sweep, swap source and destination grid.
            double** tmp = grid1;
            grid1 = grid2;
            grid2 = tmp;
        }
    }
    // For each grid cell apply stencil.
    void stencil(double** src, double** dst) {
        for (int i = 1; i < SIZE-1; i++) {
            for (int j = 1; j < SIZE-1; j++) {
                dst[i][j] = 0.25*(src[i+1][j] + src[i-1][j] +
                                  src[i][j+1] + src[i][j-1]);
            }
        }
    }
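The plain C version leaves ITER, SIZE, and the grid allocation implicit. A self-contained variant, assuming std::vector storage and an explicit iteration count (both choices made for this sketch, not taken from the slides), might look like this:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// One sweep: each interior cell becomes the average of its four neighbours.
// Boundary cells of dst are left untouched.
void stencil(const Grid& src, Grid& dst) {
    for (std::size_t i = 1; i + 1 < src.size(); ++i) {
        for (std::size_t j = 1; j + 1 < src[i].size(); ++j) {
            dst[i][j] = 0.25 * (src[i + 1][j] + src[i - 1][j] +
                                src[i][j + 1] + src[i][j - 1]);
        }
    }
}

// Runs `iterations` sweeps, swapping source and destination after each one.
Grid apply_stencil(Grid grid, int iterations) {
    Grid scratch = grid;  // both buffers start with the same boundary values
    for (int iter = 0; iter < iterations; ++iter) {
        stencil(grid, scratch);
        std::swap(grid, scratch);
    }
    return grid;
}
```

On a 3 x 3 grid whose boundary cells are 1.0 and whose single interior cell is 0.0, one sweep sets the interior cell to 0.25 * (1 + 1 + 1 + 1) = 1.0.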

25 Heat Dissipation
Intel ArBB version
    // Run ITER sweeps over the 2D grid.
    void apply_stencil(dense<f64, 2>& grid, dense<f64, 2>& swap) {
        _for (i32 i = 0, i < ITER, ++i) {
            _if (0 == (i & 1)) {
                map(stencil)(grid, swap);
            } _else {
                map(stencil)(swap, grid);
            } _end_if;
        } _end_for;
    }
    // Test for boundary cells; otherwise apply stencil.
    void stencil(f64 src, f64& dst) {
        arbb::array<usize, 2> coord;
        position(coord);
        usize x = coord[0], y = coord[1];
        _if (x == 0 || y == 0 || x == WIDTH-1 || y == HEIGHT-1) {
            dst = src;
        } _else {
            dst = 0.25 * (neighbor(src, -1, 0) + neighbor(src, 1, 0) +
                          neighbor(src, 0, -1) + neighbor(src, 0, 1));
        } _end_if;
    }

26 Highlights from this Code
- Implement a kernel that operates on a single cell
- Apply the kernel across all cells using the arbb::map() operator
- Inside the kernel:
  - Use arbb::position() to get the coordinates of the current cell
  - Use arbb::neighbor() to read the values of the neighboring cells

27 Heat Dissipation
Intel ArBB version: A better solution
    // Run ITER sweeps over the 2D grid.
    void apply_stencil(dense<f64, 2>& grid) {
        _for (i32 i = 0, i < ITER, ++i) {
            map(stencil)(grid);
        } _end_for;
    }
    // Test for boundary cells; apply stencil to the others.
    void stencil(f64& cell) {
        arbb::array<usize, 2> coord;
        position(coord);
        usize x = coord[0], y = coord[1];
        _if (x != 0 && y != 0 && x != WIDTH-1 && y != HEIGHT-1) {
            cell = 0.25 * (neighbor(cell, -1, 0) + neighbor(cell, 1, 0) +
                           neighbor(cell, 0, -1) + neighbor(cell, 0, 1));
        } _end_if;
    }

28 Highlights from this Code
- Almost the same as the previous version, but:
  - The stencil uses a single parameter for both input and output
  - The ArBB runtime and memory manager take care of the shadow copy
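Why a shadow copy matters when one parameter is both input and output can be seen in a one-dimensional toy example. Both functions below are hypothetical illustrations written for this note: the first reads neighbours from a snapshot taken before the sweep, the second reads from the array it is overwriting and therefore picks up values already updated in the same sweep.

```cpp
#include <cstddef>
#include <vector>

// Reads neighbours from a snapshot, so every cell sees pre-sweep values.
std::vector<double> smooth_with_shadow(std::vector<double> cells) {
    const std::vector<double> shadow = cells;  // the shadow copy
    for (std::size_t i = 1; i + 1 < cells.size(); ++i) {
        cells[i] = 0.5 * (shadow[i - 1] + shadow[i + 1]);
    }
    return cells;
}

// Naive in-place update: later cells see values written earlier in the sweep.
std::vector<double> smooth_in_place_naive(std::vector<double> cells) {
    for (std::size_t i = 1; i + 1 < cells.size(); ++i) {
        cells[i] = 0.5 * (cells[i - 1] + cells[i + 1]);
    }
    return cells;
}
```

Starting from {8, 0, 0, 0}, the shadow-copy version yields {8, 4, 0, 0}, while the naive version yields {8, 4, 2, 0} because the third cell reads the already-updated second cell.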

29 Heat Dissipation
Intel ArBB version: An even better solution
    // Run ITER sweeps over the 2D grid.
    void apply_stencil(dense<f64, 2>& grid) {
        _for (i32 i = 0, i < ITER, ++i) {
            map(stencil)(grid);
        } _end_for;
    }
    // Apply stencil. No explicit testing for boundary cells.
    void stencil(f64& cell) {
        cell = 0.25 * (neighbor(cell, -1, 0) + neighbor(cell, 1, 0) +
                       neighbor(cell, 0, -1) + neighbor(cell, 0, 1));
    }
    void dissipation(dense<f64, 2>& grid) {
        arbb::array<usize, 2> sizes = grid.size();
        dense<f64, 2> interior = section(grid,
                                         1, sizes[0] - 2,
                                         1, sizes[1] - 2);
        apply_stencil(interior);
        grid = replace(grid, 1, sizes[0] - 2, 1, sizes[1] - 2, interior);
    }

30 Highlights from this Code
- Very clean code
- No if-else to handle boundary cells
- Makes it easier for the Intel ArBB runtime to do more optimizations

31 Key Points
1. High level of abstraction
2. Automatic threading and vectorization
3. Modularity of code
4. Guaranteed race-free, deterministic application
5. Code as if it is serial, single core

32 Register and Download Now:
- Read Documentation:
- Ask questions or share product usage at the forum
- Read the Knowledge Base
- How to get started
- Attend the advanced Intel ArBB webinar (Nov)


33 Intel Parallel Building Blocks (Intel PBB)
- What Is It? Performance Unleashed: tools to optimize application performance for the latest CPU features
- What to Use?
  - Intel Cilk Plus: compiler extensions to simplify task and data parallelism
  - Intel Threading Building Blocks: C++ template library for task parallelism
  - Intel Array Building Blocks: sophisticated library for data parallelism
- Benefit: mix and match to optimize your app's performance


34 Intel's Family of Parallel Models
- Intel Parallel Building Blocks (PBB): Intel Threading Building Blocks (TBB), Intel Array Building Blocks (ArBB), Intel Cilk Plus
- Fixed Function Libraries: Intel Math Kernel Library (MKL), Intel Integrated Performance Primitives (IPP)
- Established Standards: MPI, OpenMP*
- Research and Exploration: Intel Concurrent Collections, OpenCL*

35 Other Reference
- Recorded webinars on Intel PBB
- Read About More Parallelism:
- Give us feedback: Forum and Beta Survey

37 Backup Slides
- Scalar types
- Container types
- Declaration and initialization
- Range, binding, copy_in/copy_out
- Flow controls

38 Scalar types
Scalar types provide equivalent functionality to the scalar types built into C/C++.
Types               Description                                                   C++ equivalents
f32, f64            32/64-bit floating-point numbers                              float, double
i8, i16, i32, i64   8/16/32/64-bit signed integers                                char, short, int
u8, u16, u32, u64   8/16/32/64-bit unsigned integers                              unsigned char/short/int
boolean             Boolean value (true or false)                                 bool
isize, usize        Signed/unsigned integers sufficiently large to store addresses   size_t
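The size guarantees in the table can be checked against standard C++ types. The aliases below live in a hypothetical `toy` namespace and are plain typedefs, so they only mirror the sizes listed on the slide; the slide claims equivalent functionality for the ArBB types, not that they are the same types. Mapping `isize` to `std::ptrdiff_t` is an assumption of this sketch, and the `float`/`double` size checks hold on mainstream platforms rather than being guaranteed by the language.

```cpp
#include <cstddef>
#include <cstdint>

namespace toy {

// Plain aliases that mirror the sizes in the table (NOT the ArBB types).
using f32 = float;
using f64 = double;
using i8 = std::int8_t;
using i16 = std::int16_t;
using i32 = std::int32_t;
using i64 = std::int64_t;
using u8 = std::uint8_t;
using u16 = std::uint16_t;
using u32 = std::uint32_t;
using u64 = std::uint64_t;
using usize = std::size_t;
using isize = std::ptrdiff_t;  // assumption: signed counterpart of size_t

// Compile-time checks of the widths named in the table.
static_assert(sizeof(f32) == 4, "f32 is expected to be 32 bits wide");
static_assert(sizeof(f64) == 8, "f64 is expected to be 64 bits wide");
static_assert(sizeof(i64) == 8 && sizeof(u64) == 8, "64-bit integers");
static_assert(sizeof(usize) >= sizeof(void*), "usize should hold an address");

}  // namespace toy
```

If any of the assumptions does not hold on a given platform, the translation unit fails to compile instead of misbehaving at run time.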

39 Containers
- Regular containers: dense<T>, dense<T, 2>, dense<T, 3>
- Irregular containers: nested
[Figure: container diagrams. Remaining labels: array< >, dense<array< >>]

40 Declaration and Construction
Declaration                Element type   Dimensionality   Size
dense<f32> a;              f32            1                0
dense<i32, 2> b;           i32            2                0, 0
dense<f32> c(1000);        f32            1                1000
dense<f32> d(c);           f32            1                1000
dense<i8, 3> e(5, 3, 2);   i8             3                5, 3, 2

41 Moving Data into and out of Containers
Dense containers provide three ways to access data:
- Data copy operations
  - copy_in to copy data into the container
  - copy_out to copy data out of the container
- Iterators
  - read_only_range iterator to read from the container
  - write_only_range iterator to write into the container
  - read_write_range iterator to read and write a container
- Binding
  - On construction, dense containers can be bound (associated) to a particular data location
  - Data is moved into and out of that location when required

42 Filling dense Containers
    // request write access to container
    dense<f32> a(1024);
    range<f32> range_a = a.write_only_range();
    std::fill(range_a.begin(), range_a.end(), static_cast<f32>(1));
    // request read/write access to container
    dense<f32> b(1024);
    range<f32> range_b = b.read_write_range();
    std::fill(range_b.begin(), range_b.end(), static_cast<f32>(2));


43 Loops
For loop:
    _for (begin, end, step) {   // note use of commas, not semicolons!
        /* code */
    } _end_for;                 // note use of termination keyword
Example:
    _for (i32 i = 0, i <= N, i++) {
        /* code */
    } _end_for;


44 Loops
While loop:
    _while (condition) {
        /* code */
    } _end_while;
Supporting statements:
- Exit the loop with _break
- Skip the remainder of the current iteration with _continue

45 Conditionals
if statement:
    _if (condition) {
        /* code */
    } _end_if;
if statement with else if:
    _if (condition) {
        /* code */
    } _else_if (condition) {
        /* code */
    } _else {
        /* code */
    } _end_if;
if statement with else:
    _if (condition) {
        /* code */
    } _else {
        /* code */
    } _end_if;
