Welcome to the jungle. The free lunch is so over

1 Herb Sutter

1975-2005 Put a computer on every desk, in every home, in every pocket.

2 Put a computer on every desk, in every home, in every pocket.
The free lunch is so over: Put a parallel supercomputer on every desk, in every home, in every pocket.
Welcome to the jungle: Put a heterogeneous supercomputer on every desk, in every home, in every pocket.

5 Processors Memory

6 Processors / Memory diagram. Labels: Xbox 360 & mainstream computer, AMD GPU, AMD 80x86, Fusion APU, Other GPU, Athlon, Phenom II

7 Processors / Memory diagram. Labels: Xbox 360 & mainstream computer, AMD GPU, AMD 80x86, Athlon, Fusion APU, Phenom II, Other GPU, Microsoft Azure Cloud Computing, Cloud + GPU

8 Processors Memory

9 Processors / Memory diagram. Labels: (GP)GPU, Cloud IaaS/HaaS, Multicore CPU, ISO C++0x

11 Processors / Memory diagram. Labels: (GP)GPU, Cloud IaaS/HaaS, Multicore CPU, ISO C++0x, C++ PPL

12 Processors / Memory diagram. Labels: (GP)GPU with DirectCompute and "?", Cloud IaaS/HaaS, Multicore CPU, ISO C++0x, C++ PPL

13 Processors / Memory diagram. Labels: (GP)GPU with DirectCompute and C++ AMP (Accelerated Massive Parallelism), Cloud IaaS/HaaS, Multicore CPU, ISO C++0x, C++ PPL

14 Convert this (serial loop nest)

    void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                     int M, int N, int W )
    {
        for (int y = 0; y < M; y++)
            for (int x = 0; x < N; x++) {
                float sum = 0;
                for(int i = 0; i < W; i++)
                    sum += A[y*W + i] * B[i*N + x];
                C[y*N + x] = sum;
            }
    }

15 Convert this (serial loop nest, as on slide 14) to this (parallel loop, CPU or GPU)

    void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                     int M, int N, int W )
    {
        array_view<const float,2> a(M,W,A), b(W,N,B);
        array_view<writeonly<float>,2> c(M,N,C);
        parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
            float sum = 0;
            for(int i = 0; i < a.x; i++)
                sum += a(idx.y, i) * b(i, idx.x);
            c[idx] = sum;
        } );
    }
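The parallel version on this slide uses the C++ AMP preview syntax (restrict(direct3d), array_view), which a standard compiler will not accept. As a rough illustration of the same idea on CPU threads only, here is a minimal sketch in standard C++11. It is not the C++ AMP implementation: the name MatrixMultThreads and the num_threads parameter are assumptions made for this example, and rows are simply dealt out round-robin to std::thread workers.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical CPU-only analogue of the slide's MatrixMult (not C++ AMP).
// A is M x W, B is W x N, C is M x N, all stored row-major.
void MatrixMultThreads(float* C, const std::vector<float>& A,
                       const std::vector<float>& B,
                       int M, int N, int W, unsigned num_threads = 4) {
    if (M <= 0 || N <= 0) return;
    // Use at least one thread and never more threads than rows.
    num_threads = std::max(1u, std::min(num_threads, static_cast<unsigned>(M)));
    const int stride = static_cast<int>(num_threads);

    std::vector<std::thread> workers;
    for (int t = 0; t < stride; ++t) {
        workers.emplace_back([=, &A, &B] {
            // Thread t handles rows t, t + stride, t + 2*stride, ...
            // so no two threads ever write the same element of C.
            for (int y = t; y < M; y += stride)
                for (int x = 0; x < N; ++x) {
                    float sum = 0;
                    for (int i = 0; i < W; ++i)
                        sum += A[y * W + i] * B[i * N + x];
                    C[y * N + x] = sum;
                }
        });
    }
    for (auto& w : workers) w.join();  // wait for every row to be computed
}
```

The inner loops are unchanged from the serial version; only the outer loop over rows is split, which is the same decomposition the slide expresses with parallel_for_each over c.grid.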

17 EVOLUTION OF HETEROGENEOUS COMPUTING
Axis: Architecture Maturity & Programmer Accessibility, from Poor to Excellent

Proprietary Drivers Era (Graphics & Proprietary Driver-based APIs)
    "Adventurous" programmers
    Exploit early programmable "shader cores" in the GPU
    Make your program look like "graphics" to the GPU
    CUDA, Brook+, etc.

Standards Drivers Era (OpenCL, DirectCompute Driver-based APIs)
    Expert programmers
    C and C++ subsets
    Compute centric APIs, data types
    Multiple address spaces with explicit data movement
    Specialized work queue based structures
    Kernel mode dispatch

Architected Era (Fusion System Architecture, GPU Peer Processor)
    Mainstream programmers
    Full C++
    GPU as a co-processor
    Unified coherent address space
    Task parallel runtimes
    Nested Data Parallel programs
    User mode dispatch
    Pre-emption and context switching

Source: The Programmer's Guide to the APU Galaxy, June 2011

18 Processors Memory

19 Single-core to multi-core ISO C++0x? PPL Parallel Patterns Library (VS2010)

20 ISO C++0x forall( x, y ) forall( z; w; v ) forall( k, l, m, n )...? Single-core to multi-core PPL Parallel Patterns Library (VS2010)

21 Single-core to multi-core. ISO C++0x: λ (lambda functions). PPL: Parallel Patterns Library (VS2010)

    parallel_for_each( items.begin(), items.end(), [=]( Item e ) {
        your code here
    } );

22 1 language feature for multicore and STL, functors, callbacks, events,...

23 ? ISO C++0x Multi-core to hetero-core C++ AMP Accelerated Massive Parallelism

24 Multi-core to hetero-core. ISO C++0x: restrict. C++ AMP: Accelerated Massive Parallelism

    parallel_for_each( items.grid, [=](index<2> i) restrict(direct3d) {
        your code here
    } );

25 1 language feature for heterogeneous cores

26 Processors Memory

27 Problem: Some cores don't support the entire C++ language.
Solution: General restriction qualifiers enable expressing language subsets within the language. Direct3D math functions in the box.
Example:

    double sin( double );                      // 1a: general code
    double sin( double ) restrict(direct3d);   // 1b: specific code
    double cos( double ) restrict(direct3d);   // 2: same code for either

    parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
        sin( data.angle );   // ok, chooses overload based on context
        cos( data.angle );   // ok
    });
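Standard C++ has no restrict(...) qualifier, so the overload selection described on this slide cannot be reproduced literally. As a loose analogy only, the sketch below uses explicit tag types to pick an overload per execution context. Everything in it is hypothetical (cpu_context, accel_context, which_sin, which_cos, run_kernel), and unlike C++ AMP, where the compiler infers the context from the calling function's restriction, the context here has to be passed by hand.

```cpp
#include <string>

// Hypothetical tags standing in for restrict(cpu) and restrict(direct3d).
struct cpu_context {};
struct accel_context {};

// 1a: general code, 1b: accelerator-specific code (two overloads).
std::string which_sin(cpu_context)   { return "general sin"; }
std::string which_sin(accel_context) { return "accelerator sin"; }

// 2: a single definition that serves either context.
template <typename Context>
std::string which_cos(Context) { return "shared cos"; }

// A kernel body parameterized on the context it runs in; overload
// resolution picks the matching which_sin at compile time.
template <typename Context>
std::string run_kernel(Context ctx) {
    return which_sin(ctx) + ", " + which_cos(ctx);
}
```

Calling run_kernel(accel_context{}) selects the accelerator overload of which_sin and the shared which_cos, mirroring the "chooses overload based on context" comment on the slide.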

28 Initially supported restriction qualifiers:
    restrict(cpu): The implicit default.
    restrict(direct3d): Can execute on any DX11 device via DirectCompute. Restrictions follow limitations of the DX11 device model (e.g., no function pointers, virtual calls, goto).
Potential future directions:
    restrict(pure): Declare and enforce that a function has no side effects. Great to be able to state declaratively for parallelism.
    General facility for language subsets, not just about compute targets.

29 Problem: Memory may be flat, nonuniform, incoherent, and/or disjoint.
Solution: Portable view that works like an N-dimensional "iterator range".
Future-proof: No explicit .copy()/.sync(); data is moved as needed by each actual device.
Example:

    void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                     int M, int N, int W )
    {
        array_view<const float,2> a(M,W,A), b(W,N,B);   // 2D views over C++ std::vector
        array_view<writeonly<float>,2> c(M,N,C);        // 2D view over C array
        parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
        } );
    }
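The "portable view" idea can be illustrated in plain C++ with a small non-owning wrapper that maps (row, column) to an offset in row-major storage. This is a minimal sketch under stated assumptions, not array_view itself: View2D and MatrixMultViews are hypothetical names, and the sketch does none of the device copying or synchronization that the slide attributes to array_view.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning 2D view over row-major storage (not array_view).
// The underlying buffer must outlive the view.
template <typename T>
class View2D {
public:
    View2D(std::size_t rows, std::size_t cols, T* data)
        : rows_(rows), cols_(cols), data_(data) {}

    // Element access by (row, column); no bounds checking in this sketch.
    T& operator()(std::size_t row, std::size_t col) const {
        return data_[row * cols_ + col];
    }
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    T* data_;
};

// Serial matrix multiply written against views instead of raw index math.
void MatrixMultViews(View2D<float> c, View2D<const float> a,
                     View2D<const float> b) {
    for (std::size_t y = 0; y < c.rows(); ++y)
        for (std::size_t x = 0; x < c.cols(); ++x) {
            float sum = 0;
            for (std::size_t i = 0; i < a.cols(); ++i)
                sum += a(y, i) * b(i, x);
            c(y, x) = sum;
        }
}
```

The same view type wraps either a C array or the data() pointer of a std::vector, so the kernel body no longer cares which kind of storage sits underneath, which is the portability argument the slide makes.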

31 Bring CPU debugging experience to the GPU

36 Chart: # cores, not counting SIMD. Labels: InO CPU, OoO CPU, GPU, Cloud OoO, Cloud GPU

37 Same chart as slide 36, annotated: "The free lunch is so over" and "Welcome to the jungle"

38 Processors Memory

39 C++ PPL: 9:45am. C++ AMP: 2:00pm, Room 406. Herb Sutter

THE PROGRAMMER'S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow, AMD

THE OPPORTUNITY WE ARE SEIZING: Make the unprecedented processing capability of the APU as accessible to programmers as the CPU