REPARA: Reeengineering for Heterogeneous Parallelism for Performance and Energy in C++

Size: px

Start display at page:

Download "REPARA: Reeengineering for Heterogeneous Parallelism for Performance and Energy in C++"

Kathryn Fletcher
5 years ago
Views:

1 REPARA: Reeengineering for Heterogeneous Parallelism for Performance and Energy in C++ J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid January 21, 2015 J. Daniel Garcia 1/34

2 Introduction 1 Introduction 2 Source code preparation 3 Application partitioning 4 Transformation analysis 5 From attributes to run-time 6 Summary J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 2/34

3 Introduction Context The end for the free-lunch era. J. Daniel Garcia 3/34

4 Introduction Context The end for the free-lunch era. New architectures with diversity in computing elements. Multi-cores, GPUs, DSPs, FPGAs. J. Daniel Garcia 3/34

5 Introduction Context The end for the free-lunch era. New architectures with diversity in computing elements. Multi-cores, GPUs, DSPs, FPGAs. A switch of focus. From performance centric serial computations, to energy efficient parallel computations. J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 3/34

6 Introduction Context The end for the free-lunch era. New architectures with diversity in computing elements. Multi-cores, GPUs, DSPs, FPGAs. A switch of focus. From performance centric serial computations, to energy efficient parallel computations. Programming heterogeneous parallel architectures: Lack of unified programming model for diverse devices. Need to maximize performance and energy efficiency. Costly development process porting to multiple devices. Need to modernize existing legacy code bases. J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 3/34

7 Introduction The REPARA Vision Vision The REPARA project aims to help in the transformation and deployment of new and legacy applications in parallel heterogeneous computing architectures while maintaining balance between: application performance, energy efficiency, and source code maintainability. J. Daniel Garcia 4/34

8 Introduction A workflow for code transformations J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 5/34

9 Introduction Problems to be solved (I) Source code preparation: Adaptation of legacy code. Enforcement of C++ subset (REPARA-C++). J. Daniel Garcia 6/34

10 Introduction Problems to be solved (I) Source code preparation: Adaptation of legacy code. Enforcement of C++ subset (REPARA-C++). Application partitioning: Describe the target platform. Map software components to specific computing devices. J. Daniel Garcia 6/34

11 Introduction Problems to be solved (I) Source code preparation: Adaptation of legacy code. Enforcement of C++ subset (REPARA-C++). Application partitioning: Describe the target platform. Map software components to specific computing devices. Transformation analysis: Identify transformation opportunities. Generation of an Abstract Intermediate Representation (REPARA-AIR). J. Daniel Garcia 6/34

12 Introduction Problems to be solved (II) Source code transformation: Interactive and non-interactive refactoring. Automated transformation to FPGA. J. Daniel Garcia 7/34

13 Introduction Problems to be solved (II) Source code transformation: Interactive and non-interactive refactoring. Automated transformation to FPGA. Runtime engineering: Coordination of software components (FastFlow). Manage statically and dynamically partitioned applications. J. Daniel Garcia 7/34

14 Introduction Problems to be solved (II) Source code transformation: Interactive and non-interactive refactoring. Automated transformation to FPGA. Runtime engineering: Coordination of software components (FastFlow). Manage statically and dynamically partitioned applications. Continuous evaluation: Prediction and monitoring of performance and energy. Evalutation of software maintainability. J. Daniel Garcia 7/34

15 Source code preparation 1 Introduction 2 Source code preparation 3 Application partitioning 4 Transformation analysis 5 From attributes to run-time 6 Summary J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 8/34

16 Source code preparation Legacy code C++ is an evolving language, but highly backwards compatible. And it has a C subset. J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 9/34

17 Source code preparation Legacy code C++ is an evolving language, but highly backwards compatible. And it has a C subset. In the context of parallel heterogeneous architectures a software component may run at: Host side: Runs at the CPU. Supports all the ISO C++11 features. Device side: Runs on device (GPU, FPGA, DSP). Restricted subset from ISO C++11. J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 9/34

18 Source code preparation GPU: Examples Forbid the use of bit-fields data members in structures. Disallow the use of VLAs and flexible array data members. GPU software components cannot use dynamic type binding. virtual member functions. Virtual inheritance. Memory management through new/delete cannot be used. GPU software components cannot make use of exceptions. Restricted version of the C++ standard library. J. Daniel Garcia 10/34

19 Source code preparation FPGA: Examples System calls are not supported. Pointer casting is not allowed unless it is between native C types. Recursive functions are not allowed. Arrays of pointers are supported only if each pointer points to a scalar or array of scalars. Dynamic memory management is not supported. J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 11/34

Source code preparation Cevelop: An IDE for REPARA http://www.cevelop.

20 Source code preparation Cevelop: An IDE for REPARA Based on Eclipse IDE. Already includes some refactorings for C/C++ improvement. Future versions: Application partitioning. Transformation analysis. Source code transformation. J. Daniel Garcia 12/34

21 Application partitioning 1 Introduction 2 Source code preparation 3 Application partitioning 4 Transformation analysis 5 From attributes to run-time 6 Summary J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 13/34

22 Application partitioning Platform descriptions HPP-DL: A description language to represent all elements from a parallel heterogeneous platform. JSON based. A platform description includes information about hardware and other platform specific information (e.g. I/O ports, IRQs,... ). J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 14/34

23 Application partitioning HPP-DL: example { / Metainformation of HPP DL / class : hpp, description : Human readable description, version : 1.0, date : :00, components :[ / Definition of harware platform / { class : platform, id : platform :0, description : REPARA Reference System. X9DRG QF (To be filled by O.E.M.), model : X9DRG QF, vendor : Supermicro Inc., numa nodes : 2, processors : 2, cores : 4, pu num : 8, global mem size : 16 GiB, capabilities : [] }, / Definition of processor 0 / { class : processor,... J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 15/34

24 Application partitioning Kernel anotations through attributes Source code can be annotated to identify kernels from an application. Many of this annotations can be automatically generated. May be manually refined. J. Daniel Garcia 16/34

25 Application partitioning Kernel anotations through attributes Source code can be annotated to identify kernels from an application. Many of this annotations can be automatically generated. May be manually refined. Example C++ offers attributes as an alternate to traditional pragmas. Less verbose annotation mechanism in some situations. Better integration with language syntax. [[ rpr :: kernel, rpr :: target (CPU,GPU), rpr::in(a,b,n,data), rpr :: out(c)]] for (int i=0; i<n; ++i) for (int j=0; j<i; ++j) C[i ] = A[i ] B[j ] + data; J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 16/34

26 Application partitioning Static partitioning J. Daniel Garcia 17/34

27 Absolute(time(in(log(scale(Sms. Relative(time( in(percentage REPARA: Reeengineering for Heterogeneous Parallelism for Performance and Energy in C++ Application partitioning Evaluation: Transitive closure NIIIIIIVI NIIIIIVI NIIIIVI NIIIVI NIIVI NIVI NVI Comparison(of(execution(times NIIf 9If 8If 7If 6If 5If 4If 3If 2If NIf If ABSUMAX ABSUVALUE ABSUMIN RELUVALUE Problem(size(Sbase(size. J. Daniel Garcia 18/34

28 Transformation analysis 1 Introduction 2 Source code preparation 3 Application partitioning 4 Transformation analysis 5 From attributes to run-time 6 Summary J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 19/34

29 Transformation analysis Identifying opportunities Kernels can be identified: Manually by the programmer. By an automated tool. J. Daniel Garcia 20/34

30 Transformation analysis Identifying opportunities Kernels can be identified: Manually by the programmer. By an automated tool. Identifying a kernel requires: Identify input and output parameters. Identify target devices where the kernel is valid. Identify size parameters. Additionally: More sophisticated properties. Patterns (e.g. pipeline, farm,... ). J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 20/34

31 Transformation analysis An Abstract Intermediate Representation Transformations of software components to multiple programming models. Adopt common strategy in compiler technology: a front-end and a back-end. J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 21/34

32 Transformation analysis An Abstract Intermediate Representation Transformations of software components to multiple programming models. Adopt common strategy in compiler technology: a front-end and a back-end. front-end: Identifies transformation opportunities. Adds meta-data to original source code. Generates an abstract intermediate representation. J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 21/34

33 Transformation analysis An Abstract Intermediate Representation Transformations of software components to multiple programming models. Adopt common strategy in compiler technology: a front-end and a back-end. front-end: Identifies transformation opportunities. Adds meta-data to original source code. Generates an abstract intermediate representation. back-end: Multiple back-ends to transform to different programming models. One additional back-end for FPGA. J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 21/34

34 From attributes to run-time 1 Introduction 2 Source code preparation 3 Application partitioning 4 Transformation analysis 5 From attributes to run-time 6 Summary J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 22/34

35 From attributes to run-time Simple kernels int main() { constexpr size t size a = 256, size b = 32; std :: vector<long> A(size a), B(size b); [[ rpr :: kernel, rpr :: in(a[]), rpr :: out(a[]), rpr :: target (CPU)]] for ( size t i=0; i<a.size(); ++i) A[i ] = F(i ) ; [[ rpr :: kernel, rpr :: in(b[]), rpr :: out(b[]), rpr :: target (CPU)]] for ( size t i=0; i<b.size(); ++i) B[i ] = G(i) ; [[ rpr :: kernel, rpr :: in(a [], B[]), rpr :: out(x), rpr :: target (CPU)]] long x = H(A,B); std :: cout << x << std::endl; return 0; } J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 23/34

36 From attributes to run-time Transformation of simple kernels int main() { constexpr size t size a = 256, size b = 32; std :: vector<long> A(size a), B(size b); ff :: ParallelFor pf; pf. parallel for (0, A.size(), [&A](long i) { A[i ] = F(i ) ; }); pf. parallel for (0, B.size(), [&B](long i) { B[i ] = G(i) ; }); long x = H(A,B); std :: cout << x << std::endl; return 0; } J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 24/34

37 From attributes to run-time Matrix Vector multiplication int main() { vector<float> M( ), V(1024), R(16); [[ rpr :: kernel, rpr :: out(m[ ]), rpr:: target (CPU)]] for ( size t i=0; i<16; ++i) for ( size t j=0; j<1024; ++j) M[i 1024+j] = static cast<float>(i 1024+j+1); [[ rpr :: kernel, rpr :: out(v[1024]), rpr :: target (CPU)]] for ( size t i=0; i<1024; ++i) V[j]=1.0; [[ rpr :: kernel, rpr :: in(m[ ], V[1024]), rpr::out(r[16]), rpr :: target (GPU)]] for ( size t i=0; i<16; ++i) { float sum = 0.0; for ( size t j=0; j<1024; ++j) sum += M[i 1024+j] V[j]; R[i ] = sum; } print result (R,16); } J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 25/34

38 From attributes to run-time Pipelines... [[ rpr :: pipeline, rpr :: stream(b,c)]] for ( size t y=0; y<max; ++y) { [[ rpr :: kernel, rpr :: out(b[]) ]] for ( size t i=0; i<n; ++i) for ( size t j=0; j<n; ++j) B[i N + j ] = y + float{i+j}; [[ rpr :: kernel, rpr :: in(b[]), rpr :: out(c[]) ]] for ( size t i=0; i<n; ++i) for ( size t j=0; j<n; ++j) for ( size t k=0; j<n; ++k) C[i N+j] = A[i N+k] B[k N+j] [[ rpr :: kernel, rpr :: in(c[]), rpr :: out(r[]) ]] for ( size t i=0; i<n; ++i) for ( size t j=0; j<n; ++j) R[i N+j] = C[j N+i] }... J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 26/34

39 From attributes to run-time Farm [[ rpr :: pipeline, rpr :: stream(b,c)]] for ( size t y=0; y<max; ++y) { [[ rpr :: kernel, rpr :: out(b[]), rpr :: farm(2) ]] for ( size t i=0; i<n; ++i) for ( size t j=0; j<n; ++j) B[i N + j ] = y + float{i+j}; [[ rpr :: kernel, rpr :: in(a [], B[]), rpr :: out(c[]), rpr :: farm() ]] for ( size t i=0; i<n; ++i) for ( size t j=0; j<n; ++j) for ( size t k=0; j<n; ++k) C[i N+j] = A[i N+k] B[k N+j] [[ rpr :: kernel, rpr :: in(c[]), rpr :: out(r[]) ]] for ( size t i=0; i<n; ++i) for ( size t j=0; j<n; ++j) R[i N+j] = C[j N+i] } J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 27/34

40 From attributes to run-time Map [[ rpr :: pipeline, rpr :: stream(b,c)]] for ( size t y=0; y<max; ++y) { [[ rpr :: kernel, rpr :: out(b[]), rpr :: map(2)]] for ( size t i=0; i<n; ++i) for ( size t j=0; j<n; ++j) B[i N + j ] = y + float{i+j}; [[ rpr :: kernel, rpr :: in(a [], B[]), rpr :: out(c[]), rpr :: map()]] for ( size t i=0; i<n; ++i) for ( size t j=0; j<n; ++j) for ( size t k=0; j<n; ++k) C[i N+j] = A[i N+k] B[k N+j] [[ rpr :: kernel, rpr :: in(c[]), rpr :: out(r[]) ]] for ( size t i=0; i<n; ++i) for ( size t j=0; j<n; ++j) R[i N+j] = C[j N+i] } J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 28/34

41 From attributes to run-time Reductions long r = std :: numeric limits<long>::min(); [[ rpr :: kernel, rpr :: in(a[]), rpr :: reduce(max,r)]] for ( size t i=0; i<n; ++i) r = std :: max(a[i], r) ; std :: cout << r << std::endl; J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 29/34

42 From attributes to run-time Asynchronous execution [[ rpr :: kernel, rpr :: async]] for ( size t i=0;i<n; ++i) for ( size t j=0; j<n; ++j) { A[i N+j] = float{i+j}; B[i N+j] = float{abs(j i)}; } [[ rpr :: kernel, rpr :: async]] for ( size t i=0; i<n; ++i) V[i ] = float{i}; [[ rpr :: sync]] ; // Explicit sync J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 30/34

43 From attributes to run-time Asynchronous execution [[ rpr :: kernel, rpr :: async]] for ( size t i=0;i<n; ++i) for ( size t j=0; j<n; ++j) { A[i N+j] = float{i+j}; B[i N+j] = float{abs(j i)}; } [[ rpr :: kernel, rpr :: async]] for ( size t i=0; i<n; ++i) V[i ] = float{i}; [[ rpr :: kernel ]] // Implicit sync f () ; J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 31/34

44 Summary 1 Introduction 2 Source code preparation 3 Application partitioning 4 Transformation analysis 5 From attributes to run-time 6 Summary J. Daniel Garcia ARCOS@UC3M (josedaniel.garcia@uc3m.es) 32/34

45 Summary Summary Performance, energy efficiency and source code maitainability need to be balanced. Legacy code needs to be considered. Much more legacy code than new code out there. Refactoring C++ code to: Enforce specific device rules. Apply transformations to specific programming models. Generate FPGA code. Application partitioning using HW description, kernel measurements and code. Transformation combining front-end and back-end. C++ attributes to enrich code with annotations. J. Daniel Garcia 33/34

46 Summary REPARA: Reeengineering for Heterogeneous Parallelism for Performance and Energy in C++ J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid January 21, 2015 J. Daniel Garcia 34/34

OpenCL Overview. Shanghai March Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group

OpenCL Overview. Shanghai March Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group Copyright Khronos Group, 2012 - Page 1 OpenCL Overview Shanghai March 2012 Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group Copyright Khronos Group, 2012 - Page 2 Processor