Cell Programming Maciej Cytowski (ICM) PRACE Workshop New Languages & Future Technology Prototypes, March 1-2, LRZ, Germany
2 Agenda
- Introduction to the technology
- Cell programming models:
  - SPE runtime management library (libspe2)
  - OpenMP and the CBE XLC compiler
  - Cell SuperScalar (CellSs)
  - OpenCL
  - T-Platforms Cell Compiler
  - Accelerated Library Framework (ALF)
  - Data Communication and Synchronization (DaCS), hybrid models
- Summary
- FFT on CellSs
3 Introduction to the technology
- Multicore, heterogeneous design: one PPE and 8 SPEs (3.2 GHz)
- SPEs are SIMD cores, each with a 256KB Local Store
- Main Memory (max. 16GB)
- Cell based computing: Cell clusters, Cell + x86 clusters, Cell + POWER clusters (RoadRunner-like systems)
- Novel systems: QPACE supercomputer, desktop Cell based workstations
- QS22 blade
4 ICM Nautilus cluster
- 80 IBM QS22 nodes, 20 IBM LS21 nodes, 4x DDR Infiniband
- Green500: 1st place Nov 08, 1st place Jun 09 (Mflops/Watt)
- PlayStation3 (1 console, 2 guitars, 1 mic, 1 drum set): the unofficial leaders
- Joint Cell Competence Center (IBM & ICM): application enablement on Cell
5 Cell programming models
- Libraries: libspe2, Data Communication and Synchronization (DaCS), Accelerated Library Framework (ALF)
- Single source: CellSs, OpenMP
- Optimization techniques: asynchronous DMA transfers, double-buffering, SIMDization, loop unrolling, memory alignment, assembler optimizations (AsmViz)
- Auto-parallelization: T-Platforms compiler
- Novel standards: OpenCL
6 Performance comparison
How to compare performance of Cell implementations?
- Compare a reference x86 to Cell rather than PPE to Cell: computations on the PPE are usually 2-3x slower than on x86
- Number-of-threads view: 16 SPEs vs 16 x86 cores
- Accelerator view: 16 SPEs vs 1 x86 core, or 16 SPEs + x86 core vs 1 x86 core
7 SPE runtime management library (libspe2)
- The SPE runtime management library (libspe2) implements an SPE thread programming model for Cell BE applications
- It constitutes the standardized low-level application programming interface (API) for application access to the Cell/B.E. SPEs
- libspe2 is used to control SPE program execution from the PPE program
- SPEs are handled as virtual objects called SPE contexts; SPE programs are loaded and executed by operating on SPE contexts
- elfspe is a PPE program that allows an SPE program to run directly from a Linux command prompt, without needing a PPE application to create an SPE thread and wait for it to complete
8 SPE programming (diagram: in the first scheme, a PPE thread sleeps while an SPE thread works; in the second, two PPE threads run alongside a working SPE thread)
9 PPE & SPE Synergistic Programming

PPE Code:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <libspe2.h>

extern spe_program_handle_t hello_spu;

int main(void)
{
    ...
    rc = spe_context_run(speid, &entry, 0, argp, envp, &stop_info);
    ...
}

SPE Code:

#include <stdio.h>

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    printf("hello world!\n");
    return 0;
}
10 PPE scheduling work for multiple SPEs (diagram: one PPE running many PPE threads, each feeding work to its own SPE thread)
11 Libspe2 functionality
- Develop 2 programs: a PPE program and an SPE program
- Programmer must take care of:
  - Implementation of the pthreads scheme
  - Communication: DMA transfers (get, put), mailboxes (short messages)
  - Optimization of the SPE code
- Developing code with libspe2 is difficult and time consuming, but it gives us full control of the Cell application, e.g. implementation of pipeline parallel schemes where each SPE has its own function
12 Example work: periodicity searching algorithms
- We have ported the code used in the OGLE (The Optical Gravitational Lensing Experiment) project using libspe2, achieving a speedup of more than 19x on QS22 against the reference x86 implementation
- Computational task: periodicity searching in observational data
- PPE manages work and I/O; I/O is overlapped with computations
- Additional PPU thread used for computations
13 OpenMP and the CBE XLC compiler
- XL C/C++ for Multicore Acceleration for Linux supports OpenMP:
  cbexlc -o program.exe -qsmp=omp program.c
- Can be used to parallelize simple loops
- Memory-intensive computations on large shared tables achieve low performance due to low single-SPE performance
14 Cell SuperScalar (CellSs)
- Task based programming model: single source, directives, runtime scheduler
- Supported libspe2 & Cell functionality: DMA transfers, mailboxes, SIMD instructions

#pragma css task input(a, b) inout(c)
void matvec(float *a, float *b, float *c)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N/B; j++)
            c[j] = a[i*N/B+j] * b[i];
}

#pragma css start
for (i = 0; i < N; i += B)
    matvec(a + i*N, b, c + i);
#pragma css finish
15 OpenCL (version 0.1.1, December 1, 2009)
- OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices
- Platform requirements: IBM BladeCenter QS22 systems running Fedora 9, or IBM BladeCenter JS23 systems running Red Hat Enterprise Linux 5.3
- Provides a full-profile Power/VMX CPU device as well as an embedded-profile SPU accelerator device
- The maximum number of compute units on an SPU accelerator device is 16; the SPU accelerator device has a maximum local memory size of 256KB
- Special care should be taken during simultaneous use, as both memory and compute resources are shared: global memory is shared between the devices, so memory consumption on one affects the availability on both
16 T-Platforms Cell Compiler
- Single source compiler for C, C++ and Fortran
- Auto-parallelization and auto-vectorization of source code
- The current implementation of Cell Compiler parallelizes loops only
- Benchmarked with the NAS Parallel Benchmarks (Embarrassingly Parallel (EP) and Conjugate Gradient (CG) problems)
- Outperforms the OpenMP implementation (cbexlc)
- Beta version available for testing purposes
- Benchmarks on bigger codes?
17 T-Platforms Cell Compiler: how to use it?

Write your code:

#define N
for (i = 0; i < N; i++)
    c[i] = sin(a[i]) + cos(b[i]);

PPU:
gcc -O3 --fast-math sincos.c -o sincos.ppu -lm
Computation time =

UTLCC:
/opt/utlcc/bin/utlcc -O3 --fast-math sincos.c -o sincos.utlcc -lm -Ws,--trace-paral
TRACE: Try to parallelize loop (#1)
TRACE: Source: somewhere at sincos.c(27:28)
TRACE: SUCCESS
Computation time =

x86:
gcc -O3 --fast-math sincos.c -o sincos.x86 -lm
Computation time =
18 Accelerated Library Framework (ALF)
- Provides a simple user-level programming framework for Cell library developers: task management, data transfer, double buffering, data communication
- Supports the multiple-program-multiple-data (MPMD) parallel programming style, where several programs run on different SPEs at one time
- Supports the scatter/gather model provided by the CBE DMA list operation
- Two implementations: ALF Cell (between PPU and SPU), ALF Hybrid (between x86_64 and PPU)
- Three roles:
  - Application developer: develops programs only at the host level, using the provided ALF libraries
  - Library programmer: uses the ALF API to provide the library interfaces (examples: BLAS, LAPACK)
  - Cell programmer: writes the optimized accelerator code (the computational kernel)
19 ALF Workflow (diagram)
20 Example: SinCos computations

for (i = 0; i < NUM_ROW; i++)
    for (j = 0; j < NUM_COL; j++)
        mat_c[i*NUM_COL+j] = sin(mat_a[i*NUM_COL+j]) + cos(mat_b[i*NUM_COL+j]);

We want to create a library that would enable us to exchange these two loops for a simple library call:

sincosfun_alf(mat_a, mat_b, mat_c, NUM_ROW, NUM_COL);
21 Example: SinCos computations (PPU)

/* This is only the initialization step */
alf_init(NULL, &alf_handle);
alf_query_system_info(alf_handle, ALF_QUERY_NUM_ACCEL, 0, &nspus);
alf_num_instances_set(alf_handle, nspus);
alf_task_desc_create(alf_handle, 0, &task_desc_handle);
/* context size */
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_TSK_CTX_SIZE, 0);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE, sizeof(sizes_t));
/* workblock size (in, out) */
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_IN_BUF_SIZE, H * V * 2 * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_OUT_BUF_SIZE, H * V * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_NUM_DTL_ENTRIES, 32);
/* maximum stack size */
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_MAX_STACK_SIZE, 3*8192);
/* image to be loaded and computational kernel */
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_IMAGE_REF_L, (unsigned long long)"sincosfun_spu");
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_LIBRARY_REF_L, (unsigned long long)"libsincosfun_spu.so");
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_KERNEL_REF_L, (unsigned long long)"comp_kernel");
22 Example: SinCos computations (PPU)

alf_task_create(task_desc_handle, NULL, nspus, 0, 0, &task_handle);
for (i = 0; i < NUM_ROW; i += H) {
    alf_wb_create(task_handle, ALF_WB_SINGLE, 0, &wb_handle);
    /* Create the input buffer */
    alf_wb_dtl_begin(wb_handle, ALF_BUF_IN, 0);
    alf_wb_dtl_entry_add(wb_handle, &mat_a[i*NUM_COL], H * V, ALF_DATA_FLOAT);
    alf_wb_dtl_entry_add(wb_handle, &mat_b[i*NUM_COL], H * V, ALF_DATA_FLOAT);
    alf_wb_dtl_end(wb_handle);
    /* Create the output buffer */
    alf_wb_dtl_begin(wb_handle, ALF_BUF_OUT, 0);
    alf_wb_dtl_entry_add(wb_handle, &mat_c[i*NUM_COL], H * V, ALF_DATA_FLOAT);
    alf_wb_dtl_end(wb_handle);
    /* Add parameters and enqueue the workblock */
    alf_wb_parm_add(wb_handle, (void *)(&sizes), sizeof(sizes), ALF_DATA_BYTE, 0);
    alf_wb_enqueue(wb_handle);
}
/* This starts the execution */
alf_task_finalize(task_handle);
23 Example: SinCos computations (SPU)

int comp_kernel(void *p_task_context, void *p_sizes_context, void *p_input_buffer,
                void *p_output_buffer, void *p_inout_buffer,
                unsigned int current_count, unsigned int total_count)
{
    unsigned int i, cnt;
    vector float *sa, *sb, *sc;
    /* Input and output buffers are available as task parameters */
    sizes_t *size = (sizes_t *) p_sizes_context;
    cnt = size->h * size->v / 4;  /* vectors of 4 floats */
    sa = (vector float *) p_input_buffer;
    sb = sa + cnt;
    sc = (vector float *) p_output_buffer;
    /* SIMD computations, unrolled by four vectors */
    for (i = 0; i < cnt; i += 4) {
        sc[i]   = spu_add(sinf4(sa[i]),   cosf4(sb[i]));
        sc[i+1] = spu_add(sinf4(sa[i+1]), cosf4(sb[i+1]));
        sc[i+2] = spu_add(sinf4(sa[i+2]), cosf4(sb[i+2]));
        sc[i+3] = spu_add(sinf4(sa[i+3]), cosf4(sb[i+3]));
    }
    return 0;
}
24 ALF summary
- Example application time: PPU 2.69s, ALF (8 SPUs) 0.24s (a ~11x speedup)
- ALF also works in a hybrid x86 setup (it is implemented on top of DaCS)
- ALF's primary architectural concepts are the task and the work block
- ALF supports task and data management, such as multiple tasks and double buffering
- ALF is a framework that contains a runtime algorithm for managing workload distribution and execution
- ALF is a framework: "don't call us, we'll call you" - the ALF developer configures the framework, which in turn executes a computational kernel as the task
25 Data Communication and Synchronization (DaCS)
- Developed by IBM for hybrid systems (i.e. RoadRunner)
- A runtime environment providing resource and process management, data communication services, and synchronization services
- Supports heterogeneous computing elements such as PPE and SPE
- Hierarchical structure, not flat like MPI
- Two implementations: DaCS Cell (between PPU and SPU), DaCS Hybrid (between x86_64 and PPU)
- DaCS Hybrid can be layered with MPI, DaCS Cell, libspe2, CellSs, ...
26 DaCS for Cell: Process Management Programming Structure

Host/PPU:
main
  dacs_init
  dacs_reserve_children
  dacs_de_start
  (other DaCS functions)
  dacs_de_wait
  dacs_release_de_list
  dacs_exit
return

Accel/SPU:
main
  dacs_init
  (other DaCS functions)
  dacs_exit
return
27 DaCS for Hybrid
- Two daemons: hdacsd (x86) and adacsd (Cell)
- Hybrid systems are defined with a special configuration file (root privileges); RoadRunner: 1 Cell processor per 1 x86 core; Nautilus: 1 Cell blade per 1 x86 core
- Handles byte-swapping
- Enables many interesting computing models: DaCS for Hybrid + DaCS for Cell, + libspe2, + PPU accelerated libraries, + CellSs
- Communication: rdma get and put commands (overlap computations and communication), mailboxes
28 DaCS for Hybrid: Process Management Programming Structure

Host/x86_64:
main
  dacs_init
  dacs_reserve_children - DACS_DE_CBE
  dacs_de_start
  (other DaCS functions)
  dacs_de_wait
  dacs_release_de_list
  dacs_exit
return

Mid/PPU:
main
  dacs_init
  dacs_reserve_children - DACS_DE_SPE
  dacs_de_start
  (other DaCS functions)
  dacs_de_wait
  dacs_release_de_list
  dacs_exit
return

Accel/SPU:
main
  dacs_init
  (other DaCS functions)
  dacs_exit
return
29 FFTW on DaCS (host side)

N = 4*524288;
num_accel = 1;
data = (fftw_complex *) memalign(128, N*sizeof(fftw_complex));
results = (fftw_complex *) memalign(128, N*sizeof(fftw_complex));

/* init */
dacs_init(DACS_INIT_FLAGS_NONE);
dacs_reserve_children(DACS_DE_CBE, &num_accel, &deid);
dacs_remote_mem_create(data, N*sizeof(fftw_complex), DACS_READ_ONLY, &data_rm);
dacs_remote_mem_create(results, N*sizeof(fftw_complex), DACS_WRITE_ONLY, &results_rm);
dacs_de_start(de_list[0], "fftwh_ppu", NULL, NULL, DACS_PROC_LOCAL_FILE, &pid);
dacs_remote_mem_share(deid, pid, data_rm);
dacs_remote_mem_share(deid, pid, results_rm);

/* Cell computations */
dacs_mailbox_read(&value, deid, pid);
dacs_de_wait(deid, pid, &exit_status);
dacs_remote_mem_destroy(&data_rm);
dacs_release_de_list(num_accel, deid);
dacs_exit();
30 FFTW on DaCS (Cell side)

dacs_init(DACS_INIT_FLAGS_NONE);
dacs_wid_reserve(&wid);
dacs_remote_mem_accept(DACS_DE_PARENT, DACS_PID_PARENT, &data_rm);
dacs_remote_mem_accept(DACS_DE_PARENT, DACS_PID_PARENT, &results_rm);
data_local = (fftw_complex *) memalign(128, N*NFFT*sizeof(fftw_complex));
dacs_get(data_local, data_rm, 0, 2*NFFT*N*sizeof(double), wid, DACS_ORDER_ATTR_NONE, DACS_BYTE_SWAP_DOUBLE_WORD);
dacs_wait(wid);
fftplan = fftw_plan_dft_1d(N*NFFT, data_local, data_local, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(fftplan);
dacs_put(results_rm, 0, data_local, 2*NFFT*N*sizeof(double), wid, DACS_ORDER_ATTR_NONE, DACS_BYTE_SWAP_DOUBLE_WORD);
dacs_wait(wid);
dacs_mailbox_write(&value, DACS_DE_PARENT, DACS_PID_PARENT);
dacs_remote_mem_release(&data_rm);
dacs_remote_mem_release(&results_rm);
dacs_wid_release(&wid);
free(data_local);
dacs_exit();
31 FFTW on DaCS: timings
x86: FFT: s
Cell: FFT: s, Communication time (GE): s
32 GADGET on DaCS
- GADGET is a freely available code for cosmological N-body/SPH simulations on massively parallel computers with distributed memory
- GADGET uses an explicit communication model implemented with the standardized MPI communication interface; it is also one of the PRACE codes
- Goal: accelerate the N-body simulations performed by the GADGET code
- A speedup of 3.9x-4.2x over the reference x86 implementation (accelerator view)
- The MPI structure of the code was not changed; we are able to run it in a multiple-CPU environment (currently tested on a RoadRunner-like machine in the IBM Research Center)
- Work by Tomasz Kłos
33 Summary
- DaCS is my favorite Cell programming model at the moment
- Try many different combinations: DaCS only, or DaCS + (libspe2, CellSs, ALF, OpenMP)
- Future of the Cell processor: developing new Cell hybrid programming techniques, functional decomposition of computations for hybrid systems
34 FFTs on Cell
- Implement and optimize a mod2f radix-4 FFT on the Cell processor with Cell SuperScalar
- Measure performance, produce a diary with development time
35 FFT on CellSs (chart)
36 Diary
Columns: Development Time (mins, hours or days) | Achieved Performance (Mflops, MLUs, sec) | Number of Cores (if applicable) | Dataset used (if applicable) | Comments (What did you do during this porting step? Why? Problems faced, etc.)

- 1h, PPU (10 times): Original version of the code running on the Power Processing Unit of the Cell chip, no optimization applied yet. Very poor performance!
- 8h, SPUs (10 times): First try: porting of the implemented algorithm step by step. Here, the first step was implemented: parallel computation of the sin and cos table. Additionally, the data layout has been changed (now a table of complex double precision numbers is created - good for performance).
- 16h, SPUs (10 times): Parallel implementation of the first two radix-4 iterations. Performance gain is rather poor, also for smaller problems. I need to redesign the porting process.
37 Diary
- 32h, SPUs (10 times): Starting from the beginning. Completely new FFT implementation with CellSs. No SIMDization here. Many barriers for debug.
- 32h, SPUs (10 times): Code was partially SIMDized. Bit reversal is now performed on the fly (DMA put calls to bit-reversed memory addresses) from the SPUs (in parallel).
- 16h, SPUs (10 times): Fully SIMDized code. It should push the performance after some further tuning. Code needs to be further optimized. The computational part is not the most expensive one. Looking for reasons...
38 Diary
- 32h, SPUs (10 times): I've found a reason for the poor performance: a very inefficient way of reversing bits in 32-bit numbers. Changed to a fast bit reversing method. The performance of my FFT starts to look good. Looking for some more optimizations.
- 32h, SPUs (10 times): Code inlining and loop unrolling were added to this version. Both of these optimizations were performed by hand. The performance should be better. Looking for reasons...
- 112h, SPUs (10 times): Some of the FFT symmetry rules were used. Some CellSs parallel tuning was performed. SIMDMath library trigonometric functions were inlined. Many other changes. Assumption: the CellSs version will be used for M>15 due to task-granularity problems for smaller sizes.
39 FFT on CellSs vs. FFTW: best result 2656 Mflops
40 Thank you for your attention
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationSPE Runtime Management Library
SPE Runtime Management Library Version 2.0 CBEA JSRE Series Cell Broadband Engine Architecture Joint Software Reference Environment Series November 11, 2006 Table of Contents 2 Copyright International
More informationIntegrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali
Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers
More informationCELL CULTURE. Sony Computer Entertainment, Application development for the Cell processor. Programming. Developing for the Cell. Programming the Cell
Dmitry Sunagatov, Fotolia Application development for the Cell processor CELL CULTURE The Cell architecπture is finding its way into a vast range of computer systems from huge supercomputers to inauspicious
More informationReconstruction of Trees from Laser Scan Data and further Simulation Topics
Reconstruction of Trees from Laser Scan Data and further Simulation Topics Helmholtz-Research Center, Munich Daniel Ritter http://www10.informatik.uni-erlangen.de Overview 1. Introduction of the Chair
More informationAutoTune Workshop. Michael Gerndt Technische Universität München
AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationMulticore Challenge in Vector Pascal. P Cockshott, Y Gdura
Multicore Challenge in Vector Pascal P Cockshott, Y Gdura N-body Problem Part 1 (Performance on Intel Nehalem ) Introduction Data Structures (1D and 2D layouts) Performance of single thread code Performance
More informationThe Pennsylvania State University. The Graduate School. College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE
The Pennsylvania State University The Graduate School College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE A Thesis in Electrical Engineering by Srijith Rajamohan 2009
More informationMIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011
MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise June 2011 FREE LUNCH IS OVER, CODES HAVE TO MIGRATE! Many existing legacy codes needs to migrate to
More informationOur new HPC-Cluster An overview
Our new HPC-Cluster An overview Christian Hagen Universität Regensburg Regensburg, 15.05.2009 Outline 1 Layout 2 Hardware 3 Software 4 Getting an account 5 Compiling 6 Queueing system 7 Parallelization
More informationPorting Financial Market Applications to the Cell Broadband Engine Architecture
Porting Financial Market Applications to the Cell Broadband Engine Architecture John Easton, Ingo Meents, Olaf Stephen, Horst Zisgen, Sei Kato Presented By: Kanik Sem Dept of Computer & Information Sciences
More informationExercise Euler Particle System Simulation
Exercise Euler Particle System Simulation Course Code: L3T2H1-57 Cell Ecosystem Solutions Enablement 1 Course Objectives The student should get ideas of how to get in welldefined steps from scalar code
More informationCOMP 635: Seminar on Heterogeneous Processors. Lecture 7: ClearSpeed CSX600 Processor.
COMP 635: Seminar on Heterogeneous Processors Lecture 7: ClearSpeed CSX600 Processor www.cs.rice.edu/~vsarkar/comp635 Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu October
More informationQDP++ on Cell BE WEI WANG. June 8, 2009
QDP++ on Cell BE WEI WANG June 8, 2009 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2009 Abstract The Cell BE provides large peak floating point performance with
More informationProduction. Visual Effects. Fluids, RBD, Cloth. 2. Dynamics Simulation. 4. Compositing
Visual Effects Pr roduction on the Cell/BE Andrew Clinton, Side Effects Software Visual Effects Production 1. Animation Character, Keyframing 2. Dynamics Simulation Fluids, RBD, Cloth 3. Rendering Raytrac
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationAnalyzing the Performance of IWAVE on a Cluster using HPCToolkit
Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationPedraforca: a First ARM + GPU Cluster for HPC
www.bsc.es Pedraforca: a First ARM + GPU Cluster for HPC Nikola Puzovic, Alex Ramirez We ve hit the power wall ALL computers are limited by power consumption Energy-efficient approaches Multi-core Fujitsu
More informationHPC with GPU and its applications from Inspur. Haibo Xie, Ph.D
HPC with GPU and its applications from Inspur Haibo Xie, Ph.D xiehb@inspur.com 2 Agenda I. HPC with GPU II. YITIAN solution and application 3 New Moore s Law 4 HPC? HPC stands for High Heterogeneous Performance
More informationBig Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures
Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid
More informationAmir Khorsandi Spring 2012
Introduction to Amir Khorsandi Spring 2012 History Motivation Architecture Software Environment Power of Parallel lprocessing Conclusion 5/7/2012 9:48 PM ٢ out of 37 5/7/2012 9:48 PM ٣ out of 37 IBM, SCEI/Sony,
More informationCrypto On the Playstation 3
Crypto On the Playstation 3 Neil Costigan School of Computing, DCU. neil.costigan@computing.dcu.ie +353.1.700.6916 PhD student / 2 nd year of research. Supervisor : - Dr Michael Scott. IRCSET funded. Playstation
More informationPost-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem Toshiyuki Shimizu FUJITSU LIMITED Nov. 14th, 2017 Exhibitor Forum, SC17, Nov. 14, 2017 0 Post-K: Building up Arm HPC Ecosystem Fujitsu s approach for HPC Approach
More informationTechnology Trends Presentation For Power Symposium
Technology Trends Presentation For Power Symposium 2006 8-23-06 Darryl Solie, Distinguished Engineer, Chief System Architect IBM Systems & Technology Group From Ingenuity to Impact Copyright IBM Corporation
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationPorting an MPEG-2 Decoder to the Cell Architecture
Porting an MPEG-2 Decoder to the Cell Architecture Troy Brant, Jonathan Clark, Brian Davidson, Nick Merryman Advisor: David Bader College of Computing Georgia Institute of Technology Atlanta, GA 30332-0250
More informationThe University of Texas at Austin
EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin
More informationGeneral Purpose GPU Programming (1) Advanced Operating Systems Lecture 14
General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 Lecture Outline Heterogenous multi-core systems and general purpose GPU programming Programming models Heterogenous multi-kernels
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationAddressing Heterogeneity in Manycore Applications
Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction
More informationParallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU
Parallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU Scott Rostrup and Hans De Sterck Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada Abstract Increasingly,
More informationIntel MIC Programming Workshop, Hardware Overview & Native Execution LRZ,
Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ, 27.6.- 29.6.2016 1 Agenda Intro @ accelerators on HPC Architecture overview of the Intel Xeon Phi Products Programming models Native
More informationAll About the Cell Processor
All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas Acknowledgements Cell is the result of a deep partnership between SCEI/Sony,
More informationSoftware Development Kit for Multicore Acceleration Version 3.0
Software Development Kit for Multicore Acceleration Version 3.0 Programming Tutorial SC33-8410-00 Software Development Kit for Multicore Acceleration Version 3.0 Programming Tutorial SC33-8410-00 Note
More informationOpenMP: Open Multiprocessing
OpenMP: Open Multiprocessing Erik Schnetter May 20-22, 2013, IHPC 2013, Iowa City 2,500 BC: Military Invents Parallelism Outline 1. Basic concepts, hardware architectures 2. OpenMP Programming 3. How to
More informationData Communication and Synchronization
Software Development Kit for Multicore Acceleration Version 3.0 Data Communication and Synchronization for Cell Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8407-00 Software Development
More informationOpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa
OpenMP on the FDSM software distributed shared memory Hiroya Matsuba Yutaka Ishikawa 1 2 Software DSM OpenMP programs usually run on the shared memory computers OpenMP programs work on the distributed
More informationCarlo Cavazzoni, HPC department, CINECA
Introduction to Shared memory architectures Carlo Cavazzoni, HPC department, CINECA Modern Parallel Architectures Two basic architectural scheme: Distributed Memory Shared Memory Now most computers have
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More information