Cell Programming Maciej Cytowski (ICM) PRACE Workshop New Languages & Future Technology Prototypes, March 1-2, LRZ, Germany
2 Agenda
- Introduction to the technology
- Cell programming models:
  - SPE runtime management library (libspe2)
  - OpenMP and the CBE XLC compiler
  - Cell SuperScalar (CellSs)
  - OpenCL
  - T-Platforms Cell Compiler
  - Accelerated Library Framework (ALF)
  - Data Communication and Synchronization (DaCS), hybrid models
- Summary
- FFT on CellSs
3 Introduction to the technology
- Multicore, heterogeneous design: one PPE and 8 SPEs (3.2 GHz)
- SPEs are SIMD cores, each with a 256KB Local Store
- Main Memory (max. 16GB)
- Cell based computing: Cell clusters, Cell + x86 clusters, Cell + POWER clusters (RoadRunner-like systems)
- Novel systems: QPACE supercomputer, desktop Cell based workstations
- QS22 blade
4 ICM Nautilus cluster
- 80 IBM QS22 nodes, 20 IBM LS21 nodes, 4x DDR Infiniband
- Green500: 1st place Nov 08, 1st place Jun 09 (Mflops/Watt)
- PlayStation3 (1 console, 2 guitars, 1 mic, 1 drum set): the unofficial leaders
- Joint Cell Competence Center (IBM & ICM): application enablement on Cell
5 Cell programming models
- Libraries: libspe2, Data Communication and Synchronization (DaCS), Accelerated Library Framework (ALF)
- Single source: CellSs, OpenMP
- Optimization techniques: asynchronous DMA transfers, double-buffering, SIMDization, loop unrolling, memory alignment, assembler optimizations (AsmViz)
- Auto-parallelization: T-Platforms compiler
- Novel standards: OpenCL
6 Performance comparison
How to compare performance of Cell implementations?
- Compare a reference x86 to Cell rather than PPE to Cell: computations on the PPE are usually 2-3x slower than on x86
- Number-of-threads view: 16 SPEs vs 16 x86 cores
- Accelerator view: 16 SPEs vs 1 x86 core, or 16 SPEs + x86 core vs 1 x86 core
7 SPE runtime management library (libspe2)
- The SPE runtime management library (libspe2) implements an SPE thread programming model for Cell BE applications
- It constitutes the standardized low-level application programming interface (API) for application access to the Cell/B.E. SPEs
- libspe2 is used to control SPE program execution from the PPE program
- SPEs are handled as virtual objects called SPE contexts; SPE programs are loaded and executed by operating on SPE contexts
- elfspe is a PPE program that allows an SPE program to run directly from a Linux command prompt, without needing a PPE application to create an SPE thread and wait for it to complete
8 SPE programming (diagram: in the first scheme, a PPE thread sleeps while an SPE thread works; in the second, two PPE threads run alongside a working SPE thread)
9 PPE & SPE Synergistic Programming

PPE Code:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <libspe2.h>

extern spe_program_handle_t hello_spu;

int main(void)
{
    ...
    rc = spe_context_run(speid, &entry, 0, argp, envp, &stop_info);
    ...
}

SPE Code:

#include <stdio.h>

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    printf("hello world!\n");
    return 0;
}
10 PPE scheduling work for multiple SPEs (diagram: one PPE running many PPE threads, each feeding work to its own SPE thread)
11 Libspe2 functionality
- Develop 2 programs: a PPE program and an SPE program
- Programmer must take care of:
  - Implementation of the pthreads scheme
  - Communication: DMA transfers (get, put), mailboxes (short messages)
  - Optimization of the SPE code
- Developing code with libspe2 is difficult and time consuming, but it gives us full control of the Cell application, e.g. implementation of pipeline parallel schemes where each SPE has its own function
12 Example work: periodicity searching algorithms
- We have ported the code used in the OGLE (The Optical Gravitational Lensing Experiment) project using libspe2, achieving a speedup of more than 19x on QS22 against the reference x86 implementation
- Computational task: periodicity searching in observational data
- PPE manages work and I/O; I/O is overlapped with computations
- Additional PPU thread used for computations
13 OpenMP and the CBE XLC compiler
- XL C/C++ for Multicore Acceleration for Linux supports OpenMP:
  cbexlc -o program.exe -qsmp=omp program.c
- Can be used to parallelize simple loops
- Memory-intensive computations on large shared tables achieve low performance due to low single-SPE performance
14 Cell SuperScalar (CellSs)
- Task based programming model: single source, directives, runtime scheduler
- Supported libspe2 & Cell functionality: DMA transfers, mailboxes, SIMD instructions

#pragma css task input(a, b) inout(c)
void matvec(float *a, float *b, float *c)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N/B; j++)
            c[j] = a[i*N/B+j] * b[i];
}

#pragma css start
for (i = 0; i < N; i += B)
    matvec(a + i*N, b, c + i);
#pragma css finish
15 OpenCL (version 0.1.1, December 1, 2009)
- OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices
- Platform requirements: IBM BladeCenter QS22 systems running Fedora 9, or IBM BladeCenter JS23 systems running Red Hat Enterprise Linux 5.3
- Provides a full-profile Power/VMX CPU device as well as an embedded-profile SPU accelerator device
- The maximum number of compute units on an SPU accelerator device is 16; the SPU accelerator device has a maximum local memory size of 256KB
- Special care should be taken during simultaneous use, as both memory and compute resources are shared: global memory is shared between the devices, so memory consumption on one affects the availability on both
16 T-Platforms Cell Compiler
- Single source compiler for C, C++ and Fortran
- Auto-parallelization and auto-vectorization of source code
- The current implementation of Cell Compiler parallelizes loops only
- Benchmarked with the NAS Parallel Benchmarks (Embarrassingly Parallel (EP) and Conjugate Gradient (CG) problems)
- Outperforms the OpenMP implementation (cbexlc)
- Beta version available for testing purposes
- Benchmarks on bigger codes?
17 T-Platforms Cell Compiler: how to use it?

Write your code:

#define N
for (i = 0; i < N; i++)
    c[i] = sin(a[i]) + cos(b[i]);

PPU:
gcc -O3 --fast-math sincos.c -o sincos.ppu -lm
Computation time =

UTLCC:
/opt/utlcc/bin/utlcc -O3 --fast-math sincos.c -o sincos.utlcc -lm -Ws,--trace-paral
TRACE: Try to parallelize loop (#1)
TRACE: Source: somewhere at sincos.c(27:28)
TRACE: SUCCESS
Computation time =

x86:
gcc -O3 --fast-math sincos.c -o sincos.x86 -lm
Computation time =
18 Accelerated Library Framework (ALF)
- Provides a simple user-level programming framework for Cell library developers: task management, data transfer, double buffering, data communication
- Supports the multiple-program-multiple-data (MPMD) parallel programming style, where several programs run on different SPEs at one time
- Supports the scatter/gather model provided by the CBE DMA list operation
- Two implementations: ALF Cell (between PPU and SPU), ALF Hybrid (between x86_64 and PPU)
- Three roles:
  - Application developer: develops programs only at the host level, using the provided ALF libraries
  - Library programmer: uses the ALF API to provide the library interfaces (examples: BLAS, LAPACK)
  - Cell programmer: writes the optimized accelerator code (the computational kernel)
19 ALF Workflow (diagram)
20 Example: SinCos computations

for (i = 0; i < NUM_ROW; i++)
    for (j = 0; j < NUM_COL; j++)
        mat_c[i*NUM_COL+j] = sin(mat_a[i*NUM_COL+j]) + cos(mat_b[i*NUM_COL+j]);

We want to create a library that would enable us to exchange these two loops for a simple library call:

sincosfun_alf(mat_a, mat_b, mat_c, NUM_ROW, NUM_COL);
21 Example: SinCos computations (PPU)

/* This is only the initialization step */
alf_init(NULL, &alf_handle);
alf_query_system_info(alf_handle, ALF_QUERY_NUM_ACCEL, 0, &nspus);
alf_num_instances_set(alf_handle, nspus);
alf_task_desc_create(alf_handle, 0, &task_desc_handle);
/* context size */
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_TSK_CTX_SIZE, 0);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE, sizeof(sizes_t));
/* workblock size (in, out) */
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_IN_BUF_SIZE, H * V * 2 * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_OUT_BUF_SIZE, H * V * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_NUM_DTL_ENTRIES, 32);
/* maximum stack size */
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_MAX_STACK_SIZE, 3*8192);
/* image to be loaded and computational kernel */
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_IMAGE_REF_L, (unsigned long long)"sincosfun_spu");
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_LIBRARY_REF_L, (unsigned long long)"libsincosfun_spu.so");
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_KERNEL_REF_L, (unsigned long long)"comp_kernel");
22 Example: SinCos computations (PPU)

alf_task_create(task_desc_handle, NULL, nspus, 0, 0, &task_handle);
for (i = 0; i < NUM_ROW; i += H) {
    alf_wb_create(task_handle, ALF_WB_SINGLE, 0, &wb_handle);
    /* Create the input buffer */
    alf_wb_dtl_begin(wb_handle, ALF_BUF_IN, 0);
    alf_wb_dtl_entry_add(wb_handle, &mat_a[i*NUM_COL], H * V, ALF_DATA_FLOAT);
    alf_wb_dtl_entry_add(wb_handle, &mat_b[i*NUM_COL], H * V, ALF_DATA_FLOAT);
    alf_wb_dtl_end(wb_handle);
    /* Create the output buffer */
    alf_wb_dtl_begin(wb_handle, ALF_BUF_OUT, 0);
    alf_wb_dtl_entry_add(wb_handle, &mat_c[i*NUM_COL], H * V, ALF_DATA_FLOAT);
    alf_wb_dtl_end(wb_handle);
    /* Add parameters and enqueue the workblock */
    alf_wb_parm_add(wb_handle, (void *)(&sizes), sizeof(sizes), ALF_DATA_BYTE, 0);
    alf_wb_enqueue(wb_handle);
}
/* This starts the execution */
alf_task_finalize(task_handle);
23 Example: SinCos computations (SPU)

int comp_kernel(void *p_task_context, void *p_sizes_context, void *p_input_buffer,
                void *p_output_buffer, void *p_inout_buffer,
                unsigned int current_count, unsigned int total_count)
{
    unsigned int i, cnt;
    vector float *sa, *sb, *sc;
    /* Input and output buffers are available as task parameters */
    sizes_t *size = (sizes_t *) p_sizes_context;
    cnt = size->h * size->v / 4;  /* vectors of 4 floats */
    sa = (vector float *) p_input_buffer;
    sb = sa + cnt;
    sc = (vector float *) p_output_buffer;
    /* SIMD computations, unrolled by four vectors */
    for (i = 0; i < cnt; i += 4) {
        sc[i]   = spu_add(sinf4(sa[i]),   cosf4(sb[i]));
        sc[i+1] = spu_add(sinf4(sa[i+1]), cosf4(sb[i+1]));
        sc[i+2] = spu_add(sinf4(sa[i+2]), cosf4(sb[i+2]));
        sc[i+3] = spu_add(sinf4(sa[i+3]), cosf4(sb[i+3]));
    }
    return 0;
}
24 ALF summary
- Example application time: PPU 2.69s, ALF (8 SPUs) 0.24s (a ~11x speedup)
- ALF also works in a hybrid x86 setup (it is implemented on top of DaCS)
- ALF's primary architectural concepts are the task and the work block
- ALF supports task and data management, such as multiple tasks and double buffering
- ALF is a framework that contains a runtime algorithm for managing workload distribution and execution
- ALF is a framework: "don't call us, we'll call you" - the ALF developer configures the framework, which in turn executes a computational kernel as the task
25 Data Communication and Synchronization (DaCS)
- Developed by IBM for hybrid systems (i.e. RoadRunner)
- A runtime environment providing resource and process management, data communication services, and synchronization services
- Supports heterogeneous computing elements such as PPE and SPE
- Hierarchical structure, not flat like MPI
- Two implementations: DaCS Cell (between PPU and SPU), DaCS Hybrid (between x86_64 and PPU)
- DaCS Hybrid can be layered with MPI, DaCS Cell, libspe2, CellSs, ...
26 DaCS for Cell: Process Management Programming Structure

Host/PPU:
main
  dacs_init
  dacs_reserve_children
  dacs_de_start
  (other DaCS functions)
  dacs_de_wait
  dacs_release_de_list
  dacs_exit
return

Accel/SPU:
main
  dacs_init
  (other DaCS functions)
  dacs_exit
return
27 DaCS for Hybrid
- Two daemons: hdacsd (x86) and adacsd (Cell)
- Hybrid systems are defined with a special configuration file (root privileges); RoadRunner: 1 Cell processor per 1 x86 core; Nautilus: 1 Cell blade per 1 x86 core
- Handles byte-swapping
- Enables many interesting computing models: DaCS for Hybrid + DaCS for Cell, + libspe2, + PPU accelerated libraries, + CellSs
- Communication: rdma get and put commands (overlap computations and communication), mailboxes
28 DaCS for Hybrid: Process Management Programming Structure

Host/x86_64:
main
  dacs_init
  dacs_reserve_children - DACS_DE_CBE
  dacs_de_start
  (other DaCS functions)
  dacs_de_wait
  dacs_release_de_list
  dacs_exit
return

Mid/PPU:
main
  dacs_init
  dacs_reserve_children - DACS_DE_SPE
  dacs_de_start
  (other DaCS functions)
  dacs_de_wait
  dacs_release_de_list
  dacs_exit
return

Accel/SPU:
main
  dacs_init
  (other DaCS functions)
  dacs_exit
return
29 FFTW on DaCS (host side)

N = 4*524288;
num_accel = 1;
data = (fftw_complex *) memalign(128, N*sizeof(fftw_complex));
results = (fftw_complex *) memalign(128, N*sizeof(fftw_complex));

/* init */
dacs_init(DACS_INIT_FLAGS_NONE);
dacs_reserve_children(DACS_DE_CBE, &num_accel, &deid);
dacs_remote_mem_create(data, N*sizeof(fftw_complex), DACS_READ_ONLY, &data_rm);
dacs_remote_mem_create(results, N*sizeof(fftw_complex), DACS_WRITE_ONLY, &results_rm);
dacs_de_start(de_list[0], "fftwh_ppu", NULL, NULL, DACS_PROC_LOCAL_FILE, &pid);
dacs_remote_mem_share(deid, pid, data_rm);
dacs_remote_mem_share(deid, pid, results_rm);

/* Cell computations */
dacs_mailbox_read(&value, deid, pid);
dacs_de_wait(deid, pid, &exit_status);
dacs_remote_mem_destroy(&data_rm);
dacs_release_de_list(num_accel, deid);
dacs_exit();
30 FFTW on DaCS (Cell side)

dacs_init(DACS_INIT_FLAGS_NONE);
dacs_wid_reserve(&wid);
dacs_remote_mem_accept(DACS_DE_PARENT, DACS_PID_PARENT, &data_rm);
dacs_remote_mem_accept(DACS_DE_PARENT, DACS_PID_PARENT, &results_rm);
data_local = (fftw_complex *) memalign(128, N*NFFT*sizeof(fftw_complex));
dacs_get(data_local, data_rm, 0, 2*NFFT*N*sizeof(double), wid, DACS_ORDER_ATTR_NONE, DACS_BYTE_SWAP_DOUBLE_WORD);
dacs_wait(wid);
fftplan = fftw_plan_dft_1d(N*NFFT, data_local, data_local, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(fftplan);
dacs_put(results_rm, 0, data_local, 2*NFFT*N*sizeof(double), wid, DACS_ORDER_ATTR_NONE, DACS_BYTE_SWAP_DOUBLE_WORD);
dacs_wait(wid);
dacs_mailbox_write(&value, DACS_DE_PARENT, DACS_PID_PARENT);
dacs_remote_mem_release(&data_rm);
dacs_remote_mem_release(&results_rm);
dacs_wid_release(&wid);
free(data_local);
dacs_exit();
31 FFTW on DaCS: timings
x86: FFT: s
Cell: FFT: s, Communication time (GE): s
32 GADGET on DaCS
- GADGET is a freely available code for cosmological N-body/SPH simulations on massively parallel computers with distributed memory
- GADGET uses an explicit communication model implemented with the standardized MPI communication interface; it is also one of the PRACE codes
- Goal: accelerate the N-body simulations performed by the GADGET code
- A speedup of 3.9x-4.2x over the reference x86 implementation (accelerator view)
- The MPI structure of the code was not changed; we are able to run it in a multiple-CPU environment (currently tested on a RoadRunner-like machine in the IBM Research Center)
- Work by Tomasz Kłos
33 Summary
- DaCS is my favorite Cell programming model at the moment
- Try many different combinations: DaCS only, or DaCS + (libspe2, CellSs, ALF, OpenMP)
- Future of the Cell processor: developing new Cell hybrid programming techniques, functional decomposition of computations for hybrid systems
34 FFTs on Cell
- Implement and optimize a mod2f radix-4 FFT on the Cell processor with Cell SuperScalar
- Measure performance, produce a diary with development time
35 FFT on CellSs (chart)
36 Diary
Columns: Development Time (mins, hours or days) | Achieved Performance (Mflops, MLUs, sec) | Number of Cores (if applicable) | Dataset used (if applicable) | Comments (What did you do during this porting step? Why? Problems faced, etc.)

- 1h, PPU (10 times): Original version of the code running on the Power Processing Unit of the Cell chip, no optimization applied yet. Very poor performance!
- 8h, SPUs (10 times): First try: porting of the implemented algorithm step by step. Here, the first step was implemented: parallel computation of the sin and cos table. Additionally, the data layout has been changed (now a table of complex double precision numbers is created - good for performance).
- 16h, SPUs (10 times): Parallel implementation of the first two radix-4 iterations. Performance gain is rather poor, also for smaller problems. I need to redesign the porting process.
37 Diary
- 32h, SPUs (10 times): Starting from the beginning. Completely new FFT implementation with CellSs. No SIMDization here. Many barriers for debug.
- 32h, SPUs (10 times): Code was partially SIMDized. Bit reversal is now performed on the fly (DMA put calls to bit-reversed memory addresses) from the SPUs (in parallel).
- 16h, SPUs (10 times): Fully SIMDized code. It should push the performance after some further tuning. Code needs to be further optimized. The computational part is not the most expensive one. Looking for reasons...
38 Diary
- 32h, SPUs (10 times): I've found a reason for the poor performance: a very inefficient way of reversing bits in 32-bit numbers. Changed to a fast bit reversing method. The performance of my FFT starts to look good. Looking for some more optimizations.
- 32h, SPUs (10 times): Code inlining and loop unrolling were added to this version. Both of these optimizations were performed by hand. The performance should be better. Looking for reasons...
- 112h, SPUs (10 times): Some of the FFT symmetry rules were used. Some CellSs parallel tuning was performed. SIMDMath library trigonometric functions were inlined. Many other changes. Assumption: the CellSs version will be used for M>15 due to task-granularity problems for smaller sizes.
39 FFT on CellSs vs. FFTW: best result 2656 Mflops
40 Thank you for your attention
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationSPE Runtime Management Library
SPE Runtime Management Library Version 2.0 CBEA JSRE Series Cell Broadband Engine Architecture Joint Software Reference Environment Series November 11, 2006 Table of Contents 2 Copyright International
More informationIntegrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali
Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers
More informationCELL CULTURE. Sony Computer Entertainment, Application development for the Cell processor. Programming. Developing for the Cell. Programming the Cell
Dmitry Sunagatov, Fotolia Application development for the Cell processor CELL CULTURE The Cell architecπture is finding its way into a vast range of computer systems from huge supercomputers to inauspicious
More informationReconstruction of Trees from Laser Scan Data and further Simulation Topics
Reconstruction of Trees from Laser Scan Data and further Simulation Topics Helmholtz-Research Center, Munich Daniel Ritter http://www10.informatik.uni-erlangen.de Overview 1. Introduction of the Chair
More informationAutoTune Workshop. Michael Gerndt Technische Universität München
AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationMulticore Challenge in Vector Pascal. P Cockshott, Y Gdura
Multicore Challenge in Vector Pascal P Cockshott, Y Gdura N-body Problem Part 1 (Performance on Intel Nehalem ) Introduction Data Structures (1D and 2D layouts) Performance of single thread code Performance
More informationThe Pennsylvania State University. The Graduate School. College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE
The Pennsylvania State University The Graduate School College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE A Thesis in Electrical Engineering by Srijith Rajamohan 2009
More informationMIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011
MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise June 2011 FREE LUNCH IS OVER, CODES HAVE TO MIGRATE! Many existing legacy codes needs to migrate to
More informationOur new HPC-Cluster An overview
Our new HPC-Cluster An overview Christian Hagen Universität Regensburg Regensburg, 15.05.2009 Outline 1 Layout 2 Hardware 3 Software 4 Getting an account 5 Compiling 6 Queueing system 7 Parallelization
More informationPorting Financial Market Applications to the Cell Broadband Engine Architecture
Porting Financial Market Applications to the Cell Broadband Engine Architecture John Easton, Ingo Meents, Olaf Stephen, Horst Zisgen, Sei Kato Presented By: Kanik Sem Dept of Computer & Information Sciences
More informationExercise Euler Particle System Simulation
Exercise Euler Particle System Simulation Course Code: L3T2H1-57 Cell Ecosystem Solutions Enablement 1 Course Objectives The student should get ideas of how to get in welldefined steps from scalar code
More informationCOMP 635: Seminar on Heterogeneous Processors. Lecture 7: ClearSpeed CSX600 Processor.
COMP 635: Seminar on Heterogeneous Processors Lecture 7: ClearSpeed CSX600 Processor www.cs.rice.edu/~vsarkar/comp635 Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu October
More informationQDP++ on Cell BE WEI WANG. June 8, 2009
QDP++ on Cell BE WEI WANG June 8, 2009 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2009 Abstract The Cell BE provides large peak floating point performance with
More informationProduction. Visual Effects. Fluids, RBD, Cloth. 2. Dynamics Simulation. 4. Compositing
Visual Effects Pr roduction on the Cell/BE Andrew Clinton, Side Effects Software Visual Effects Production 1. Animation Character, Keyframing 2. Dynamics Simulation Fluids, RBD, Cloth 3. Rendering Raytrac
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationAnalyzing the Performance of IWAVE on a Cluster using HPCToolkit
Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationPedraforca: a First ARM + GPU Cluster for HPC
www.bsc.es Pedraforca: a First ARM + GPU Cluster for HPC Nikola Puzovic, Alex Ramirez We ve hit the power wall ALL computers are limited by power consumption Energy-efficient approaches Multi-core Fujitsu
More informationHPC with GPU and its applications from Inspur. Haibo Xie, Ph.D
HPC with GPU and its applications from Inspur Haibo Xie, Ph.D xiehb@inspur.com 2 Agenda I. HPC with GPU II. YITIAN solution and application 3 New Moore s Law 4 HPC? HPC stands for High Heterogeneous Performance
More informationBig Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures
Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid
More informationAmir Khorsandi Spring 2012
Introduction to Amir Khorsandi Spring 2012 History Motivation Architecture Software Environment Power of Parallel lprocessing Conclusion 5/7/2012 9:48 PM ٢ out of 37 5/7/2012 9:48 PM ٣ out of 37 IBM, SCEI/Sony,
More informationCrypto On the Playstation 3
Crypto On the Playstation 3 Neil Costigan School of Computing, DCU. neil.costigan@computing.dcu.ie +353.1.700.6916 PhD student / 2 nd year of research. Supervisor : - Dr Michael Scott. IRCSET funded. Playstation
More informationPost-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem Toshiyuki Shimizu FUJITSU LIMITED Nov. 14th, 2017 Exhibitor Forum, SC17, Nov. 14, 2017 0 Post-K: Building up Arm HPC Ecosystem Fujitsu s approach for HPC Approach
More informationTechnology Trends Presentation For Power Symposium
Technology Trends Presentation For Power Symposium 2006 8-23-06 Darryl Solie, Distinguished Engineer, Chief System Architect IBM Systems & Technology Group From Ingenuity to Impact Copyright IBM Corporation
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationPorting an MPEG-2 Decoder to the Cell Architecture
Porting an MPEG-2 Decoder to the Cell Architecture Troy Brant, Jonathan Clark, Brian Davidson, Nick Merryman Advisor: David Bader College of Computing Georgia Institute of Technology Atlanta, GA 30332-0250
More informationThe University of Texas at Austin
EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin
More informationGeneral Purpose GPU Programming (1) Advanced Operating Systems Lecture 14
General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 Lecture Outline Heterogenous multi-core systems and general purpose GPU programming Programming models Heterogenous multi-kernels
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationAddressing Heterogeneity in Manycore Applications
Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction
More informationParallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU
Parallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU Scott Rostrup and Hans De Sterck Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada Abstract Increasingly,
More informationIntel MIC Programming Workshop, Hardware Overview & Native Execution LRZ,
Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ, 27.6.- 29.6.2016 1 Agenda Intro @ accelerators on HPC Architecture overview of the Intel Xeon Phi Products Programming models Native
More informationAll About the Cell Processor
All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas Acknowledgements Cell is the result of a deep partnership between SCEI/Sony,
More informationSoftware Development Kit for Multicore Acceleration Version 3.0
Software Development Kit for Multicore Acceleration Version 3.0 Programming Tutorial SC33-8410-00 Software Development Kit for Multicore Acceleration Version 3.0 Programming Tutorial SC33-8410-00 Note
More informationOpenMP: Open Multiprocessing
OpenMP: Open Multiprocessing Erik Schnetter May 20-22, 2013, IHPC 2013, Iowa City 2,500 BC: Military Invents Parallelism Outline 1. Basic concepts, hardware architectures 2. OpenMP Programming 3. How to
More informationData Communication and Synchronization
Software Development Kit for Multicore Acceleration Version 3.0 Data Communication and Synchronization for Cell Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8407-00 Software Development
More informationOpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa
OpenMP on the FDSM software distributed shared memory Hiroya Matsuba Yutaka Ishikawa 1 2 Software DSM OpenMP programs usually run on the shared memory computers OpenMP programs work on the distributed
More informationCarlo Cavazzoni, HPC department, CINECA
Introduction to Shared memory architectures Carlo Cavazzoni, HPC department, CINECA Modern Parallel Architectures Two basic architectural scheme: Distributed Memory Shared Memory Now most computers have
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More information