Cell Programming Maciej Cytowski (ICM) PRACE Workshop New Languages & Future Technology Prototypes, March 1-2, LRZ, Germany


1 Cell Programming Maciej Cytowski (ICM) PRACE Workshop New Languages & Future Technology Prototypes, March 1-2, LRZ, Germany

2 Agenda Introduction to technology Cell programming models SPE runtime management library (libspe2) OpenMP and CBEXLC compiler Cell SuperScalar (CSs) OpenCL T-Platforms Cell Compiler Accelerated Library Framework (ALF) Data Communication and Synchronization (DaCS) hybrid models Summary FFT on CellSs 2

3 Introduction to technology
Multicore, heterogeneous design: PPE and 8 SPEs (3.2 GHz), SIMD cores, Local Store (256 KB), Main Memory (max. 16 GB)
Cell-based computing: Cell clusters, Cell + x86 clusters, Cell + POWER clusters, RoadRunner-like systems
Novel systems: QPACE supercomputers, desktop Cell-based workstations
QS22 blade

4 ICM Nautilus cluster 80 IBM QS22 nodes 20 IBM LS21 nodes 4x DDR Infiniband Green500 1st place Nov 08 1st place Jun Mflops/Watt PlayStation3 1 console 2 guitars 1 mic 1 drums set Unofficial leaders Joint Cell Competence Center (IBM & ICM) Application enablement on Cell 4

5 Cell programming models
Libraries: libspe2, Data Communication and Synchronization (DaCS), Accelerated Library Framework (ALF)
Single source: CellSs, OpenMP
Optimization techniques: asynchronous DMA transfers, double-buffering, SIMDization, loop unrolling, memory alignment, assembler optimizations (AsmViz)
Auto-parallelization: T-Platforms compiler
Novel standards: OpenCL
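
Double-buffering from the list of optimization techniques can be sketched in portable C. This is only an illustration of the control flow, not Cell code: `fetch_chunk` is a hypothetical stand-in for an asynchronous DMA get (here a plain, synchronous `memcpy`), and `CHUNK` and `scale_double_buffered` are made-up names. On a real SPE the fetch of the next buffer would overlap with the computation on the current one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4 /* elements per transfer; hypothetical, tiny for illustration */

/* Stand-in for an asynchronous DMA "get" into local store (assumption:
 * on real hardware this would be non-blocking and waited on later). */
static void fetch_chunk(float *local, const float *main_mem, size_t count)
{
    memcpy(local, main_mem, count * sizeof(float));
}

/* Scale n floats: request the next chunk before computing on the current one. */
static void scale_double_buffered(const float *in, float *out, size_t n, float factor)
{
    float buf[2][CHUNK];
    size_t cur = 0;

    assert(in != NULL && out != NULL);
    if (n == 0)
        return;

    /* Prefetch the first chunk into buffer 0. */
    fetch_chunk(buf[cur], in, n < CHUNK ? n : CHUNK);

    for (size_t off = 0; off < n; off += CHUNK) {
        size_t count = (n - off < CHUNK) ? n - off : CHUNK;
        size_t next_off = off + CHUNK;

        /* Start fetching the next chunk into the other buffer. */
        if (next_off < n) {
            size_t next_count = (n - next_off < CHUNK) ? n - next_off : CHUNK;
            fetch_chunk(buf[cur ^ 1], in + next_off, next_count);
        }

        /* Compute on the chunk that is already local. */
        for (size_t i = 0; i < count; i++)
            out[off + i] = buf[cur][i] * factor;

        cur ^= 1; /* swap buffers */
    }
}
```

The same two-buffer rotation applies to the output side when results are written back with DMA put.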

6 Performance comparison How to compare performance of Cell implementations? Compare a reference x86 to Cell rather than PPE to Cell Computations on PPE are usually 2-3x slower than x86 Number of threads view? 16 SPEs vs 16 x86 cores Accelerator view 16 SPEs vs 1 x86 core 16 SPEs + x86 core vs 1 x86 core 6

7 SPE runtime management library (libspe2) The SPE runtime management library (libspe2) contains an SPE thread programming model for Cell BE applications Constitutes the standardized low-level application programming interface (API) for application access to the Cell/B.E. SPEs. Libspe2 is used to control SPE program execution from the PPE program Handles SPEs as virtual objects called SPE contexts. SPE programs can be loaded and executed by operating SPE contexts The elfspe is a PPE program that allows an SPE program to run directly from a Linux command prompt without needing a PPE application to create an SPE thread and wait for it to complete. 7

8 SPE programming
[Figure: a PPE thread sleeps while its SPE thread works; alternatively, PPE Thread 0 and PPE Thread 1 on the PPE, with the SPE thread working]

9 PPE & SPE Synergistic Programming
PPE Code:
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <libspe2.h>

    extern spe_program_handle_t hello_spu;

    int main(void)
    {
        ...
        rc = spe_context_run(speid, &entry, 0, argp, envp, &stop_info);
        ...
    }
SPE Code:
    #include <stdio.h>

    int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
    {
        printf("hello world!\n");
        return 0;
    }

10 PPE scheduling work for multiple SPEs
[Figure: PPE Thread 0 and a set of further PPE threads on the PPE; each SPE runs an SPE thread that does the work]

11 Libspe2 functionality
Develop 2 programs: PPE program, SPE program
Programmer must take care of: implementation of the pthreads scheme; communication: DMA transfers (get, put) and mailboxes (short messages); optimization of the SPE code
Developing codes with libspe2 is difficult and time consuming, but it gives us full control of the Cell application, e.g. implementation of pipeline parallel schemes where each SPE has its own function

12 Example work: periodicity searching algorithms We have ported the code used in OGLE (The Optical Gravitational Lensing Experiment) project with the use of libspe2 achieving speedup of more than 19x on QS22 against reference x86 implementation Computational task: periodicity searching in observational data PPE manages work and I/O I/O overlapped with computations Additional PPU thread used for computations 12

13 OpenMP and CBEXLC compiler
XL C/C++ for Multicore Acceleration for Linux supports OpenMP:
    cbexlc -o program.exe -qsmp=omp program.c
Can be used to parallelize simple loops
Memory intensive computations on large shared tables achieve low performance due to low single SPE performance

14 Cell SuperScalar (CellSs)
Task based programming model; single source, directives; runtime scheduler
Supported libspe2 & Cell functionality: DMA transfers, mailboxes, SIMD instructions

    #pragma css task input(a,b) inout(c)
    void matvec(float *a, float *b, float *c)
    {
        int i, j;
        for (i = 0; i < N; i++)
            for (j = 0; j < N/B; j++)
                c[j] = a[i*N/B + j] * b[i];
    }

    #pragma css start
    for (i = 0; i < N; i += B)
        matvec(a + i*N, b, c + i);
    #pragma css finish

15 OpenCL (version 0.1.1), December 1, 2009
OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices.
Platform requirements: IBM BladeCenter QS22 systems running Fedora 9; IBM BladeCenter JS23 systems running Red Hat Enterprise Linux 5.3
Provides a full profile Power/VMX CPU device as well as an embedded profile SPU accelerator device
The maximum number of compute units on an SPU accelerator device is 16.
The SPU accelerator device has a maximum local memory size of 256 KB.
Special care should be taken during simultaneous use, as both memory and compute resources are shared
Global memory is shared between the devices, which means memory consumption on one affects the availability on both

16 T-Platforms Cell Compiler Single source compiler for C, C++ and Fortran Auto-parallelization and auto-vectorization of source code The current implementation of Cell Compiler parallelizes loops only Benchmarked with NAS Parallel Benchmark (Embarrassingly Parallel (EP) and Conjugate Gradient (CG) problem) Outperforms OpenMP implementation (cbexlc) Beta version available for testing purposes Benchmarks on bigger codes? 16

17 T-Platforms Cell Compiler: how to use it?
    gcc -O3 --fast-math sincos.c -o sincos.gcc -lm
Write your code:
    #define N
    for (i=0; i<N; i++) c[i]=sin(a[i])+cos(b[i]);
PPU:
    gcc -O3 --fast-math sincos.c -o sincos.ppu -lm
    Computation time =
UTLCC:
    /opt/utlcc/bin/utlcc -O3 --fast-math sincos.c -o sincos.utlcc -lm -Ws,--trace-paral
    TRACE: Try to parallelize loop (#1)
    TRACE: Source: somewhere at sincos.c(27:28)
    TRACE: SUCCESS
    Computation time =
x86:
    gcc -O3 --fast-math sincos.c -o sincos.x86 -lm
    Computation time =

18 Accelerated Library Framework (ALF)
Provides a simple user-level programming framework for Cell library developers: task management, data transfer, double buffering, data communication
Supports the multiple-program-multiple-data (MPMD) parallel programming style where several programs run on different SPEs at one time
Supports the scatter/gather model provided by CBE DMA list operation
Two implementations: ALF Cell (between PPU and SPU), ALF Hybrid (between x86_64 and PPU)
Application (application developer): develop programs only at the host level; use the provided ALF libraries
Accelerated library (library programmer): use the ALF API to provide the library interfaces
Computational kernel (Cell programmer): write optimized accelerator code
Examples: BLAS, LAPACK

19 ALF Workflow 19

20 Example: SinCos computations
    for (i = 0; i < NUM_ROW; i++)
        for (j = 0; j < NUM_COL; j++)
            mat_c[i*NUM_COL+j] = sin(mat_a[i*NUM_COL+j]) + cos(mat_b[i*NUM_COL+j]);
We want to create a library that lets us replace these two loops with a simple library call:
    sincosfun_alf(mat_a, mat_b, mat_c, NUM_ROW, NUM_COL);

21 Example: SinCos computations (PPU)
This is only the initialization step:
    alf_init(NULL, &alf_handle);
    alf_query_system_info(alf_handle, ALF_QUERY_NUM_ACCEL, 0, &nspus);
    alf_num_instances_set(alf_handle, nspus);
    alf_task_desc_create(alf_handle, 0, &task_desc_handle);
    /* context size */
    alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_TSK_CTX_SIZE, 0);
    alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE, sizeof(sizes_t));
    /* workblock size (in, out) */
    alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_IN_BUF_SIZE, H * V * 2 * sizeof(float));
    alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_OUT_BUF_SIZE, H * V * sizeof(float));
    alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_NUM_DTL_ENTRIES, 32);
    /* maximum stack size */
    alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_MAX_STACK_SIZE, 3*8192);
    /* image to be loaded */
    alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_IMAGE_REF_L, (unsigned long long)"sincosfun_spu");
    alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_LIBRARY_REF_L, (unsigned long long)"libsincosfun_spu.so");
    /* computational kernel */
    alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_KERNEL_REF_L, (unsigned long long)"comp_kernel");

22 Example: SinCos computations (PPU)
    alf_task_create(task_desc_handle, NULL, nspus, 0, 0, &task_handle);
    for (i = 0; i < NUM_ROW; i += H) {
        alf_wb_create(task_handle, ALF_WB_SINGLE, 0, &wb_handle);
        /* Create the input buffer */
        alf_wb_dtl_begin(wb_handle, ALF_BUF_IN, 0);
        alf_wb_dtl_entry_add(wb_handle, &mat_a[i*NUM_COL], H * V, ALF_DATA_FLOAT);
        alf_wb_dtl_entry_add(wb_handle, &mat_b[i*NUM_COL], H * V, ALF_DATA_FLOAT);
        alf_wb_dtl_end(wb_handle);
        /* Create the output buffer */
        alf_wb_dtl_begin(wb_handle, ALF_BUF_OUT, 0);
        alf_wb_dtl_entry_add(wb_handle, &mat_c[i*NUM_COL], H * V, ALF_DATA_FLOAT);
        alf_wb_dtl_end(wb_handle);
        /* Add parameters and enqueue the WB */
        alf_wb_parm_add(wb_handle, (void *) (&sizes), sizeof(sizes), ALF_DATA_BYTE, 0);
        alf_wb_enqueue(wb_handle);
    }
    /* This starts the execution */
    alf_task_finalize(task_handle);

23 Example: SinCos computations (SPU)
    int comp_kernel(void *p_task_context, void *p_sizes_context,
                    void *p_input_buffer, void *p_output_buffer,
                    void *p_inout_buffer,
                    unsigned int current_count, unsigned int total_count)
    {
        unsigned int i, cnt;
        vector float *sa, *sb, *sc;
        sizes_t *size = (sizes_t *) p_sizes_context;

        cnt = size->h * size->v / 4;   // vector of 4

        /* Input and output buffers available as task parameters */
        sa = (vector float *) p_input_buffer;
        sb = sa + cnt;
        sc = (vector float *) p_output_buffer;

        /* SIMD computations */
        for (i = 0; i < cnt; i += 4) {
            sc[i]   = spu_add(sinf4(sa[i]),   cosf4(sb[i]));
            sc[i+1] = spu_add(sinf4(sa[i+1]), cosf4(sb[i+1]));
            sc[i+2] = spu_add(sinf4(sa[i+2]), cosf4(sb[i+2]));
            sc[i+3] = spu_add(sinf4(sa[i+3]), cosf4(sb[i+3]));
        }
        return 0;
    }
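
The SPU kernel unrolls its loop by four, which implicitly assumes that the vector count is a multiple of four. A portable sketch of the same unrolling pattern with an explicit tail loop is shown below; it uses scalar `float` addition instead of the SPU intrinsics, and `add_unrolled4` is a hypothetical name chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* c[i] = a[i] + b[i], unrolled by four, with a scalar tail for the
 * elements left over when n is not a multiple of four. */
static void add_unrolled4(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    size_t main_end = n - (n % 4); /* largest multiple of 4 not above n */

    assert(a != NULL && b != NULL && c != NULL);

    for (; i < main_end; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++) /* tail: at most three iterations */
        c[i] = a[i] + b[i];
}
```

If the workblock size is chosen so that the element count is always a multiple of the unroll factor, the tail loop can be dropped, as the slide's kernel does.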

24 ALF summary
Example application: time PPU: 2.69s; time ALF (8 SPUs): 0.24s
ALF also works on x86 (hybrid setup); it is implemented on DaCS
ALF's primary architectural concepts are task and workblock
ALF supports task and data management such as multiple tasks and double buffering
ALF is a framework that contains a runtime algorithm for managing work load distribution and execution
ALF is a framework: "don't call us, we'll call you"
ALF developer controls the framework, which in turn executes a computational kernel as the task
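
As a quick worked check of the two timings on the ALF summary slide, the speedup is the ratio of the reference time to the accelerated time. The helper below is a hypothetical one-liner, not part of ALF. Note that this ratio compares the PPU with 8 SPUs, not an x86 reference as the performance comparison slide recommends.

```c
#include <assert.h>

/* Speedup = reference time / accelerated time (both in the same unit). */
static double speedup(double t_reference, double t_accelerated)
{
    assert(t_reference > 0.0 && t_accelerated > 0.0);
    return t_reference / t_accelerated;
}
```

With the slide's numbers, `speedup(2.69, 0.24)` is roughly 11.2.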

25 Data Communication and Synchronization (DaCS)
Developed by IBM for hybrid systems (e.g. RoadRunner)
Provides resource and process management, data communication services, and synchronization services: a runtime environment
Supports heterogeneous computing elements such as PPE and SPE
Hierarchical structure, not flat like MPI
Two implementations: DaCS Cell (between PPU and SPU), DaCS Hybrid (between x86_64 and PPU)
[Figure: DaCS Hybrid, MPI; DaCS Cell, libspe2, CellSs..]

26 DaCS for Cell: Process Management Programming Structure
Host/PPU: main, dacs_init, dacs_reserve_children, dacs_de_start, (other DaCS functions), dacs_de_wait, dacs_release_de_list, dacs_exit, return
Accel/SPU: main, dacs_init, (other DaCS functions), dacs_exit, return

27 DaCS for Hybrid
Two daemons: hdacsd (x86) and adacsd (Cell)
Hybrid systems are defined with a special configuration file (root privileges)
RoadRunner: 1 Cell processor - 1 x86 core
Nautilus: 1 Cell blade - 1 x86 core
Handles byte-swapping
Enables many interesting computing models: DaCS for Hybrid + DaCS for Cell; DaCS for Hybrid + libspe2; DaCS for Hybrid + PPU accelerated libraries; DaCS for Hybrid + CellSs
Communication: rdma get and put commands (overlap computations and communication), mailboxes
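
Byte-swapping is needed because the x86 host is little-endian while the PPU is big-endian. The sketch below shows what swapping a double word (8 bytes) amounts to in portable C. It only illustrates the operation; it is not how DaCS implements it, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 64-bit value. */
static uint64_t swap_bytes_u64(uint64_t v)
{
    v = ((v & 0x00FF00FF00FF00FFULL) << 8)  | ((v >> 8)  & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
}

/* Swap the bytes of a double in place, going through its bit pattern. */
static void swap_bytes_double_inplace(double *d)
{
    uint64_t bits;

    assert(d != NULL && sizeof bits == sizeof *d);
    memcpy(&bits, d, sizeof bits);
    bits = swap_bytes_u64(bits);
    memcpy(d, &bits, sizeof bits);
}
```

Applying the swap twice restores the original value, which is a handy sanity check when debugging transfers between the two architectures.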

28 DaCS for Hybrid: Process Management Programming Structure
Host/x86_64: main, dacs_init, dacs_reserve_children - DACS_DE_CBE, dacs_de_start, (other DaCS functions), dacs_de_wait, dacs_release_de_list, dacs_exit, return
Mid/PPU: main, dacs_init, dacs_reserve_children - DACS_DE_SPE, dacs_de_start, (other DaCS functions), dacs_de_wait, dacs_release_de_list, dacs_exit, return
Accel/SPU: main, dacs_init, (other DaCS functions), dacs_exit, return

29 FFTW on DaCS
    N = 4*524288;
    num_accel = 1;
    data = (fftw_complex *) memalign(128, N*sizeof(fftw_complex));
    results = (fftw_complex *) memalign(128, N*sizeof(fftw_complex));
    // init
    dacs_init(DACS_INIT_FLAGS_NONE);
    dacs_reserve_children(DACS_DE_CBE, &num_accel, &deid);
    dacs_remote_mem_create(data, N*sizeof(fftw_complex), DACS_READ_ONLY, &data_rm);
    dacs_remote_mem_create(results, N*sizeof(fftw_complex), DACS_WRITE_ONLY, &results_rm);
    dacs_de_start(de_list[0], "fftwh_ppu", NULL, NULL, DACS_PROC_LOCAL_FILE, &pid);
    dacs_remote_mem_share(deid, pid, data_rm);
    dacs_remote_mem_share(deid, pid, results_rm);
    // Cell computations
    dacs_mailbox_read(&value, deid, pid);
    dacs_de_wait(deid, pid, &exit_status);
    dacs_remote_mem_destroy(&data_rm);
    dacs_release_de_list(num_accel, deid);
    dacs_exit();
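
The host code allocates its buffers on 128-byte boundaries with `memalign`. A minimal sketch of the same idea using the C11 `aligned_alloc` function is given below; `alloc_dma_aligned` and `DMA_ALIGN` are hypothetical names, and rounding the size up is done because `aligned_alloc` expects the size to be a multiple of the alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DMA_ALIGN 128 /* alignment used in the slide's memalign calls */

/* Allocate at least `bytes` bytes on a DMA_ALIGN boundary.
 * Returns NULL on failure; release with free(). */
static void *alloc_dma_aligned(size_t bytes)
{
    size_t rounded;
    void *p;

    if (bytes == 0)
        bytes = 1;
    /* aligned_alloc wants a size that is a multiple of the alignment. */
    rounded = (bytes + DMA_ALIGN - 1) / DMA_ALIGN * DMA_ALIGN;
    p = aligned_alloc(DMA_ALIGN, rounded);
    assert(p == NULL || ((uintptr_t)p % DMA_ALIGN) == 0);
    return p;
}
```

A call such as `alloc_dma_aligned(N * sizeof(fftw_complex))` would play the role of the `memalign(128, ...)` calls above.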

30 FFTW on DaCS
    dacs_init(DACS_INIT_FLAGS_NONE);
    dacs_wid_reserve(&wid);
    dacs_remote_mem_accept(DACS_DE_PARENT, DACS_PID_PARENT, &data_rm);
    dacs_remote_mem_accept(DACS_DE_PARENT, DACS_PID_PARENT, &results_rm);
    data_local = (fftw_complex*) memalign(128, N*nfft*sizeof(fftw_complex));
    dacs_get(data_local, data_rm, 0, 2*nfft*N*sizeof(double), wid,
             DACS_ORDER_ATTR_NONE, DACS_BYTE_SWAP_DOUBLE_WORD);
    dacs_wait(wid);
    fftplan = fftw_plan_dft_1d(N*nfft, data_local, data_local, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(fftplan);
    dacs_put(results_rm, 0, data_local, 2*nfft*N*sizeof(double), wid,
             DACS_ORDER_ATTR_NONE, DACS_BYTE_SWAP_DOUBLE_WORD);
    dacs_wait(wid);
    dacs_mailbox_write(&value, DACS_DE_PARENT, DACS_PID_PARENT);
    dacs_remote_mem_release(&data_rm);
    dacs_remote_mem_release(&results_rm);
    dacs_wid_release(&wid);
    free(data_local);
    dacs_exit();

31 FFTW on DaCS X86: Cell: FFT: s FFT: s Communication time (GE): s 31

32 GADGET on DaCS
GADGET is a freely available code for cosmological N-body/SPH simulations on massively parallel computers with distributed memory.
GADGET uses an explicit communication model that is implemented with the standardized MPI communication interface.
GADGET is also one of the PRACE codes
Goal: accelerate the N-body simulations performed by the GADGET code
A speedup of 3.9x to 4.2x over the reference x86 implementation (accelerator view)
The MPI structure of the code was not changed
We are able to run it in a multi-CPU environment (currently tested on a RoadRunner-like machine in IBM Research Center)
Work by Tomasz Kłos

33 Summary
DaCS is my favorite Cell programming model at the moment
Try many different combinations: only DaCS; DaCS + (libspe2, CellSs, ALF, OpenMP)
Future of Cell processor
Developing new Cell hybrid programming techniques
Functional decomposition of computations for hybrid systems

34 FFTs on Cell Implement and optimize mod2f radix-4 FFT on Cell processor with Cell SuperScalar Measure performance, produce a diary with development time 34
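
For readers unfamiliar with radix-4 FFTs, the basic building block is a 4-point butterfly. The sketch below is a generic, twiddle-free forward 4-point DFT in portable C, not the author's CellSs implementation; the `cplx` type and all function names are hypothetical. A full radix-4 FFT would additionally multiply inputs by twiddle factors and apply this butterfly across several stages.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cplx;

static cplx c_add(cplx a, cplx b) { cplx r = { a.re + b.re, a.im + b.im }; return r; }
static cplx c_sub(cplx a, cplx b) { cplx r = { a.re - b.re, a.im - b.im }; return r; }

/* Multiply by -i: -i * (re + i*im) = im - i*re. No real multiplication needed. */
static cplx c_mul_neg_i(cplx a) { cplx r = { a.im, -a.re }; return r; }

/* Forward 4-point DFT: out[k] = sum over n of in[n] * exp(-2*pi*i*k*n/4). */
static void radix4_butterfly(const cplx in[4], cplx out[4])
{
    cplx s02, d02, s13, d13, r;

    assert(in != NULL && out != NULL);

    s02 = c_add(in[0], in[2]);
    d02 = c_sub(in[0], in[2]);
    s13 = c_add(in[1], in[3]);
    d13 = c_sub(in[1], in[3]);
    r = c_mul_neg_i(d13);      /* -i * (x1 - x3) */

    out[0] = c_add(s02, s13);  /* x0 + x1 + x2 + x3       */
    out[1] = c_add(d02, r);    /* x0 - i*x1 - x2 + i*x3   */
    out[2] = c_sub(s02, s13);  /* x0 - x1 + x2 - x3       */
    out[3] = c_sub(d02, r);    /* x0 + i*x1 - x2 - i*x3   */
}
```

The butterfly uses only additions, subtractions and a swap of real and imaginary parts, which is one reason radix-4 maps well onto SIMD units.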

35 FFT on CellSs 35

36 Diary
Columns: Development Time (mins, hours or days) | Achieved Performance (Mflops, MLUs, sec) | Number of Cores (if applicable) | Dataset used (if applicable) | Comments (What did you do during this porting step? Why? Problems faced, etc.)
1h, PPU (10 times): Original version of the code running on the Power Processing Unit of the Cell chip, no optimization applied yet. Very poor performance!
8h, SPUs (10 times): First try: porting of the implemented algorithm step by step. Here, the first step was implemented: parallel computations of sin and cos table. Additionally the data layout has been changed (now a table of complex double precision numbers is created, good for performance).
16h, SPUs (10 times): Parallel implementation of the first two radix-4 iterations. Performance gain is rather poor, also for smaller problems. I need to redesign the porting process.

37 Diary
32h, SPUs (10 times): Starting from the beginning. Completely new FFT implementation with CellSs. No SIMDization here. Many barriers for debug.
32h, SPUs (10 times): Code was partially SIMDized. Bit reversal is now performed on the fly (DMA put calls to bit-reversed memory addresses) from SPUs (in parallel).
16h, SPUs (10 times): Fully SIMDized code. It should push the performance after some further tuning. Code needs to be further optimized. The computational part is not the most expensive one. Looking for reasons.

38 Diary
32h, SPUs (10 times): I've found a reason for the poor performance: a very inefficient way of reversing bits in 32-bit numbers. Changed to a fast bit reversing method. The performance of my FFT starts to look good. Looking for some more optimizations.
32h, SPUs (10 times): Code inlining and loop unrolling were added to this version. Both of these optimizations were performed by hand. The performance should be better. Looking for reasons...
112h, SPUs (10 times): Some of the FFT symmetry rules were used. Some CellSs parallel tuning was performed. SIMDMath library trigonometric functions were inlined. Many other changes. Assumption: the CellSs version will be used for M>15 due to task granularity problems for smaller sizes.
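
The diary reports that replacing an inefficient 32-bit bit reversal with a fast method fixed a performance problem. The diary does not say which method was used; the sketch below contrasts a naive bit-by-bit loop with one well-known fast variant (swapping bit groups with masks). All function names are hypothetical, and `fft_reversed_index` assumes an FFT of size 2^m.

```c
#include <assert.h>
#include <stdint.h>

/* Naive reversal: one bit per iteration, 32 iterations. */
static uint32_t reverse_bits_slow(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Fast reversal: swap neighbouring bits, then pairs, nibbles, bytes, halves. */
static uint32_t reverse_bits_fast(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Bit-reversed index of i for an FFT of size 2^m (1 <= m <= 32). */
static uint32_t fft_reversed_index(uint32_t i, unsigned m)
{
    assert(m >= 1 && m <= 32);
    return reverse_bits_fast(i) >> (32 - m);
}
```

With on-the-fly bit reversal as described in the diary, an index like `fft_reversed_index(i, m)` would determine the destination address of each DMA put.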

39 FFT on CellSs vs. FFTW Best result: 2656 MFlops 39

40 Thank you for your attention 40


More information

Mixed MPI-OpenMP EUROBEN kernels

Mixed MPI-OpenMP EUROBEN kernels Mixed MPI-OpenMP EUROBEN kernels Filippo Spiga ( on behalf of CINECA ) PRACE Workshop New Languages & Future Technology Prototypes, March 1-2, LRZ, Germany Outline Short kernel description MPI and OpenMP

More information

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller

More information

C6000 Compiler Roadmap

C6000 Compiler Roadmap C6000 Compiler Roadmap CGT v7.4 CGT v7.3 CGT v7. CGT v8.0 CGT C6x v8. CGT Longer Term In Development Production Early Adopter Future CGT v7.2 reactive Current 3H2 4H 4H2 H H2 Future CGT C6x v7.3 Control

More information

Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou. University of Maryland Baltimore County

Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou. University of Maryland Baltimore County Accelerating a climate physics model with OpenCL Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou University of Maryland Baltimore County Introduction The demand to increase forecast predictability

More information

Programming IBM PowerXcell 8i / QS22 libspe2, ALF, DaCS

Programming IBM PowerXcell 8i / QS22 libspe2, ALF, DaCS Programming IBM PowerXcell 8i / QS22 libspe2, ALF, DaCS Jordi Caubet. IBM Spain. IBM Innovation Initiative @ BSC-CNS ScicomP 15 18-22 May 2009, Barcelona, Spain 1 Overview SPE Runtime Management Library

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

IDE Tutorial and User s Guide

IDE Tutorial and User s Guide Software Development Kit for Multicore Acceleration Version 3.1 IDE Tutorial and User s Guide SC34-2561-00 Software Development Kit for Multicore Acceleration Version 3.1 IDE Tutorial and User s Guide

More information

Hands-on - DMA Transfer Using get and put Buffer

Hands-on - DMA Transfer Using get and put Buffer IBM Systems & Technology Group Cell/Quasar Ecosystem & Solutions Enablement Hands-on - DMA Transfer Using get and put Buffer Cell Programming Workshop Cell/Quasar Ecosystem & Solutions Enablement 1 Class

More information

CSC573: TSHA Introduction to Accelerators

CSC573: TSHA Introduction to Accelerators CSC573: TSHA Introduction to Accelerators Sreepathi Pai September 5, 2017 URCS Outline Introduction to Accelerators GPU Architectures GPU Programming Models Outline Introduction to Accelerators GPU Architectures

More information

Parallel Systems. Project topics

Parallel Systems. Project topics Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a

More information

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast

More information

The Art of Parallel Processing

The Art of Parallel Processing The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a

More information

SPE Runtime Management Library

SPE Runtime Management Library SPE Runtime Management Library Version 2.0 CBEA JSRE Series Cell Broadband Engine Architecture Joint Software Reference Environment Series November 11, 2006 Table of Contents 2 Copyright International

More information

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers

More information

CELL CULTURE. Sony Computer Entertainment, Application development for the Cell processor. Programming. Developing for the Cell. Programming the Cell

CELL CULTURE. Sony Computer Entertainment, Application development for the Cell processor. Programming. Developing for the Cell. Programming the Cell Dmitry Sunagatov, Fotolia Application development for the Cell processor CELL CULTURE The Cell architecπture is finding its way into a vast range of computer systems from huge supercomputers to inauspicious

More information

Reconstruction of Trees from Laser Scan Data and further Simulation Topics

Reconstruction of Trees from Laser Scan Data and further Simulation Topics Reconstruction of Trees from Laser Scan Data and further Simulation Topics Helmholtz-Research Center, Munich Daniel Ritter http://www10.informatik.uni-erlangen.de Overview 1. Introduction of the Chair

More information

AutoTune Workshop. Michael Gerndt Technische Universität München

AutoTune Workshop. Michael Gerndt Technische Universität München AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Multicore Challenge in Vector Pascal. P Cockshott, Y Gdura

Multicore Challenge in Vector Pascal. P Cockshott, Y Gdura Multicore Challenge in Vector Pascal P Cockshott, Y Gdura N-body Problem Part 1 (Performance on Intel Nehalem ) Introduction Data Structures (1D and 2D layouts) Performance of single thread code Performance

More information

The Pennsylvania State University. The Graduate School. College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE

The Pennsylvania State University. The Graduate School. College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE The Pennsylvania State University The Graduate School College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE A Thesis in Electrical Engineering by Srijith Rajamohan 2009

More information

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011 MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise June 2011 FREE LUNCH IS OVER, CODES HAVE TO MIGRATE! Many existing legacy codes needs to migrate to

More information

Our new HPC-Cluster An overview

Our new HPC-Cluster An overview Our new HPC-Cluster An overview Christian Hagen Universität Regensburg Regensburg, 15.05.2009 Outline 1 Layout 2 Hardware 3 Software 4 Getting an account 5 Compiling 6 Queueing system 7 Parallelization

More information

Porting Financial Market Applications to the Cell Broadband Engine Architecture

Porting Financial Market Applications to the Cell Broadband Engine Architecture Porting Financial Market Applications to the Cell Broadband Engine Architecture John Easton, Ingo Meents, Olaf Stephen, Horst Zisgen, Sei Kato Presented By: Kanik Sem Dept of Computer & Information Sciences

More information

Exercise Euler Particle System Simulation

Exercise Euler Particle System Simulation Exercise Euler Particle System Simulation Course Code: L3T2H1-57 Cell Ecosystem Solutions Enablement 1 Course Objectives The student should get ideas of how to get in welldefined steps from scalar code

More information

COMP 635: Seminar on Heterogeneous Processors. Lecture 7: ClearSpeed CSX600 Processor.

COMP 635: Seminar on Heterogeneous Processors. Lecture 7: ClearSpeed CSX600 Processor. COMP 635: Seminar on Heterogeneous Processors Lecture 7: ClearSpeed CSX600 Processor www.cs.rice.edu/~vsarkar/comp635 Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu October

More information

QDP++ on Cell BE WEI WANG. June 8, 2009

QDP++ on Cell BE WEI WANG. June 8, 2009 QDP++ on Cell BE WEI WANG June 8, 2009 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2009 Abstract The Cell BE provides large peak floating point performance with

More information

Production. Visual Effects. Fluids, RBD, Cloth. 2. Dynamics Simulation. 4. Compositing

Production. Visual Effects. Fluids, RBD, Cloth. 2. Dynamics Simulation. 4. Compositing Visual Effects Pr roduction on the Cell/BE Andrew Clinton, Side Effects Software Visual Effects Production 1. Animation Character, Keyframing 2. Dynamics Simulation Fluids, RBD, Cloth 3. Rendering Raytrac

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Pedraforca: a First ARM + GPU Cluster for HPC

Pedraforca: a First ARM + GPU Cluster for HPC www.bsc.es Pedraforca: a First ARM + GPU Cluster for HPC Nikola Puzovic, Alex Ramirez We ve hit the power wall ALL computers are limited by power consumption Energy-efficient approaches Multi-core Fujitsu

More information

HPC with GPU and its applications from Inspur. Haibo Xie, Ph.D

HPC with GPU and its applications from Inspur. Haibo Xie, Ph.D HPC with GPU and its applications from Inspur Haibo Xie, Ph.D xiehb@inspur.com 2 Agenda I. HPC with GPU II. YITIAN solution and application 3 New Moore s Law 4 HPC? HPC stands for High Heterogeneous Performance

More information

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid

More information

Amir Khorsandi Spring 2012

Amir Khorsandi Spring 2012 Introduction to Amir Khorsandi Spring 2012 History Motivation Architecture Software Environment Power of Parallel lprocessing Conclusion 5/7/2012 9:48 PM ٢ out of 37 5/7/2012 9:48 PM ٣ out of 37 IBM, SCEI/Sony,

More information

Crypto On the Playstation 3

Crypto On the Playstation 3 Crypto On the Playstation 3 Neil Costigan School of Computing, DCU. neil.costigan@computing.dcu.ie +353.1.700.6916 PhD student / 2 nd year of research. Supervisor : - Dr Michael Scott. IRCSET funded. Playstation

More information

Post-K: Building the Arm HPC Ecosystem

Post-K: Building the Arm HPC Ecosystem Post-K: Building the Arm HPC Ecosystem Toshiyuki Shimizu FUJITSU LIMITED Nov. 14th, 2017 Exhibitor Forum, SC17, Nov. 14, 2017 0 Post-K: Building up Arm HPC Ecosystem Fujitsu s approach for HPC Approach

More information

Technology Trends Presentation For Power Symposium

Technology Trends Presentation For Power Symposium Technology Trends Presentation For Power Symposium 2006 8-23-06 Darryl Solie, Distinguished Engineer, Chief System Architect IBM Systems & Technology Group From Ingenuity to Impact Copyright IBM Corporation

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

Porting an MPEG-2 Decoder to the Cell Architecture

Porting an MPEG-2 Decoder to the Cell Architecture Porting an MPEG-2 Decoder to the Cell Architecture Troy Brant, Jonathan Clark, Brian Davidson, Nick Merryman Advisor: David Bader College of Computing Georgia Institute of Technology Atlanta, GA 30332-0250

More information

The University of Texas at Austin

The University of Texas at Austin EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin

More information

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 Lecture Outline Heterogenous multi-core systems and general purpose GPU programming Programming models Heterogenous multi-kernels

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

Addressing Heterogeneity in Manycore Applications

Addressing Heterogeneity in Manycore Applications Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction

More information

Parallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU

Parallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU Parallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU Scott Rostrup and Hans De Sterck Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada Abstract Increasingly,

More information

Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ,

Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ, Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ, 27.6.- 29.6.2016 1 Agenda Intro @ accelerators on HPC Architecture overview of the Intel Xeon Phi Products Programming models Native

More information

All About the Cell Processor

All About the Cell Processor All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas Acknowledgements Cell is the result of a deep partnership between SCEI/Sony,

More information

Software Development Kit for Multicore Acceleration Version 3.0

Software Development Kit for Multicore Acceleration Version 3.0 Software Development Kit for Multicore Acceleration Version 3.0 Programming Tutorial SC33-8410-00 Software Development Kit for Multicore Acceleration Version 3.0 Programming Tutorial SC33-8410-00 Note

More information

OpenMP: Open Multiprocessing

OpenMP: Open Multiprocessing OpenMP: Open Multiprocessing Erik Schnetter May 20-22, 2013, IHPC 2013, Iowa City 2,500 BC: Military Invents Parallelism Outline 1. Basic concepts, hardware architectures 2. OpenMP Programming 3. How to

More information

Data Communication and Synchronization

Data Communication and Synchronization Software Development Kit for Multicore Acceleration Version 3.0 Data Communication and Synchronization for Cell Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8407-00 Software Development

More information

OpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa

OpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa OpenMP on the FDSM software distributed shared memory Hiroya Matsuba Yutaka Ishikawa 1 2 Software DSM OpenMP programs usually run on the shared memory computers OpenMP programs work on the distributed

More information

Carlo Cavazzoni, HPC department, CINECA

Carlo Cavazzoni, HPC department, CINECA Introduction to Shared memory architectures Carlo Cavazzoni, HPC department, CINECA Modern Parallel Architectures Two basic architectural scheme: Distributed Memory Shared Memory Now most computers have

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information