Programming for Performance on the Cell BE processor & Experiences at SSSU. Sri Sathya Sai University


1 Programming for Performance on the Cell BE processor & Experiences at SSSU Sri Sathya Sai University

2 THE STI CELL PROCESSOR The inevitable shift to the era of multi-core computing. The 9-core Cell microprocessor was developed jointly by IBM, Sony and Toshiba. It is a novel heterogeneous architecture: 1 control-intensive 64-bit PowerPC core + 8 compute-intensive SIMD co-processors (SPEs), a supercomputer on a chip. Cell highlights: peak speeds of 204.8 Gflop/s in SP and 14.6 Gflop/s in DP (SPEs only); 204.8 GB/s internal EIB bandwidth; 25.6 GB/s memory bandwidth; 76.8 GB/s I/O bandwidth.

3 MPI on Cell BE

4 CELL FOR HPC - THE CHALLENGE Cell has tremendous potential for scientific computation and has generated a lot of interest in the HPC community (source: "The Potential of the Cell Processor for Scientific Computing", Proceedings of the ACM International Conference on Computing Frontiers, 2006). Software has to be rewritten to use the multiple cores effectively, and applications need significant changes to fully exploit the novel architecture. For Cell to be a success, software adoption is the key. Cell is a challenging environment for software development: unconventional architecture of the SPEs, small size (256 KB) of the SPE Local Store (LS), Cell-specific programming models, and control-plane vs. data-plane processing. Is there a way to utilize the enormous potential of the Cell, and at the same time avoid the painful process of rewriting software?

5 OUR SOLUTION Implementation of MPI-1, which enables running a large number of scientific applications without much effort. Salient features of our implementation: Cell is treated as an 8-node SMP (Intra-Cell MPI); 16 SPEs can be used on a Cell Blade, but NUMA aspects then come into play; core features of MPI-1 have been implemented. The results we obtained by running on a 3.2 GHz Cell Blade are encouraging.

6 Intra-Cell MPI Design Choices Cell features: in-order execution, but DMAs can be out of order; over 100 simultaneous DMAs can be in flight. Constraints: unconventional, heterogeneous architecture; SPEs have limited functionality and can act directly only on their local stores; SPEs access main memory through DMA; use of the PPE should be limited to get good performance. MPI design choices: application data in (i) local store or (ii) main memory; MPI data in (i) local store or (ii) main memory; PPE involvement (i) active or (ii) only during initialization and finalization; collective calls can (i) synchronize or (ii) not synchronize.

7 Blocking Point-to-Point Communication Blocking point-to-point communication calls form the core component of MPI; collective communication calls can be built on top of them. Calls implemented: MPI_Send, MPI_Recv, MPI_Bsend, MPI_Ssend, MPI_Rsend.

8 Fully SPE-Centric Approach with Data in Main Memory Application data is present in main memory. N message buffers are located in main memory; meta-data buffers are located in the local stores. The SPEs perform the buffer management and interact with the PPE only during initialization. SPEs move the message from the sender's buffer to the receiver's buffer via buffers in the local store. Meta-data entry fields: Location, Tag, Datatype, Size, Flag.

9 DESIGN ISSUES Lock-Free Data Structure The local store is single-ported: at a given clock cycle, either a DMA operation or the load/store unit of the SPE can access it, but not both. DMA writes are in units of 128 bytes. The meta-data entry is less than 128 bytes and will therefore be seen in full, or not seen at all.
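A minimal sketch of what such a meta-data entry might look like in C. The field names Location, Tag, Datatype, Size and Flag come from the slide above, while the exact types and the 128-byte alignment attribute are illustrative assumptions:

/* Hypothetical meta-data entry for the lock-free scheme described above.
   Alignment to 128 bytes means the entry is written by a single DMA and
   is therefore seen either in full or not at all.                        */
typedef struct md_entry {
    unsigned long long location;    /* effective address of the message buffer */
    int                tag;         /* MPI message tag                          */
    int                datatype;    /* encoded MPI datatype                     */
    unsigned int       size;        /* message size in bytes                    */
    volatile unsigned int flag;     /* written last; marks the entry as valid   */
} __attribute__((aligned(128))) md_entry_t;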

10 Communication Performance The best latency obtained is 0.41 µs. For a 0-byte message, the send involves one DMA for the meta-data transfer and the receive operation does one DMA to signal the sender, which accounts for half of the latency. The bandwidth obtained was just above 6 GB/s; in the presence of congestion the peak bandwidth dropped to 4.48 GB/s.

11 Comparison of the performance of MPI on Cell with other MPI implementations

MPI/Platform        Latency (0 byte)   Maximum throughput
Cell                0.41 µs            6.01 GB/s
Cell (congested)    NA                 4.48 GB/s
Cell (small)        0.65 µs            -
Nemesis/Xeon        1.1 µs             0.65 GB/s
Shm/Xeon            1.3 µs             0.5 GB/s
Open MPI/Xeon       2.8 µs             0.5 GB/s
Nemesis/Opteron     ~0.35 µs           ~1.5 GB/s
Open MPI/Opteron    0.7 µs             1.0 GB/s
TMPI/Origin         -                  -

12 Application Performance Achieved double-precision throughput of 5.85 Gflop/s for matrices of size 512 and 5.66 Gflop/s for matrices of size 1024 in matrix multiplication. Achieved a peak double-precision throughput of 7.8 Gflop/s for matrices of size 1024 in matrix-vector multiplication. While this is low compared to the peak performance of 14.6 Gflop/s, the Cell still gives better results than an Opteron processor for the same algorithm: a 2 GHz dual-processor (not dual-core) Opteron with Gigabit interconnect yielded 40 Mflop/s for the same algorithm with one worker processor, and 210 Mflop/s with seven worker processors (the same count as for the Cell). [Figure: comparison of the speedup obtained for DP matrix-vector multiplication on Cell SPEs with that on an SMP of 2 GHz dual-processor Opterons]

13 Collective Communication Calls Barrier, Broadcast, Scatter, Gather, Allgather, Alltoall

14 Factors required in optimizing the collective calls are: (a) assigning effective unique IDs (ranks), and (b) using efficient algorithms. Work done: Effective ranks: experiments were performed on the EIB to understand the reasons for bandwidth reduction, and based on these findings several ranking schemes were studied to obtain effective ranks. Efficient algorithms: several algorithms were implemented in order to choose the best one among them. Effective ranks + efficient algorithm = optimized collective call.

15 Summary The Cell processor has good potential for MPI implementations. The PPE should have a limited role. High bandwidth and low latency are achievable even with application data in main memory, but the local store should be used effectively, with double buffering to hide latency; main memory bandwidth is then the bottleneck. Thread-SPE affinity plays an important role in optimizing the performance of collective calls on the intra-Cell BE.

16 Programming for Performance on the Cell BE processor 09/26/08

17 Outline Introduction to Cell BE architecture IBM Software Development Kit Programming on Cell BE o Programming the SPE o Programming the communication between PPE and SPE Optimizations specific to the SPE Performance Analysis Tools o Static profiling o Dynamic profiling

18 Overview of the Cell Broadband Engine One Power Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). Heterogeneous Cell multi-core architecture: the Power Processor Element handles control tasks, while the Synergistic Processor Elements handle data-intensive processing. Each Synergistic Processor Element (SPE) contains a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC) responsible for data movement and synchronization and for interfacing to the high-performance Element Interconnect Bus (EIB). Source: some of the figures and material on Cell BE used in this presentation are collected from various workshop presentations, web pages and books of IBM. Sri Sathya Sai University, Prasanthi Nilayam

19 How Cell BE addresses these Walls Increased efficiency and performance. Attacks on the Power Wall: non-homogeneous coherent multiprocessor; high design frequency at a low operating voltage with advanced power management (45 nm: 0.8 V, 50 W, 3.2 GHz). Attacks on the Memory Wall: streaming DMA architecture; 128 simultaneous transfers between the eight SPE local stores and main storage; 3-level memory model (main storage, local storage, register files). Attacks on the Frequency Wall: highly optimized implementation; large shared register files and software-controlled branching to allow deeper pipelines. Sri Sathya Sai University, Prasanthi Nilayam

20 Micro-architectural Decisions Large shared register file; local store size tradeoffs; dual-issue, in-order execution; software branch prediction; SPE channel interface. Sri Sathya Sai University, Prasanthi Nilayam

21 Cell BE Peak Performance In single precision, in one cycle each SPE can: o process a four-element vector, o perform two operations on each element. 8 (SPEs) x 3.2 GHz x 4 (four 32-bit words in a vector) x 2 (multiply-adds counted as 2 operations) = 204.8 SP Gflop/s. Each SPE is capable of 25.6 SP Gflop/s. In one cycle the VMX on the PPE can: process a four-element vector, and perform two operations on each element. 4 x 2 x 3.2 GHz = 25.6 SP Gflop/s. Sri Sathya Sai University, Prasanthi Nilayam

22 Cell BE Peak Performance In double precision, every seven cycles each SPE can: o process a two-element vector, o perform two operations on each element. 8 (SPEs) x 3.2 GHz x 2 (two 64-bit doublewords in a vector) x 2 (multiply-adds counted as 2 operations) / 7 = 14.6 DP Gflop/s. In one cycle the FPU on the PPE can process one element and perform two operations on the element: 2 x 3.2 GHz = 6.4 Gflop/s.

23 Cell BE Programming Near theoretical-maximum performance is attainable on real applications, but you need to be aware of the architectural characteristics: multiple heterogeneous execution units, SIMD, limited local store, software-managed cache, memory access latencies, dual-issue rules, large and wide register files, quad-word memory accesses, branch prediction and synchronization facilities.

24 Programmer Experience [Figure: Cell software environment stack] Components include: code development tools, samples, workloads and demos; debug tools; the SPE management library and application libraries; performance and miscellaneous tools; Linux PPC64 with Cell extensions and a hypervisor; a hardware or system-level simulator; and standards (language extensions, ABI). Sri Sathya Sai University, Prasanthi Nilayam

25 Simulator Overview [Figure: software stack running on SystemSim] The development environment comprises programming tools, programming models (OpenMP, MPI), compilers, executables, and runtime libraries on top of system software (hypervisor, Linux/PPC or K42). SystemSim simulates the Cell BE hardware (ROM, disks, DMA, UART, L1/L2 caches, interrupt controller, memory bus) and runs on real host systems (PowerPC, Intel x86, x86-64) under BE Linux (Fedora Core 7), with a GUI, console windows and traces. Source: IBM workshop presentations

26 Cell Simulator Environment Execution Environment Source: IBM Georgia Tech workshop presentations

27 Cell BE Programming Practices Programming the PPE is straightforward: it is multi-threaded and has SIMD (VMX) support. Offload as much computation onto the SPEs as possible, and use the PPE as a control processor...

28 How to create SPE threads Basic steps in a typical PPE/SPE program flow: 1. The PPE program loads the SPE program into the LS. 2. The PPE program instructs the SPE to execute the SPE program. 3. The SPE program transfers the required data from main memory to the LS. 4. The SPE program processes the data in the LS in accordance with the requirements. 5. The SPE program transfers the processed result from the LS to main memory. 6. The SPE program notifies the PPE program of the termination of processing. (source: Cell Programming Primer)

29 SPE program execution The SPE program is invoked from the PPE program as follows: 1. Open the SPE program image. spe_program_handle_t *spe_image_open(const char *filename); 2. Create the SPE context. spe_context_ptr_t spe_context_create(unsigned int flags, spe_gang_context_ptr_t gang); 3. Load the SPE program into the LS. int spe_program_load(spe_context_ptr_t spe, spe_program_handle_t *program); 4. Execute the loaded SPE program. int spe_context_run(spe_context_ptr_t spe, unsigned int *entry, unsigned int runflags, void *argp, void *envp, spe_stop_info_t *stopinfo); 5. Destroy the SPE context. int spe_context_destroy(spe_context_ptr_t spe); 6. Close the SPE program image. int spe_image_close(spe_program_handle_t *program); Sri Sathya Sai University, Prasanthi Nilayam
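Putting these calls together, a minimal PPE-side sketch of this sequence might look as follows; error handling is omitted and "spe_kernel" is a hypothetical name for the embedded SPE executable:

#include <libspe2.h>

int main(void)
{
    spe_program_handle_t *prog;
    spe_context_ptr_t ctx;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_stop_info_t stop_info;

    prog = spe_image_open("spe_kernel");     /* 1. open the SPE program image       */
    ctx  = spe_context_create(0, NULL);      /* 2. create the SPE context           */
    spe_program_load(ctx, prog);             /* 3. load the program into the LS     */
    spe_context_run(ctx, &entry, 0,          /* 4. run (blocks until the SPE stops) */
                    NULL, NULL, &stop_info);
    spe_context_destroy(ctx);                /* 5. destroy the SPE context          */
    spe_image_close(prog);                   /* 6. close the SPE program image      */
    return 0;
}

Since spe_context_run blocks the calling thread, a real application usually runs each SPE context from its own PPE pthread so that several SPEs execute concurrently.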

30 What is different about the BE architecture? The memory subsystem. SPE memory architecture: o each SPE has its own flat local memory o management of this memory is by explicit software control o code and data are moved into and out of this memory by DMA o programming the DMA is explicit in SPE code (or in PPC code). Many DMA transactions can be in flight simultaneously: o e.g. each SPE can have 16 simultaneous outstanding DMA requests o a DMA request can be a list of DMA requests o DMA latencies can be hidden using multiple buffers and loop blocking in code. This is in contrast to traditional, hierarchical memory architectures that support few simultaneous memory transactions. Implications for programming the BE: applications must be partitioned across the processing elements, taking into account the limited local memory available to each SPE.

31 Cell's Primary Communication Mechanisms DMA transfers, mailbox messages, and signal notification. All three are implemented and controlled by the SPE's MFC. The MFC DMA queue has 16 entries for the SPU and 8 entries for the PPU. [Figure: MFC block diagram showing the local store, SXU, DMA engine, atomic facility, MMU, DMA queue, RMT, MMIO registers, and the data, snoop and control buses]

32 MFC Commands MFC commands are the main mechanism for SPUs to access main storage (DMA commands) and to maintain synchronization with other processors and devices in the system (synchronization commands). MFC commands can be issued by the SPU via its MFC: code running on the SPU issues an MFC command by executing a series of writes and/or reads using the channel instructions read channel (rdch), write channel (wrch), and read channel count (rchcnt). The PPE and other devices can also issue MFC commands: code running on the PPE or other devices issues an MFC command by performing a series of stores and/or loads to memory-mapped I/O (MMIO) registers in the MFC. MFC commands are queued in one of two independent MFC command queues: the MFC SPU Command Queue, for channel-initiated commands by the associated SPU, and the MFC Proxy Command Queue, for MMIO-initiated commands by the PPE or other devices.

33 DMA Commands MFC commands that transfer data are referred to as DMA commands. The transfer direction for DMA commands is referenced from the SPE. Into an SPE (from main storage to local store): mfc_get(lsaddr, ea, size, tag_id, tid, rid);
  lsaddr :- target address in SPU local store for the fetched data (SPU local address)
  ea     :- effective address from which data is fetched (global address)
  size   :- transfer size in bytes
  tag_id :- tag-group identifier
  tid    :- transfer-class id
  rid    :- replacement-class id
Out of an SPE (from local store to main storage): mfc_put

34 DMA Commands Three interfaces are available: channel control intrinsics (e.g. spu_writech), composite intrinsics (e.g. spu_mfcdma32), and MFC commands (e.g. mfc_get), the latter defined as macros in spu_mfcio.h. For details see: SPU C/C++ Language Extensions

35 DMA Characteristics DMA transfers o transfer sizes can be 1, 2, 4, 8, and n*16 bytes (n integer) o maximum is 16KB per DMA transfer o 128B alignment is preferable DMA command queues per SPU o 16-element queue for SPU-initiated requests o 8-element queue for PPE-initiated requests o SPU-initiated DMA is always preferable DMA tags o each DMA command is tagged with a 5-bit identifier o same identifier can be used for multiple commands o tags used for polling status or waiting on completion of DMA commands DMA lists o a single DMA command can cause execution of a list of transfer requests (in LS) o lists implement scatter-gather functions o a list can contain up to 2K transfer requests
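As a rough illustration of the scatter-gather DMA lists mentioned above, the sketch below builds a small list of transfers and issues it with a single mfc_getl command. It assumes the mfc_list_element_t layout and the mfc_getl macro from spu_mfcio.h; the chunk count, chunk size and function name are illustrative:

#include <stdint.h>
#include <spu_mfcio.h>

#define NCHUNKS    4
#define CHUNK_SIZE 4096

/* Destination buffer and the DMA list itself both live in the local store. */
static volatile char buf[NCHUNKS * CHUNK_SIZE] __attribute__((aligned(128)));
static mfc_list_element_t dma_list[NCHUNKS] __attribute__((aligned(8)));

/* Gather NCHUNKS blocks scattered in main storage into buf with a single
   list command.  All element addresses must lie in the 4 GB region selected
   by the upper 32 bits of ea_base.                                          */
void gather(uint64_t ea_base, uint32_t offsets[NCHUNKS])
{
    unsigned int i, tag = 1;

    for (i = 0; i < NCHUNKS; i++) {
        dma_list[i].notify = 0;                               /* no stall-and-notify      */
        dma_list[i].size   = CHUNK_SIZE;                      /* bytes for this element   */
        dma_list[i].eal    = mfc_ea2l(ea_base) + offsets[i];  /* low 32 bits of address   */
    }
    /* one command executes the whole list of transfer requests */
    mfc_getl((void *)buf, ea_base, dma_list, sizeof(dma_list), tag, 0, 0);

    mfc_write_tag_mask(1 << tag);   /* wait for the list to complete */
    mfc_read_tag_status_all();
}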

36 DMA Example: Read into Local Store

inline void dma_mem_to_ls(unsigned int mem_addr, volatile void *ls_addr, unsigned int size)
{
    unsigned int tag = 0;
    unsigned int mask = 1;

    mfc_get(ls_addr, mem_addr, size, tag, 0, 0);  /* read contents of mem_addr into ls_addr */
    mfc_write_tag_mask(mask);                     /* set tag mask                           */
    mfc_read_tag_status_all();                    /* wait until all tagged DMAs complete    */
}

Sri Sathya Sai University, Prasanthi Nilayam

37 DMA Example: Write to Main Memory

inline void dma_ls_to_mem(unsigned int mem_addr, volatile void *ls_addr, unsigned int size)
{
    unsigned int tag = 0;
    unsigned int mask = 1;

    mfc_put(ls_addr, mem_addr, size, tag, 0, 0);  /* write contents of ls_addr to mem_addr  */
    mfc_write_tag_mask(mask);                     /* set tag mask                           */
    mfc_read_tag_status_all();                    /* wait until all tagged DMAs complete    */
}

Sri Sathya Sai University, Prasanthi Nilayam

38 SPE-to-SPE DMA An address in the other SPE's local store is represented as a 32-bit effective address (global address), so the SPE issuing the DMA command needs a pointer to the other SPE's local store as a 32-bit effective address. PPE code can obtain the effective address of an SPE's local store:

#include <libspe.h>

speid_t speid;
void *spe_ls_addr;
...
spe_ls_addr = spe_get_ls(speid);

The effective address of an SPE's local store can then be made available to other SPEs (e.g. via DMA or mailbox). Sri Sathya Sai University, Prasanthi Nilayam

39 Mailboxes Each MFC provides 3 mailboxes. SPU outbound mailbox queue: o SPE writes, PPE reads o 1-deep o the SPE stalls when writing to a full mailbox. SPU outbound interrupt mailbox: o same as above, but an interrupt is posted to the PPE when the mailbox is written. SPU inbound mailbox queue: o PPE writes, SPE reads o 4-deep o can be overwritten. Sri Sathya Sai University, Prasanthi Nilayam
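A small sketch of typical mailbox usage, assuming the spu_mfcio.h intrinsics on the SPU side and the libspe2 mailbox calls on the PPE side; the variable names and the trivial "work" are illustrative:

/* SPU side: wait for a command from the PPE, then report a result. */
#include <spu_mfcio.h>

void spu_mailbox_demo(void)
{
    unsigned int cmd    = spu_read_in_mbox();   /* blocks until the PPE writes  */
    unsigned int result = cmd + 1;              /* stand-in for real work       */
    spu_write_out_mbox(result);                 /* stalls if the mailbox is full */
}

/* PPE side (libspe2): send a command and wait for the reply. */
#include <libspe2.h>

void ppe_mailbox_demo(spe_context_ptr_t ctx)
{
    unsigned int cmd = 42, result;
    spe_in_mbox_write(ctx, &cmd, 1, SPE_MBOX_ALL_BLOCKING);  /* write SPU inbound mailbox   */
    while (spe_out_mbox_status(ctx) == 0)                    /* poll until the SPE replies  */
        ;
    spe_out_mbox_read(ctx, &result, 1);                      /* read SPU outbound mailbox   */
}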

40 Performance Optimization Tips Cell micro-architecture features are exposed not only to its compilers but also to its applications; performance gains from tuning compilers and applications can be significant; tools and simulators are provided to assist in performance optimization efforts. Sri Sathya Sai University, Prasanthi Nilayam

41 Double Buffering Consider an SPE program that requires a large amount of data from main memory. The simplest scheme is: o start the DMA transfer from main storage to buffer B in the LS, o wait for the DMA transfer to complete, o use the data in buffer B, o repeat. A lot of time is wasted waiting for the DMA transfer to complete. We can speed up the process by o allocating two buffers, B0 and B1, and o overlapping computation on one buffer with the data transfer into the other. Double buffering is a form of multi-buffering where multiple buffers are used in a circular queue to overlap data transfer with processing.

42 Overlap DMA with computation Double- or multi-buffer code or (typically) data. Example for double buffering n+1 data blocks: o use multiple buffers in the local store o use a unique DMA tag ID for each buffer o use fence commands to order DMAs within a tag group o use barrier commands to order DMAs within a queue. The purpose of double buffering is to o maximize the time spent in the compute phase of a program and o minimize the time spent waiting for DMA transfers to complete. A sketch follows below. Sri Sathya Sai University, Prasanthi Nilayam
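A minimal double-buffering sketch along these lines, using the mfc_get and tag calls shown earlier; NBLOCKS, BLOCK_SIZE and process_block() are illustrative assumptions:

#include <spu_mfcio.h>

#define BLOCK_SIZE 16384   /* bytes per DMA (maximum single-transfer size) */
#define NBLOCKS    64      /* illustrative number of blocks to process     */

static volatile char buf[2][BLOCK_SIZE] __attribute__((aligned(128)));

extern void process_block(volatile char *data, unsigned int size);

void double_buffered_read(unsigned long long ea)
{
    unsigned int i, cur = 0, next = 1;

    /* prefetch the first block into buffer 0, using the buffer index as tag */
    mfc_get(buf[cur], ea, BLOCK_SIZE, cur, 0, 0);

    for (i = 1; i < NBLOCKS; i++) {
        /* start fetching the next block into the other buffer (its own tag) */
        mfc_get(buf[next], ea + (unsigned long long)i * BLOCK_SIZE,
                BLOCK_SIZE, next, 0, 0);

        /* wait only for the current buffer, then compute on it */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_block(buf[cur], BLOCK_SIZE);

        cur  ^= 1;   /* swap buffers */
        next ^= 1;
    }

    /* finish the last block */
    mfc_write_tag_mask(1 << cur);
    mfc_read_tag_status_all();
    process_block(buf[cur], BLOCK_SIZE);
}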

43 Start DMAs from SPU Use SPE-initiated DMA transfers rather than PPE-initiated DMA transfers, because o there are more SPEs than the one PPE o the PPE can enqueue only eight DMA requests whereas each SPE can enqueue 16 Sri Sathya Sai University, Prasanthi Nilayam

44 EIB Data Topology The nominal design speed for the PPE core and the SPUs is 3.2 GHz. The bus speed is half the SPU speed, and the peak bus bandwidth is 96 B per core cycle each way. There are four rings, with a maximum of 3 simultaneous transfers per ring at 8 B per core cycle each. Theoretical peak bandwidth: 204.8 GB/s.

45 Design for Limited Local Store The Local Store holds up to 256 KB for the program, stack, local data structures, and DMA buffers. Most performance optimizations put pressure on local store (e.g. multiple DMA buffers) Sri Sathya Sai University, Prasanthi Nilayam

46 SIMD SIMD exploits data-level parallelism: a single instruction can apply the same operation to multiple data elements in parallel. SIMD units employ vector registers: o each register holds multiple data elements. SIMD is pervasive in the BE: o the PPE includes VMX (SIMD extensions to the PPC architecture) o the SPE is a native SIMD architecture (VMX-like). SIMD in VMX and SPE: o 128-bit-wide datapath o 128-bit-wide registers o 4-wide fullwords, 8-wide halfwords, 16-wide bytes o the SPE includes support for 2-wide doublewords.
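As a tiny illustration of SPE SIMD code, the sketch below multiplies and accumulates four single-precision floats per instruction using the spu_madd intrinsic from spu_intrinsics.h; the array length and names are illustrative, and the arrays are assumed to be 16-byte aligned:

#include <spu_intrinsics.h>

#define N 1024   /* illustrative length, a multiple of 4 */

/* y[i] = a[i] * b[i] + y[i], computed four elements at a time */
void vec_madd(float a[N], float b[N], float y[N])
{
    vector float *va = (vector float *)a;
    vector float *vb = (vector float *)b;
    vector float *vy = (vector float *)y;
    int i;

    for (i = 0; i < N / 4; i++)
        vy[i] = spu_madd(va[i], vb[i], vy[i]);  /* 4-wide fused multiply-add */
}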

47 SPE Pipeline and Dual-Issue Rules The SPE has two pipelines, even (pipeline 0) and odd (pipeline 1), and can issue and complete two instructions per cycle. Sri Sathya Sai University, Prasanthi Nilayam

48 Instruction Scheduling Sri Sathya Sai University, Prasanthi Nilayam

49 Loop Unrolling Unroll loops to reduce dependencies and increase dual-issue rates. This exploits the large SPU register file. Compiler auto-unrolling is not perfect, but pretty good. Example:

for (i = 1; i < 100; i++) {
    a[i] = b[i+2] * c[i-1];
}

unrolled by a factor of two becomes:

for (i = 1; i < 99; i += 2) {
    a[i]   = b[i+2] * c[i-1];
    a[i+1] = b[i+3] * c[i];
}

50 Loop Unrolling Performance Sri Sathya Sai University, Prasanthi Nilayam

51 SPU Timing Tool (Static Profiling) Provides a static timing analysis of compiled SPE code based on issue rules, pipeline latencies, and static dependencies; assumes all branches are not taken; cannot account for data-dependent behaviour. Example: sscal (scaling a vector).

52 SPU Software Pipeline Sri Sathya Sai University, Prasanthi Nilayam

53 Simulator Dynamic Profiling. Sri Sathya Sai University, Prasanthi Nilayam

54 Benchmarking on the Hardware Use the SPE decrementer register value to time a piece of code:

start_timer   // initialize the decrementer register to some large value
{ code to be timed }
stop_timer    // read the decrementer register value

(start - stop) / DECREMENTER_FREQ gives the time taken in seconds (the decrementer counts down at the timebase frequency). The timebase information can be read from /proc/cpuinfo. Sri Sathya Sai University, Prasanthi Nilayam
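A small sketch of this timing scheme on the SPU, using the decrementer intrinsics from spu_mfcio.h; TIMEBASE_HZ and code_to_be_timed() are placeholders, and the value used here is only an example that must be replaced by the timebase reported in /proc/cpuinfo on your system:

#include <spu_intrinsics.h>
#include <spu_mfcio.h>
#include <stdio.h>

/* Replace with the "timebase" value from /proc/cpuinfo on your machine. */
#define TIMEBASE_HZ 79800000.0

extern void code_to_be_timed(void);

void benchmark(void)
{
    unsigned int start, stop;

    spu_write_decrementer(0x7fffffff);   /* start timer: load a large value */
    start = spu_read_decrementer();
    code_to_be_timed();
    stop  = spu_read_decrementer();      /* stop timer: read current value  */

    /* the decrementer counts down, so elapsed ticks = start - stop */
    printf("elapsed: %f s\n", (double)(start - stop) / TIMEBASE_HZ);
}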

55 Branch Optimizations The SPE is heavily pipelined, so the penalty for branch misses is high (18 cycles). Hardware policy: assume all branches are not taken. Advantages: reduced hardware complexity, faster clock cycles, increased predictability. Solution approaches: if-conversion (compare and select operations); predication and code reorganization (compiler analysis, user directives); branch hint instruction (hbr, issued 11 cycles before the branch). A small example follows below. Sri Sathya Sai University, Prasanthi Nilayam
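As a small illustration of the if-conversion idea (compare and select instead of a data-dependent branch), the sketch below computes an element-wise maximum with the spu_cmpgt and spu_sel intrinsics; the function name is illustrative:

#include <spu_intrinsics.h>

/* Branch-free element-wise maximum: where a > b, select a, otherwise b. */
vector float vec_max(vector float a, vector float b)
{
    vector unsigned int gt = spu_cmpgt(a, b);   /* all-ones where a > b      */
    return spu_sel(b, a, gt);                   /* select a where gt is set  */
}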

56 Branches Sri Sathya Sai University, Prasanthi Nilayam

57 Integer Multiplies Avoid integer multiplies on operands greater than 16 bits: the SPU supports only a 16-bit x 16-bit multiply, so a 32-bit multiply requires five instructions (three 16-bit multiplies and two adds). Keep array elements sized to a power of 2 to avoid multiplies when indexing. Cast operands to unsigned short prior to multiplying; constants are of type int and also require casting. Use a macro to explicitly perform 16-bit multiplies, which can avoid the inadvertent introduction of sign extends and masks due to casting.

#define MULTIPLY(a, b) \
    (spu_extract(spu_mulo((vector unsigned short)spu_promote(a, 0), \
                          (vector unsigned short)spu_promote(b, 0)), 0))

Sri Sathya Sai University, Prasanthi Nilayam

58 Avoid Scalar Code Sri Sathya Sai University, Prasanthi Nilayam

59 Summary Get code running on PPU (easy port) Deal with memory alignment concerns Identify candidate code for porting to SPUs Parallelize problem for SPE utilization Establish communication methodology Port scalar code to SPUs Optimize data transfers (double-buffering) SIMDize code and unroll loops Use Performance Analysis Tools 09/26/08
