Programming for Performance on the Cell BE processor & Experiences at SSSU. Sri Sathya Sai University


1 Programming for Performance on the Cell BE processor & Experiences at SSSU Sri Sathya Sai University

2 THE STI CELL PROCESSOR The inevitable shift to the era of multi-core computing. The 9-core Cell microprocessor was developed jointly by IBM, Sony and Toshiba. It is a novel heterogeneous architecture: 1 control-intensive 64-bit PowerPC core + 8 compute-intensive SIMD co-processors (SPEs), a supercomputer on a chip. Cell highlights: peak speeds of 204.8 Gflop/s in SP and 14.6 Gflop/s in DP (SPEs only); 204.8 GB/s internal EIB bandwidth; 25.6 GB/s memory bandwidth; 76.8 GB/s I/O bandwidth.

3 MPI on Cell BE

4 CELL FOR HPC - THE CHALLENGE Cell has tremendous potential for scientific computation and has generated a lot of interest in the HPC community (source: "The Potential of the Cell Processor for Scientific Computing", Proceedings of the ACM International Conference on Computing Frontiers, 2006). Software has to be rewritten to use the multiple cores effectively, and applications need significant changes to fully exploit the novel architecture. For Cell to be a success, software adoption is the key. Cell is a challenging environment for software development: unconventional architecture of the SPEs, small size (256 KB) of the SPE Local Store (LS), Cell-specific programming models, and control-plane vs. data-plane processing. Is there a way to utilize the enormous potential of the Cell, and at the same time avoid the painful process of rewriting software?

5 OUR SOLUTION Implementation of MPI-1, which enables running a large number of scientific applications without much effort. Salient features of our implementation: Cell is treated as an 8-node SMP (Intra-Cell MPI); 16 SPEs can be used on a Cell Blade, but NUMA aspects then come into play; core features of MPI-1 have been implemented. The results we obtained by running on a 3.2 GHz Cell Blade are encouraging.

6 Intra-Cell MPI Design Choices Cell features: in-order execution, but DMAs can be out of order; over 100 simultaneous DMAs can be in flight. Constraints: unconventional, heterogeneous architecture; SPEs have limited functionality and can act directly only on their local stores; SPEs access main memory through DMA; use of the PPE should be limited to get good performance. MPI design choices: application data in (i) local store or (ii) main memory; MPI data in (i) local store or (ii) main memory; PPE involvement (i) active or (ii) only during initialization and finalization; collective calls can (i) synchronize or (ii) not synchronize.

7 Blocking Point-to-Point Communication Blocking point-to-point communication calls form the core component of MPI; collective communication calls can be built on top of them. Calls implemented: MPI_Send, MPI_Recv, MPI_Bsend, MPI_Ssend, MPI_Rsend.

8 Fully SPE-Centric Approach with Data in Main Memory Application data is present in main memory. N message buffers are located in main memory; meta-data buffers are located in the local stores. The SPEs perform the buffer management and interact with the PPE only during initialization. SPEs move the message from the sender's buffer to the receiver's buffer via buffers in the local store. Meta-data entry fields: Location, Tag, Datatype, Size, Flag.

9 DESIGN ISSUES Lock-Free Data Structure The local store is single-ported: at a given clock cycle, either a DMA operation or the load/store unit of the SPE can access it, but not both. DMA writes are in units of 128 bytes. The meta-data entry is less than 128 bytes and will therefore be seen in full, or not seen at all.
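A minimal sketch of what such a meta-data entry might look like in C. The field names Location, Tag, Datatype, Size and Flag come from the slide above, while the exact types and the 128-byte alignment attribute are illustrative assumptions:

/* Hypothetical meta-data entry for the lock-free scheme described above.
   Alignment to 128 bytes means the entry is written by a single DMA and
   is therefore seen either in full or not at all.                        */
typedef struct md_entry {
    unsigned long long location;    /* effective address of the message buffer */
    int                tag;         /* MPI message tag                          */
    int                datatype;    /* encoded MPI datatype                     */
    unsigned int       size;        /* message size in bytes                    */
    volatile unsigned int flag;     /* written last; marks the entry as valid   */
} __attribute__((aligned(128))) md_entry_t;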

10 Communication Performance The best latency obtained is 0.41 µs. For a 0-byte message, the send involves one DMA for the meta-data transfer and the receive operation does one DMA to signal the sender, which accounts for half of the latency. The bandwidth obtained was just above 6 GB/s; in the presence of congestion the peak bandwidth dropped to 4.48 GB/s.

11 Comparison of the performance of MPI on Cell with other MPI implementations

MPI/Platform        Latency (0 byte)   Maximum throughput
Cell                0.41 µs            6.01 GB/s
Cell (congested)    NA                 4.48 GB/s
Cell (small)        0.65 µs            -
Nemesis/Xeon        1.1 µs             0.65 GB/s
Shm/Xeon            1.3 µs             0.5 GB/s
Open MPI/Xeon       2.8 µs             0.5 GB/s
Nemesis/Opteron     ~0.35 µs           ~1.5 GB/s
Open MPI/Opteron    0.7 µs             1.0 GB/s
TMPI/Origin         -                  -

12 Application Performance Achieved double-precision throughput of 5.85 Gflop/s for matrices of size 512 and 5.66 Gflop/s for matrices of size 1024 in matrix multiplication. Achieved a peak double-precision throughput of 7.8 Gflop/s for matrices of size 1024 in matrix-vector multiplication. While this is low compared to the peak performance of 14.6 Gflop/s, the Cell still gives better results than an Opteron processor for the same algorithm: a 2 GHz dual-processor (not dual-core) Opteron with Gigabit interconnect yielded 40 Mflop/s for the same algorithm with one worker processor, and 210 Mflop/s with seven worker processors (the same count as for the Cell). [Figure: comparison of the speedup obtained for DP matrix-vector multiplication on Cell SPEs with that on an SMP of 2 GHz dual-processor Opterons]

13 Collective Communication Calls Barrier, Broadcast, Scatter, Gather, Allgather, Alltoall

14 Factors required in optimizing the collective calls are: (a) assigning effective unique IDs (ranks), and (b) using efficient algorithms. Work done: Effective ranks: experiments were performed on the EIB to understand the reasons for bandwidth reduction, and based on these findings several ranking schemes were studied to obtain effective ranks. Efficient algorithms: several algorithms were implemented in order to choose the best one among them. Effective ranks + efficient algorithm = optimized collective call.

15 Summary The Cell processor has good potential for MPI implementations. The PPE should have a limited role. High bandwidth and low latency are achievable even with application data in main memory, but the local store should be used effectively, with double buffering to hide latency; main memory bandwidth is then the bottleneck. Thread-SPE affinity plays an important role in optimizing the performance of collective calls on the intra-Cell BE.

16 Programming for Performance on the Cell BE processor 09/26/08

17 Outline Introduction to Cell BE architecture IBM Software Development Kit Programming on Cell BE o Programming the SPE o Programming the communication between PPE and SPE Optimizations specific to the SPE Performance Analysis Tools o Static profiling o Dynamic profiling

18 Overview of the Cell Broadband Engine One Power Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). Heterogeneous Cell multi-core architecture: the Power Processor Element handles control tasks, while the Synergistic Processor Elements handle data-intensive processing. Each Synergistic Processor Element (SPE) contains a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC) responsible for data movement and synchronization and for interfacing to the high-performance Element Interconnect Bus (EIB). Source: some of the figures and material on Cell BE used in this presentation are collected from various workshop presentations, web pages and books of IBM. Sri Sathya Sai University, Prasanthi Nilayam

19 How Cell BE addresses these Walls Increased efficiency and performance. Attacks on the Power Wall: non-homogeneous coherent multiprocessor; high design frequency at a low operating voltage with advanced power management (45 nm: 0.8 V, 50 W, 3.2 GHz). Attacks on the Memory Wall: streaming DMA architecture; 128 simultaneous transfers between the eight SPE local stores and main storage; 3-level memory model (main storage, local storage, register files). Attacks on the Frequency Wall: highly optimized implementation; large shared register files and software-controlled branching to allow deeper pipelines. Sri Sathya Sai University, Prasanthi Nilayam

20 Micro-architectural Decisions Large shared register file; local store size tradeoffs; dual-issue, in-order execution; software branch prediction; SPE channel interface. Sri Sathya Sai University, Prasanthi Nilayam

21 Cell BE Peak Performance In single precision, in one cycle each SPE can: o process a four-element vector, o perform two operations on each element. 8 (SPEs) x 3.2 GHz x 4 (four 32-bit words in a vector) x 2 (multiply-adds counted as 2 operations) = 204.8 SP Gflop/s. Each SPE is capable of 25.6 SP Gflop/s. In one cycle the VMX on the PPE can: process a four-element vector, and perform two operations on each element. 4 x 2 x 3.2 GHz = 25.6 SP Gflop/s. Sri Sathya Sai University, Prasanthi Nilayam

22 Cell BE Peak Performance In double precision, every seven cycles each SPE can: o process a two-element vector, o perform two operations on each element. 8 (SPEs) x 3.2 GHz x 2 (two 64-bit doublewords in a vector) x 2 (multiply-adds counted as 2 operations) / 7 = 14.6 DP Gflop/s. In one cycle the FPU on the PPE can process one element and perform two operations on the element: 2 x 3.2 GHz = 6.4 Gflop/s.

23 Cell BE Programming Near theoretical-maximum performance is attainable on real applications, but you need to be aware of the architectural characteristics: multiple heterogeneous execution units, SIMD, limited local store, software-managed cache, memory access latencies, dual-issue rules, large and wide register files, quad-word memory accesses, branch prediction and synchronization facilities.

24 Programmer Experience [Figure: Cell software environment stack] Components include: code development tools, samples, workloads and demos; debug tools; the SPE management library and application libraries; performance and miscellaneous tools; Linux PPC64 with Cell extensions and a hypervisor; a hardware or system-level simulator; and standards (language extensions, ABI). Sri Sathya Sai University, Prasanthi Nilayam

25 Simulator Overview [Figure: software stack running on SystemSim] The development environment comprises programming tools, programming models (OpenMP, MPI), compilers, executables, and runtime libraries on top of system software (hypervisor, Linux/PPC or K42). SystemSim simulates the Cell BE hardware (ROM, disks, DMA, UART, L1/L2 caches, interrupt controller, memory bus) and runs on real host systems (PowerPC, Intel x86, x86-64) under BE Linux (Fedora Core 7), with a GUI, console windows and traces. Source: IBM workshop presentations

26 Cell Simulator Environment Execution Environment Source: IBM Georgia Tech workshop presentations

27 Cell BE Programming Practices Programming the PPE is straightforward: it is multi-threaded and has SIMD (VMX) support. Offload as much computation onto the SPEs as possible, and use the PPE as a control processor...

28 How to create SPE threads Basic steps in a typical PPE/SPE program flow: 1. The PPE program loads the SPE program into the LS. 2. The PPE program instructs the SPE to execute the SPE program. 3. The SPE program transfers the required data from main memory to the LS. 4. The SPE program processes the data in the LS in accordance with the requirements. 5. The SPE program transfers the processed result from the LS to main memory. 6. The SPE program notifies the PPE program of the termination of processing. (source: Cell Programming Primer)

29 SPE program execution The SPE program is invoked from the PPE program as follows: 1. Open the SPE program image. spe_program_handle_t *spe_image_open(const char *filename); 2. Create the SPE context. spe_context_ptr_t spe_context_create(unsigned int flags, spe_gang_context_ptr_t gang); 3. Load the SPE program into the LS. int spe_program_load(spe_context_ptr_t spe, spe_program_handle_t *program); 4. Execute the loaded SPE program. int spe_context_run(spe_context_ptr_t spe, unsigned int *entry, unsigned int runflags, void *argp, void *envp, spe_stop_info_t *stopinfo); 5. Destroy the SPE context. int spe_context_destroy(spe_context_ptr_t spe); 6. Close the SPE program image. int spe_image_close(spe_program_handle_t *program); Sri Sathya Sai University, Prasanthi Nilayam
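Putting these calls together, a minimal PPE-side sketch of this sequence might look as follows; error handling is omitted and "spe_kernel" is a hypothetical name for the embedded SPE executable:

#include <libspe2.h>

int main(void)
{
    spe_program_handle_t *prog;
    spe_context_ptr_t ctx;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_stop_info_t stop_info;

    prog = spe_image_open("spe_kernel");     /* 1. open the SPE program image       */
    ctx  = spe_context_create(0, NULL);      /* 2. create the SPE context           */
    spe_program_load(ctx, prog);             /* 3. load the program into the LS     */
    spe_context_run(ctx, &entry, 0,          /* 4. run (blocks until the SPE stops) */
                    NULL, NULL, &stop_info);
    spe_context_destroy(ctx);                /* 5. destroy the SPE context          */
    spe_image_close(prog);                   /* 6. close the SPE program image      */
    return 0;
}

Since spe_context_run blocks the calling thread, a real application usually runs each SPE context from its own PPE pthread so that several SPEs execute concurrently.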

30 What is different about the BE architecture? The memory subsystem. SPE memory architecture: o each SPE has its own flat local memory o management of this memory is by explicit software control o code and data are moved into and out of this memory by DMA o programming the DMA is explicit in SPE code (or in PPC code). Many DMA transactions can be in flight simultaneously: o e.g. each SPE can have 16 simultaneous outstanding DMA requests o a DMA request can be a list of DMA requests o DMA latencies can be hidden using multiple buffers and loop blocking in code. This is in contrast to traditional, hierarchical memory architectures that support few simultaneous memory transactions. Implications for programming the BE: applications must be partitioned across the processing elements, taking into account the limited local memory available to each SPE.

31 Cell's Primary Communication Mechanisms DMA transfers, mailbox messages, and signal notification. All three are implemented and controlled by the SPE's MFC. The MFC DMA queue has 16 entries for the SPU and 8 entries for the PPU. [Figure: MFC block diagram showing the local store, SXU, DMA engine, atomic facility, MMU, DMA queue, RMT, MMIO registers, and the data, snoop and control buses]

32 MFC Commands MFC commands are the main mechanism for SPUs to access main storage (DMA commands) and to maintain synchronization with other processors and devices in the system (synchronization commands). MFC commands can be issued by the SPU via its MFC: code running on the SPU issues an MFC command by executing a series of writes and/or reads using the channel instructions read channel (rdch), write channel (wrch), and read channel count (rchcnt). The PPE and other devices can also issue MFC commands: code running on the PPE or other devices issues an MFC command by performing a series of stores and/or loads to memory-mapped I/O (MMIO) registers in the MFC. MFC commands are queued in one of two independent MFC command queues: the MFC SPU Command Queue, for channel-initiated commands by the associated SPU, and the MFC Proxy Command Queue, for MMIO-initiated commands by the PPE or other devices.

33 DMA Commands MFC commands that transfer data are referred to as DMA commands. The transfer direction for DMA commands is referenced from the SPE. Into an SPE (from main storage to local store): mfc_get(lsaddr, ea, size, tag_id, tid, rid);
  lsaddr :- target address in SPU local store for the fetched data (SPU local address)
  ea     :- effective address from which data is fetched (global address)
  size   :- transfer size in bytes
  tag_id :- tag-group identifier
  tid    :- transfer-class id
  rid    :- replacement-class id
Out of an SPE (from local store to main storage): mfc_put

34 DMA Commands Three interfaces are available: channel control intrinsics (e.g. spu_writech), composite intrinsics (e.g. spu_mfcdma32), and MFC commands (e.g. mfc_get), the latter defined as macros in spu_mfcio.h. For details see: SPU C/C++ Language Extensions

35 DMA Characteristics DMA transfers o transfer sizes can be 1, 2, 4, 8, and n*16 bytes (n integer) o maximum is 16KB per DMA transfer o 128B alignment is preferable DMA command queues per SPU o 16-element queue for SPU-initiated requests o 8-element queue for PPE-initiated requests o SPU-initiated DMA is always preferable DMA tags o each DMA command is tagged with a 5-bit identifier o same identifier can be used for multiple commands o tags used for polling status or waiting on completion of DMA commands DMA lists o a single DMA command can cause execution of a list of transfer requests (in LS) o lists implement scatter-gather functions o a list can contain up to 2K transfer requests
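As a rough illustration of the scatter-gather DMA lists mentioned above, the sketch below builds a small list of transfers and issues it with a single mfc_getl command. It assumes the mfc_list_element_t layout and the mfc_getl macro from spu_mfcio.h; the chunk count, chunk size and function name are illustrative:

#include <stdint.h>
#include <spu_mfcio.h>

#define NCHUNKS    4
#define CHUNK_SIZE 4096

/* Destination buffer and the DMA list itself both live in the local store. */
static volatile char buf[NCHUNKS * CHUNK_SIZE] __attribute__((aligned(128)));
static mfc_list_element_t dma_list[NCHUNKS] __attribute__((aligned(8)));

/* Gather NCHUNKS blocks scattered in main storage into buf with a single
   list command.  All element addresses must lie in the 4 GB region selected
   by the upper 32 bits of ea_base.                                          */
void gather(uint64_t ea_base, uint32_t offsets[NCHUNKS])
{
    unsigned int i, tag = 1;

    for (i = 0; i < NCHUNKS; i++) {
        dma_list[i].notify = 0;                               /* no stall-and-notify      */
        dma_list[i].size   = CHUNK_SIZE;                      /* bytes for this element   */
        dma_list[i].eal    = mfc_ea2l(ea_base) + offsets[i];  /* low 32 bits of address   */
    }
    /* one command executes the whole list of transfer requests */
    mfc_getl((void *)buf, ea_base, dma_list, sizeof(dma_list), tag, 0, 0);

    mfc_write_tag_mask(1 << tag);   /* wait for the list to complete */
    mfc_read_tag_status_all();
}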

36 DMA Example: Read into Local Store

inline void dma_mem_to_ls(unsigned int mem_addr, volatile void *ls_addr, unsigned int size)
{
    unsigned int tag = 0;
    unsigned int mask = 1;

    mfc_get(ls_addr, mem_addr, size, tag, 0, 0);  /* read contents of mem_addr into ls_addr */
    mfc_write_tag_mask(mask);                     /* set tag mask                           */
    mfc_read_tag_status_all();                    /* wait until all tagged DMAs complete    */
}

Sri Sathya Sai University, Prasanthi Nilayam

37 DMA Example: Write to Main Memory

inline void dma_ls_to_mem(unsigned int mem_addr, volatile void *ls_addr, unsigned int size)
{
    unsigned int tag = 0;
    unsigned int mask = 1;

    mfc_put(ls_addr, mem_addr, size, tag, 0, 0);  /* write contents of ls_addr to mem_addr  */
    mfc_write_tag_mask(mask);                     /* set tag mask                           */
    mfc_read_tag_status_all();                    /* wait until all tagged DMAs complete    */
}

Sri Sathya Sai University, Prasanthi Nilayam

38 SPE-to-SPE DMA An address in the other SPE's local store is represented as a 32-bit effective address (global address), so the SPE issuing the DMA command needs a pointer to the other SPE's local store as a 32-bit effective address. PPE code can obtain the effective address of an SPE's local store:

#include <libspe.h>

speid_t speid;
void *spe_ls_addr;
...
spe_ls_addr = spe_get_ls(speid);

The effective address of an SPE's local store can then be made available to other SPEs (e.g. via DMA or mailbox). Sri Sathya Sai University, Prasanthi Nilayam

39 Mailboxes Each MFC provides 3 mailboxes. SPU outbound mailbox queue: o SPE writes, PPE reads o 1-deep o the SPE stalls when writing to a full mailbox. SPU outbound interrupt mailbox: o same as above, but an interrupt is posted to the PPE when the mailbox is written. SPU inbound mailbox queue: o PPE writes, SPE reads o 4-deep o can be overwritten. Sri Sathya Sai University, Prasanthi Nilayam
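A small sketch of typical mailbox usage, assuming the spu_mfcio.h intrinsics on the SPU side and the libspe2 mailbox calls on the PPE side; the variable names and the trivial "work" are illustrative:

/* SPU side: wait for a command from the PPE, then report a result. */
#include <spu_mfcio.h>

void spu_mailbox_demo(void)
{
    unsigned int cmd    = spu_read_in_mbox();   /* blocks until the PPE writes  */
    unsigned int result = cmd + 1;              /* stand-in for real work       */
    spu_write_out_mbox(result);                 /* stalls if the mailbox is full */
}

/* PPE side (libspe2): send a command and wait for the reply. */
#include <libspe2.h>

void ppe_mailbox_demo(spe_context_ptr_t ctx)
{
    unsigned int cmd = 42, result;
    spe_in_mbox_write(ctx, &cmd, 1, SPE_MBOX_ALL_BLOCKING);  /* write SPU inbound mailbox   */
    while (spe_out_mbox_status(ctx) == 0)                    /* poll until the SPE replies  */
        ;
    spe_out_mbox_read(ctx, &result, 1);                      /* read SPU outbound mailbox   */
}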

40 Performance Optimization Tips Cell micro-architecture features are exposed not only to its compilers but also to its applications; performance gains from tuning compilers and applications can be significant; tools and simulators are provided to assist in performance optimization efforts. Sri Sathya Sai University, Prasanthi Nilayam

41 Double Buffering Consider an SPE program that requires a large amount of data from main memory. The simplest scheme is: o start the DMA transfer from main storage to buffer B in the LS, o wait for the DMA transfer to complete, o use the data in buffer B, o repeat. A lot of time is wasted waiting for the DMA transfer to complete. We can speed up the process by o allocating two buffers, B0 and B1, and o overlapping computation on one buffer with the data transfer into the other. Double buffering is a form of multi-buffering where multiple buffers are used in a circular queue to overlap data transfer with processing.

42 Overlap DMA with computation Double- or multi-buffer code or (typically) data. Example for double buffering n+1 data blocks: o use multiple buffers in the local store o use a unique DMA tag ID for each buffer o use fence commands to order DMAs within a tag group o use barrier commands to order DMAs within a queue. The purpose of double buffering is to o maximize the time spent in the compute phase of a program and o minimize the time spent waiting for DMA transfers to complete. A sketch follows below. Sri Sathya Sai University, Prasanthi Nilayam
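A minimal double-buffering sketch along these lines, using the mfc_get and tag calls shown earlier; NBLOCKS, BLOCK_SIZE and process_block() are illustrative assumptions:

#include <spu_mfcio.h>

#define BLOCK_SIZE 16384   /* bytes per DMA (maximum single-transfer size) */
#define NBLOCKS    64      /* illustrative number of blocks to process     */

static volatile char buf[2][BLOCK_SIZE] __attribute__((aligned(128)));

extern void process_block(volatile char *data, unsigned int size);

void double_buffered_read(unsigned long long ea)
{
    unsigned int i, cur = 0, next = 1;

    /* prefetch the first block into buffer 0, using the buffer index as tag */
    mfc_get(buf[cur], ea, BLOCK_SIZE, cur, 0, 0);

    for (i = 1; i < NBLOCKS; i++) {
        /* start fetching the next block into the other buffer (its own tag) */
        mfc_get(buf[next], ea + (unsigned long long)i * BLOCK_SIZE,
                BLOCK_SIZE, next, 0, 0);

        /* wait only for the current buffer, then compute on it */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_block(buf[cur], BLOCK_SIZE);

        cur  ^= 1;   /* swap buffers */
        next ^= 1;
    }

    /* finish the last block */
    mfc_write_tag_mask(1 << cur);
    mfc_read_tag_status_all();
    process_block(buf[cur], BLOCK_SIZE);
}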

43 Start DMAs from SPU Use SPE-initiated DMA transfers rather than PPE-initiated DMA transfers, because o there are more SPEs than the one PPE o the PPE can enqueue only eight DMA requests whereas each SPE can enqueue 16 Sri Sathya Sai University, Prasanthi Nilayam

44 EIB Data Topology The nominal design speed for the PPE core and the SPUs is 3.2 GHz. The bus speed is half the SPU speed, and the peak bus bandwidth is 96 B per core cycle each way. There are four rings, with a maximum of 3 simultaneous transfers per ring at 8 B per core cycle each. Theoretical peak bandwidth: 204.8 GB/s.

45 Design for Limited Local Store The Local Store holds up to 256 KB for the program, stack, local data structures, and DMA buffers. Most performance optimizations put pressure on local store (e.g. multiple DMA buffers) Sri Sathya Sai University, Prasanthi Nilayam

46 SIMD SIMD exploits data-level parallelism: a single instruction can apply the same operation to multiple data elements in parallel. SIMD units employ vector registers: o each register holds multiple data elements. SIMD is pervasive in the BE: o the PPE includes VMX (SIMD extensions to the PPC architecture) o the SPE is a native SIMD architecture (VMX-like). SIMD in VMX and SPE: o 128-bit-wide datapath o 128-bit-wide registers o 4-wide fullwords, 8-wide halfwords, 16-wide bytes o the SPE includes support for 2-wide doublewords.
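As a tiny illustration of SPE SIMD code, the sketch below multiplies and accumulates four single-precision floats per instruction using the spu_madd intrinsic from spu_intrinsics.h; the array length and names are illustrative, and the arrays are assumed to be 16-byte aligned:

#include <spu_intrinsics.h>

#define N 1024   /* illustrative length, a multiple of 4 */

/* y[i] = a[i] * b[i] + y[i], computed four elements at a time */
void vec_madd(float a[N], float b[N], float y[N])
{
    vector float *va = (vector float *)a;
    vector float *vb = (vector float *)b;
    vector float *vy = (vector float *)y;
    int i;

    for (i = 0; i < N / 4; i++)
        vy[i] = spu_madd(va[i], vb[i], vy[i]);  /* 4-wide fused multiply-add */
}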

47 SPE Pipeline and Dual-Issue Rules The SPE has two pipelines, even (pipeline 0) and odd (pipeline 1), and can issue and complete two instructions per cycle. Sri Sathya Sai University, Prasanthi Nilayam

48 Instruction Scheduling Sri Sathya Sai University, Prasanthi Nilayam

49 Loop Unrolling Unroll loops to reduce dependencies and increase dual-issue rates. This exploits the large SPU register file. Compiler auto-unrolling is not perfect, but pretty good. Example:

for (i = 1; i < 100; i++) {
    a[i] = b[i+2] * c[i-1];
}

unrolled by a factor of two becomes:

for (i = 1; i < 99; i += 2) {
    a[i]   = b[i+2] * c[i-1];
    a[i+1] = b[i+3] * c[i];
}

50 Loop Unrolling Performance Sri Sathya Sai University, Prasanthi Nilayam

51 SPU Timing Tool (Static Profiling) Provides a static timing analysis of compiled SPE code based on issue rules, pipeline latencies, and static dependencies; assumes all branches are not taken; cannot account for data-dependent behaviour. Example: sscal (scaling a vector).

52 SPU Software Pipeline Sri Sathya Sai University, Prasanthi Nilayam

53 Simulator Dynamic Profiling. Sri Sathya Sai University, Prasanthi Nilayam

54 Benchmarking on the Hardware Use the SPE decrementer register value to time a piece of code:

start_timer   // initialize the decrementer register to some large value
{ code to be timed }
stop_timer    // read the decrementer register value

(start - stop) / DECREMENTER_FREQ gives the time taken in seconds (the decrementer counts down at the timebase frequency). The timebase information can be read from /proc/cpuinfo. Sri Sathya Sai University, Prasanthi Nilayam
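A small sketch of this timing scheme on the SPU, using the decrementer intrinsics from spu_mfcio.h; TIMEBASE_HZ and code_to_be_timed() are placeholders, and the value used here is only an example that must be replaced by the timebase reported in /proc/cpuinfo on your system:

#include <spu_intrinsics.h>
#include <spu_mfcio.h>
#include <stdio.h>

/* Replace with the "timebase" value from /proc/cpuinfo on your machine. */
#define TIMEBASE_HZ 79800000.0

extern void code_to_be_timed(void);

void benchmark(void)
{
    unsigned int start, stop;

    spu_write_decrementer(0x7fffffff);   /* start timer: load a large value */
    start = spu_read_decrementer();
    code_to_be_timed();
    stop  = spu_read_decrementer();      /* stop timer: read current value  */

    /* the decrementer counts down, so elapsed ticks = start - stop */
    printf("elapsed: %f s\n", (double)(start - stop) / TIMEBASE_HZ);
}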

55 Branch Optimizations The SPE is heavily pipelined, so the penalty for branch misses is high (18 cycles). Hardware policy: assume all branches are not taken. Advantages: reduced hardware complexity, faster clock cycles, increased predictability. Solution approaches: if-conversion (compare and select operations); predication and code reorganization (compiler analysis, user directives); branch hint instruction (hbr, issued 11 cycles before the branch). A small example follows below. Sri Sathya Sai University, Prasanthi Nilayam
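As a small illustration of the if-conversion idea (compare and select instead of a data-dependent branch), the sketch below computes an element-wise maximum with the spu_cmpgt and spu_sel intrinsics; the function name is illustrative:

#include <spu_intrinsics.h>

/* Branch-free element-wise maximum: where a > b, select a, otherwise b. */
vector float vec_max(vector float a, vector float b)
{
    vector unsigned int gt = spu_cmpgt(a, b);   /* all-ones where a > b      */
    return spu_sel(b, a, gt);                   /* select a where gt is set  */
}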

56 Branches Sri Sathya Sai University, Prasanthi Nilayam

57 Integer Multiplies Avoid integer multiplies on operands greater than 16 bits: the SPU supports only a 16-bit x 16-bit multiply, so a 32-bit multiply requires five instructions (three 16-bit multiplies and two adds). Keep array elements sized to a power of 2 to avoid multiplies when indexing. Cast operands to unsigned short prior to multiplying; constants are of type int and also require casting. Use a macro to explicitly perform 16-bit multiplies, which can avoid the inadvertent introduction of sign extends and masks due to casting.

#define MULTIPLY(a, b) \
    (spu_extract(spu_mulo((vector unsigned short)spu_promote(a, 0), \
                          (vector unsigned short)spu_promote(b, 0)), 0))

Sri Sathya Sai University, Prasanthi Nilayam

58 Avoid Scalar Code Sri Sathya Sai University, Prasanthi Nilayam

59 Summary Get code running on PPU (easy port) Deal with memory alignment concerns Identify candidate code for porting to SPUs Parallelize problem for SPE utilization Establish communication methodology Port scalar code to SPUs Optimize data transfers (double-buffering) SIMDize code and unroll loops Use Performance Analysis Tools 09/26/08
