Phylogenetic Inference on Cell Processors

Department of Computer Science
University of Aarhus
Denmark

Phylogenetic Inference on Cell Processors

Master's Thesis
Martin Simonsen
September 11, 2008

Supervisor: Christian Nørgaard Storm Pedersen

Abstract

Phylogenetics is a problem in bioinformatics which requires high-performance hardware and fast algorithms to process a constantly increasing amount of data. This thesis addresses both issues by presenting a new search heuristic, called RapidNJ, for the well-known Neighbour-Joining method, and by evaluating the applicability of the Playstation 3's Cell processor in distance-based phylogenetic inference. RapidNJ significantly reduces the execution time of the Neighbour-Joining method and makes construction of large phylogenetic trees feasible. Experiments are presented which show that RapidNJ outperforms both state-of-the-art heuristics and approximation algorithms. The Cell processor has attracted much attention because of its high performance compared to mainstream processors and has proved useful for efficiently solving some problems in bioinformatics. This thesis evaluates the applicability of the Cell processor in Phylogenetics and other computationally intensive problems. The evaluation is based on a survey of the novel Cell architecture, an investigation of how the Cell architecture affects software development, and a series of experiments. The experiments compare the performance of a Cell processor to that of a contemporary mainstream processor when executing the Neighbour-Joining method.

Contents

1 Introduction
  1.1 Background and Motivation
  1.2 Objectives
  1.3 Thesis Outline
2 Phylogenetics and Phylogenetic Inference
  2.1 Phylogenetic Trees
  2.2 Methods for Phylogenetic Inference
      Distance-matrix Methods
      Parsimony
      Statistical Methods
3 The Cell Processor
  3.1 History and Motivation
  3.2 Architecture Overview
  3.3 PowerPC Processor Element
  3.4 Synergistic Processor Element
      Synergistic Processor Unit and Local Store
      Memory Flow Controller
  3.5 Element Interconnect Bus
  3.6 Performance
  3.7 Discussion and Evaluation
4 Cell Software Development
  4.1 The Cell Software Development Kit
  4.2 Programming the Power Processor Element
  4.3 Programming the Synergistic Processor Element
      Compilers
  4.4 Programming Models
      Service Model
      Multistage Pipeline Model
      Parallel Stages Model
      Shared Memory Multiprocessor Model
  4.5 Software Development Tools
      Libraries
      The IBM Full-System Simulator
      Debugging Tools
      Integrated Development Environment
      Performance Tools

5 Experiments
  5.1 Implementations
      Parallelising Neighbour-Joining for the Cell Processor
      Scalar Implementations
      Loop Unrolling Implementations
      Partial SIMD Implementations
      Full SIMD Implementations
      Multi-threaded x86 Implementations
      RapidNJ Implementation
      ALF Implementation
  5.2 Hardware and Software Setup
  5.3 Data
  5.4 Results
      x86 Implementations
      Cell Implementations
      Cell vs x86 Implementations
  5.5 Discussion
6 Conclusion
References
A Rapid Neighbour-Joining

1 Introduction

Several areas in the scientific community depend on fast computers to deliver new results. This is especially true in bioinformatics, where a constant increase in computer power is necessary to keep up with the increasing amount of available data. Unfortunately, most traditional microprocessor architectures have reached their maximum performance potential, and if we want to keep increasing the performance of processors, new ideas are needed. The Cell Broadband Engine Architecture (CBEA) is a new and revolutionary microprocessor architecture, which has received quite a lot of attention because of its ability to deliver large amounts of computing power, compared to mainstream processor architectures, both in theory and in practice. CBEA could be the solution to the stagnant growth in microprocessor performance we are experiencing, and the first implementation of CBEA has already been applied successfully to some problems in bioinformatics [7, 54].

1.1 Background and Motivation

Traditional single-core microprocessor designs have been able to keep up with Moore's law¹ for almost 40 years, but these designs now face three obstacles: the power, memory and frequency walls (see section 3.1 for a description of the three walls). These obstacles are forcing microprocessor designers to invent new techniques for improving the performance of existing architectures. Multi-core processors are becoming the standard, and currently they are the best solution we have to the three performance walls. However, because today's mainstream multi-core CPUs are built on top of single-core architectures, they lack support for high-performance parallel programming, which makes full utilisation of more than one core a challenge. CBEA is a heterogeneous multi-core design that promises to overcome all three walls and deliver huge amounts of computing power. The first implementation of CBEA is the Cell BE (Cell) processor, which is used to power Sony's Playstation 3 (PS3) shown in Fig. 1.
Unlike most other game consoles, the PS3 can be used both for playing computer games and as an inexpensive source of computing power, of which the popular project Folding at home (FAH) [7] is a good example. FAH allows anyone owning a PC with an internet connection to participate in the search for new drugs by folding proteins. After the PS3 was introduced, a FAH client was developed for the PS3, which resulted in a massive increase in FAH's total computing power, even though PS3s only make up a small fraction of the computers connected to the grid [8]. My previous experience with bioinformatics combined with the success of FAH inspired me to explore the possibility of using the Cell processor for

¹ The number of transistors in processors doubles every two years.

Figure 1: Sony's Playstation 3 game console

solving problems in the area of bioinformatics. In addition, multi-core processors seem to be the future, and it would be interesting to learn more about the newest technology in this area. I have decided to focus on phylogenetic inference because some of the existing methods seemed to be well suited for the CBEA, and because I already had some experience with Phylogenetics.

1.2 Objectives

The main objective of this thesis is to evaluate the applicability of the Cell processor in distance-based phylogenetic inference and to give an assessment of the Cell's applicability in other areas that require high-performance processors. I also aim to give an introduction to the most essential elements of the Cell architecture, discuss how it compares to contemporary mainstream processors, and present a survey of tools and techniques for Cell software development. Additionally, a new search heuristic for the Neighbour-Joining method [47] (NJ), which was developed as a part of this thesis, is presented as a published article.

1.3 Thesis Outline

Section 2 presents the three main groups of phylogenetic inference methods with focus on distance-based methods like the NJ method. The subjoined article Rapid Neighbour-Joining, found in appendix A, should be read as a part of this section. Section 3 introduces the Cell architecture and contains a discussion of how it compares to contemporary mainstream processors. Section 4 introduces Cell software development and presents a survey of the most important tools and techniques for Cell software development. Section 5 presents different implementations of NJ using various optimisation techniques and results of experiments with these implementations on both a Cell and a modern x86 processor². The results are used to discuss

² A processor which supports the x86 instruction set.

and evaluate the Cell processor's applicability in distance-based phylogenetic inference and other areas.

2 Phylogenetics and Phylogenetic Inference

Through evolution, populations of the same species can develop into separate species, where some will become extinct while others continue to evolve. Since Charles Darwin first described this process in his book On the Origin of Species, biologists have tried to piece together the evolutionary history of all living organisms. This study of evolutionary relations between life forms is known as Phylogenetics. The evolutionary relationship between living organisms can rarely be determined exactly, so we try to infer the most likely relationship from the available data. This process is called phylogenetic inference, and it has traditionally been done by studying fossils and living organisms. Today genetic information has become available in large quantities, and by using various methods implemented as computer algorithms, we can infer large phylogenies quite accurately.

2.1 Phylogenetic Trees

Phylogenetic trees or evolutionary trees are a common tool for illustrating evolutionary relationships between species. Figure 2 shows an example of such a tree, where each leaf node corresponds to a living species and the internal nodes represent hypothetical ancestors. Edges in a phylogenetic tree can be annotated with lengths that represent the evolutionary distance (e.g. time) between two nodes. A tree can be either rooted or unrooted. Rooted trees have a root node, which represents the common ancestor of all taxa³ in the tree, whereas an unrooted tree only shows the relatedness of taxa. An unrooted tree can always be made from a rooted tree by omitting the root node, while a root cannot be directly inferred from an unrooted tree. Both rooted and unrooted trees can be either bifurcating or multifurcating.
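A rooted tree with edge lengths, as described above, can be represented compactly in software. The following is a minimal sketch (the taxon names and branch lengths are invented for illustration and are not from the thesis); it serialises a rooted bifurcating tree to the widely used Newick notation:

```python
# A rooted bifurcating tree as nested pairs: a node is (subtree, length),
# where subtree is either a taxon name (leaf) or a (left, right) pair.
def newick(node):
    """Serialise a (subtree, length) node to Newick notation."""
    subtree, length = node
    if isinstance(subtree, str):               # leaf: just the taxon name
        return f"{subtree}:{length}"
    left, right = subtree                      # internal node: two children
    return f"({newick(left)},{newick(right)}):{length}"

# Hypothetical example: B and C share an ancestor; A branches off earlier.
inner = ((("B", 0.5), ("C", 0.5)), 0.5)        # ancestor of B and C
root = ((("A", 1.0), inner), 0.0)              # root of the whole tree
print(newick(root))                            # (A:1.0,(B:0.5,C:0.5):0.5):0.0
```

Omitting the root node and its length gives the unrooted view of the same tree described above.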
Bifurcating trees are trees where nodes have a maximum degree of three; they can be used to visualise evolutionary models where species evolve from one parent into two children. In multifurcating trees nodes can have a degree of more than three, thus allowing species to split into more than two distinct species.

2.2 Methods for Phylogenetic Inference

New species arise because genetic material changes over the course of time, creating new attributes amongst individuals which distinguish them from

³ A group of organisms/species.

the main species. By comparing genetic sequences⁴ from different species, the relationship between these can be inferred with reasonable accuracy using one of several existing methods. The general idea behind these methods is to find a phylogenetic tree which provides the best explanation of the available genetic data, given a model of evolution. Finding the true tree is almost impossible, since the number of possible trees increases rapidly with the number of taxa. Given n taxa there exist Π_{i=3..n} (2i − 3) rooted and Π_{i=3..n} (2i − 5) unrooted bifurcating trees, and even more multifurcating trees, so using exhaustive search to infer phylogenies of interesting sizes is infeasible [36]. Instead, heuristics are used either to limit the search space or to approximate trees. Existing phylogenetic inference methods are divided into three major groups: distance-matrix methods, parsimony methods and statistical methods. This section gives a short introduction to parsimony and statistical methods, while distance-matrix methods will be presented in greater detail, as they are the main topic of this thesis.

Distance-matrix Methods

Distance-matrix methods use pairwise distances between taxa and some heuristic to construct a likely tree. The input for these methods is a distance-matrix D, where entry D(i, j) contains the distance between taxon i and j. Such a distance is a measure of the evolutionary distance between two taxa, and may vary depending on how it was inferred. One common way of constructing distance matrices is to use multiple sequence alignments and calculate the distance between two sequences by looking at how much they differ. For instance, if some part of two sequences in a multiple alignment does not match, perhaps because of natural mutations, a value is added to the distance of the two taxa, and if there is a match, nothing is added. Distance-matrix methods are widely used because of their simplicity, reasonable accuracy and efficiency.
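The mismatch-counting scheme just described can be sketched as follows (a plain, uncorrected p-distance; the thesis does not prescribe a particular scoring, so the equal per-site penalty is an assumption for illustration):

```python
def p_distance(a, b):
    """Fraction of mismatching positions between two aligned sequences."""
    assert len(a) == len(b), "sequences must come from the same alignment"
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(seqs):
    """Build D with D[i][j] holding the distance between taxon i and j."""
    n = len(seqs)
    return [[p_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]

# Four short aligned sequences (the same four used in Fig. 2 below).
D = distance_matrix(["AAG", "AAA", "GGA", "AGA"])
```

Real pipelines typically apply a model-based correction to these raw mismatch fractions; only the raw count is shown here.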
They do not directly incorporate an evolutionary model, but rely on the method used to construct the distance-matrix to define such a model. Most distance-matrix methods search distance-matrices for a pair of taxa which optimise an objective function. These two taxa are considered closely related and are inserted into the resulting tree as children of the same hypothetical parent. Two examples of distance-matrix methods are the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [52] and the Neighbour-Joining method, which are presented in the following.

UPGMA: UPGMA is a very simple method, which can reconstruct ultrametric trees given their distance-matrix.

⁴ A sequence of letters representing some part of a DNA strand.

Definition 1. (Ultrametric) Let d(x, y) be a distance metric. Then an ultrametric tree T is a tree which satisfies the following condition: for all leaves x, y, z ∈ T, d(x, z) ≤ max{d(x, y), d(y, z)}.

In ultrametric trees, every ancestor is equally distant from all its descendants. The method assumes a constant rate of evolution, and if given a distance-matrix which is not ultrametric and/or does not have a constant rate of evolution, it will often construct a tree with an inaccurate topology. Because UPGMA only works well under certain conditions, the method is not widely used. The algorithm by which UPGMA constructs trees is a simple greedy clustering algorithm. Initially n clusters, each representing a taxon, are created. In each iteration two clusters i and j, which minimise

    min_{i,j} d(i, j),    (1)

are selected and joined into a single cluster. A join corresponds to constructing a hypothetical parent node with the two taxa as children in the resulting tree. Distances from the new cluster to every other cluster are calculated and inserted into the distance-matrix. After n − 1 iterations only one cluster remains and a rooted bifurcating tree has been constructed. UPGMA can be implemented to run in O(n^2) time by using a quadtree data structure [28].

The Neighbour-Joining method: The NJ method is based on the minimum evolution criterion, i.e. it aims to minimise the total branch length of the resulting tree. Given an additive distance-matrix, NJ can construct the true tree, and given a non-additive matrix the method will often construct trees close to the actual minimum tree.

Definition 2. (Additive) A distance-matrix D is additive if there exists a tree where, for all pairs of leaves i, j ∈ T, the sum of edge lengths on the path connecting the two is equal to D(i, j).

NJ is very similar to UPGMA and the only difference lies in the optimisation criteria. In addition to the proximity of two clusters, NJ also takes into account the distance to other clusters.
Let r be the number of remaining clusters in any given iteration of NJ. Then the optimisation criterion of NJ is defined as follows:

    min_{i,j} Q(i, j) = D(i, j) − u(i) − u(j),    (2)

where

    u(l) = Σ_{k=0}^{r−1} D(l, k) / (r − 2).    (3)
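Equations (2) and (3) can be sketched directly (an illustrative sketch over a list-of-lists matrix, not the thesis's implementation; the example matrix and tree are invented). On the additive matrix below, built from an unrooted tree where taxa 0 and 1 are neighbours with branch lengths 1 and 6, taxa 2 and 3 are neighbours with lengths 1 and 4, and the internal edge has length 1, the smallest entry D(0, 2) = 3 pairs two non-neighbours, while the Q criterion correctly selects 0 and 1:

```python
def nj_pair(D):
    """Return the pair (i, j) minimising Q(i, j) = D(i, j) - u(i) - u(j),
    with u(l) = sum_k D(l, k) / (r - 2), as in equations (2) and (3)."""
    r = len(D)
    u = [sum(row) / (r - 2) for row in D]      # equation (3)
    return min(((i, j) for i in range(r) for j in range(i + 1, r)),
               key=lambda p: D[p[0]][p[1]] - u[p[0]] - u[p[1]])

# Additive distance-matrix for the tree described above.
D = [[0, 7, 3, 6],
     [7, 0, 8, 11],
     [3, 8, 0, 5],
     [6, 11, 5, 0]]
print(nj_pair(D))   # (0, 1)
```

Joining on the plain minimum distance, as UPGMA's criterion (1) does, would select taxa 0 and 2 here, which illustrates why NJ subtracts the u-terms.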

Q is a matrix with the same dimensions as D; it is defined here for later reference and does not need to be represented explicitly. Clusters minimising the optimisation criterion are close to each other and distant from all other clusters. The result of NJ is an unrooted bifurcating tree. The NJ method uses O(n^3) time [53], which is low enough to make construction of large trees feasible. Because of its popularity, several papers have been published [27, 32, 43, 49] which present improved search heuristics and approximation algorithms for NJ. As a by-product of my work with this thesis, I have developed a new and very efficient search heuristic for NJ, called RapidNJ, in collaboration with T. Mailund and my supervisor C. N. S. Pedersen. It is described in a subjoined article, which has been accepted at WABI 2008 and can be found in appendix A.

Parsimony

Maximum parsimony or parsimony is a group of character-based phylogenetic inference methods. They aim at constructing trees with a minimum number of evolutionary changes, which is referred to as the most parsimonious tree. Parsimony methods operate on character sequences representing taxa. A distance between each pair of taxa is calculated using a scoring system, where the number of characters which differ in the two sequences is multiplied with some scalar. Given n taxa, parsimony methods construct trees with n leaves, each representing a taxon, and connect the leaves with a number of parent nodes representing hypothetical ancestors. Each node has a state in the form of a character sequence, and edge lengths are the distances between the two connected nodes. Some parsimony methods do not calculate edge lengths and/or the states of parent nodes explicitly, but both can usually be reconstructed quite easily from the resulting tree. An example of a parsimony tree, where the internal nodes are annotated with their states and the edges are annotated with their lengths, is shown in Fig. 2.
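The number of changes on a fixed topology can be counted with Fitch's small-parsimony algorithm. Fitch's algorithm is not discussed in the thesis; it is sketched here only to illustrate how a tree's parsimony score is computed, and the grouping ((AAG, AAA), (GGA, AGA)) is read off Fig. 2:

```python
def fitch(node):
    """Return (per-site state sets, substitution count) for a subtree.
    A node is either a sequence string (leaf) or a (left, right) pair."""
    if isinstance(node, str):
        return [{c} for c in node], 0
    (ls, lc), (rs, rc) = fitch(node[0]), fitch(node[1])
    sets, cost = [], lc + rc
    for a, b in zip(ls, rs):
        if a & b:
            sets.append(a & b)   # children agree: keep the intersection
        else:
            sets.append(a | b)   # conflict: union, count one substitution
            cost += 1
    return sets, cost

tree = (("AAG", "AAA"), ("GGA", "AGA"))   # topology read from Fig. 2
_, score = fitch(tree)
print(score)   # 3
```

Searching for the best topology repeats this scoring over many candidate trees, which is what makes parsimony search expensive.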
Construction of a parsimony tree can be done by searching through all possible tree topologies, but as explained in section 2.1, this is infeasible since there are too many. Instead, parsimony methods use heuristics to search for the best tree, just like distance-matrix methods. An example of such a heuristic is Nearest Neighbour Interchange [33], where an arbitrary tree is used to bootstrap the process, and by iteratively interchanging subtrees, the parsimony of the tree is improved. Branch and bound methods are also often used to speed up parsimony algorithms, and some parsimony methods use heuristics to construct an approximation of the most parsimonious tree. Parsimony methods can produce trees close to the true tree. They are also very efficient and can be used to construct large trees in reasonable time. Unlike distance-based methods, parsimony methods incorporate a model of evolution, which is used to calculate the parsimony score of a tree. A good model is important for construction of correct trees, and it is not trivial to

construct such a model. The downside of parsimony methods is statistical inconsistency⁵, which makes them unusable for inferring most kinds of trees [34]. However, in some cases parsimony methods can actually outperform more advanced statistical methods, e.g. when inferring trees in the Farris zone [35].

Figure 2: A phylogenetic tree created by maximum parsimony from the four sequences AAG, AAA, GGA and AGA

Statistical Methods

Statistical methods are computationally intensive and used to be considered infeasible for phylogenetic inference. New methods and advances in computer hardware have changed this, and today statistical methods are very popular in Phylogenetics. Maximum Likelihood (ML) is a well-known statistical method, which can be applied to a wide selection of problems. It is still being used successfully in Phylogenetics, and will serve as an example of a statistical method here. The general idea of ML methods is to optimise the parameters of some model M, thereby maximising the likelihood L of observing the available data D given M, where L = P(D | M). In phylogenetic inference the data D could be character sequences and the model M a tree. The process of maximising the likelihood is twofold. First, given a tree topology, the tree must be annotated with edge lengths which give the most probable explanation of D. Secondly, the tree topology is improved to maximise the likelihood. Given a good model and enough data, ML methods can produce trees which are more accurate than those constructed by both distance-matrix and parsimony methods. However, there are still many computational challenges in using ML methods, and they are only applicable to phylogenies with a few hundred taxa.

⁵ Given enough data they do not always converge to the true tree.

Bayesian inference methods have recently become very popular and are currently subject to extensive research. Used in combination with Markov Chain Monte Carlo methods, they have been shown to produce some of the most accurate trees while still being able to construct trees of interesting sizes [37, 55]. Compared to both distance and parsimony methods, statistical methods are slow and complex, but most researchers resort to statistical methods when accurate trees are needed.

3 The Cell Processor

This section introduces the Cell Broadband Engine Architecture as it is implemented in the Cell processor of Sony's PS3. It contains a discussion of the motivations for developing the Cell, a survey of CBEA, a discussion of how CBEA compares to the x86 architecture and an evaluation of the design. All information regarding the Cell architecture was found in the following sources: [2, 20, 22, 24-26, 41, 42, 50].

3.1 History and Motivation

Development of the Cell started in 2000, when Sony, Toshiba and IBM (STI) formed an alliance to design and produce the Cell processor: Sony as a content provider, IBM as a leading technology and server company, and Toshiba as development and high-volume manufacturing partner. Sony's motivation for participating in the project was the need for a processor to power a successor to the successful but aging Playstation 2 game console. This meant focus on multimedia performance, which requires massive amounts of raw computational power and responsiveness. IBM and Toshiba both have product lines which could use a processor with these properties, so instead of designing a processor only for game consoles, the vision was to make a flexible and platform-independent processor. After some negotiations during the summer of 2000, four objectives were established for the design of CBEA [42]:

- Outstanding performance, especially on game/multimedia applications.
- Real-time responsiveness to the user and the network.
- Applicability to a wide range of platforms.
- Support for introduction in 2005.

The performance objective was a difficult but very important challenge, because many argue that traditional processor designs, like the x86 architecture, have reached their maximum performance potential. Three major limiters have been identified for contemporary processors [10, 24, 38, 42, 51].

The power wall: Until recently, increasing the performance of processors meant increasing the clock frequency. Higher frequencies require more power, which generates heat that needs to be dissipated, or the processor will simply melt. Power usage can be lowered by using smaller components in the CPU, which require less voltage, but this technique has now come very close to some physical limits. Components in modern processors are close to reaching the size of atoms, which is considered the limit for how small things can be made. Because components have become so small, they cannot contain the power running through them, causing power to leak, which generates additional heat. Cooling techniques have also reached some practical limits, and without reducing power consumption, processors cannot be clocked higher without melting.

The memory wall: The latency of main memory has been a bottleneck for some time, because memory technology has not been able to keep up with advances in processor technology. If data cannot be fed to processors fast enough, the processors are forced to idle while waiting for data. Fast local caches⁶ and better branch prediction⁷ have been able to reduce this problem, but these techniques have reached their limits. Memory which can match the speed of modern processors is currently not available, and with memory latency approaching 1000 clock cycles, many CPU clock cycles are often wasted on waiting for data.

The frequency wall: Increasing the clock frequency of a processor will only result in a performance gain if enough instructions can be supplied to the processor. Longer pipelines⁸ are needed to keep processors fully utilised, but we have reached the point of diminishing returns from longer pipelines. The performance gained from processing multiple instructions concurrently is now less than the performance lost because of increased instruction latencies. Instruction latencies are caused by e.g.
branch mispredictions and data dependencies, which stall⁹ pipelines. Shorter pipelines and simple linear execution of instructions allow higher frequencies, but do not necessarily give better performance.

Most microprocessor manufacturers have abandoned traditional single-core architectures because of the power and frequency walls. Multi-core processors are now seen as the future, and in theory they allow us to follow Moore's law for some time. However, using two cores will rarely entail a factor-two performance increase in practice, because full utilisation of more than one core requires software to be parallelised, which is often difficult or impossible. Even if an algorithm can be parallelised, it often suffers from a high overhead, because most multi-core processors only provide a few inefficient tools for controlling parallel execution. Developers have little or no control over how multiple threads are executed, and communication between two cores can only be done through shared memory with high latency. Furthermore, the main memory bandwidth is shared by all cores in most architectures, making the memory wall a limiting factor on the number of cores which can be fully utilised. To reach the performance goal, CBEA needed to push back all three walls. Dedicated processors like GPUs¹⁰ have successfully been doing so for some time, but at the cost of flexibility. STI wanted the Cell to be applicable as a general-purpose processor, so they had to develop a design which combined power and flexibility. This required a new direction in processor design, which after 4 years of development was presented as the first generation Cell processor in 2005.

⁶ Fast memory in proximity to the CPU, which provides fast access to prefetched data.
⁷ Hardware in the CPU which guesses the outcome of conditional branches, thus allowing speculative instruction execution.
⁸ Hardware holding a sequence of instructions, which are processed concurrently like cars on an assembly line.
⁹ The pipeline has to wait for some instructions to complete execution before it can proceed.

3.2 Architecture Overview

At first glance, the Cell is very different from the processors which dominate the commercial market. CBEA is a heterogeneous architecture that creates a bridge between the power of dedicated processors like GPUs and the flexibility of general-purpose processors. Sacrifices have been made, but the theoretical single-precision floating-point (SPFP) performance of over 200 GFLOPS¹¹ has overshadowed most shortcomings. An overview of the Cell processor is shown in Fig. 3. The Cell has nine interconnected processors.
Eight of these are specialised processors called Synergistic Processor Elements (SPEs), which are optimised for computationally intensive tasks but can also be used as general-purpose processors. In the PS3, only 6 of the 8 SPEs can be used, because one is deactivated and one is reserved by Sony. Each SPE has a core called the Synergistic Processor Unit (SPU), a small local memory called the Local Store (LS) and a Memory Flow Controller (MFC). The last processor is the PowerPC Processor Element (PPE), which is a 64-bit processor based on IBM's PowerPC Architecture. The PPE is the main processor and is responsible for running an operating system and controlling the SPEs. As main memory the Cell supports up to 64 GB of XDR DRAM¹², which is accessed through the Memory Interface Controller (MIC). The Cell Broadband Interface (BEI) provides access to two I/O interfaces which connect the Cell with the outside world. All elements sit on the same chip and are connected through the Element Interconnect Bus (EIB), which consists of four 16-byte wide data rings.

¹⁰ Graphics Processing Unit.
¹¹ Billion floating point operations per second.
¹² Extreme data rate dynamic random access memory.

Figure 3: Cell architecture.

3.3 PowerPC Processor Element

The PPE is a general-purpose processor which conforms to the PowerPC architecture and is therefore compatible with programs written for other PowerPC processors. Compared to other modern processors the PPE is quite simple. It has a short pipeline, and whereas most modern processors are three- or four-issue designs¹³ with out-of-order execution¹⁴, the PPE is only a dual-issue in-order processor. The conventional two-level cache structure, consisting of 32KB L1 and 512KB L2 cache, is also unimpressive compared to other modern processors, where the L2 cache can reach sizes of over 4MB. However, the simple design enables the PPE to run at high clock frequencies, and thereby still achieve a reasonable performance. The PPE does include some advanced hardware features like a Vector/SIMD¹⁵ extension unit (VXU), which is capable of executing Vector/SIMD Multimedia Extension instructions. Also, some performance-critical hardware, like registers,

¹³ Can execute three or four instructions simultaneously.
¹⁴ Instructions are not necessarily executed in their original order; helps avoid stalls.
¹⁵ Single Instruction Multiple Data. See section 5.1 for details on SIMD instructions.

are duplicated in the PPE, allowing it to execute two threads concurrently¹⁶, like Intel's hyper-threading enabled processors. The PPE is designed for control-intensive tasks and relies on the SPEs to perform computationally intensive tasks. The simple design makes it power efficient, and a high clock frequency gives it an acceptable performance by modern standards. The main responsibilities of the PPE are to run an operating system, manage system resources and control the SPEs, but it can also perform computationally intensive tasks by utilising the VXU.

3.4 Synergistic Processor Element

An SPE is a simple general-purpose vector processor with a theoretical 25.6 GFLOPS SPFP peak performance. It can execute programs independently of the PPE, but relies on the PPE to initiate the execution. SPEs are designed to execute computationally intensive code while the PPE acts as a coordinator, and they can be considered the PPE's co-processors. Figure 4 shows the three units which an SPE consists of.

Figure 4: SPE components

Synergistic Processor Unit and Local Store

The SPU is the simplistic core of an SPE. Hardware like branch prediction and caches has been sacrificed to make the SPU as simple and fast as possible. It operates on a single 128-bit register file with 128 entries, which enables the SPU to process large amounts of SIMD instructions, the only type of instruction that can be executed in an SPU. Each SPU has access to a fast 256KB LS, while an MFC provides access to main memory. SPUs can only execute instructions located in the LS, so both

¹⁶ Actually only one thread is executed, but the PPE can quickly shift to the other thread during stalls caused by, for example, cache misses.

instructions and data need to be loaded into the LS before execution of a program can begin. Due to its simple design, the SPU only supports 16-byte load/store operations, which are 16-byte aligned in the LS. Unaligned data has to be loaded/stored using special operations, which slows down the SPU. Alignment of data in the LS has to be handled by either compilers or software developers to avoid a performance penalty. The size of problems which can be handled by an SPE is limited by the space available in the LS, and large problems have to be divided into smaller pieces. Communication between the SPU, LS and MFC is done through 32-bit channels, and through the MFC, SPUs can also communicate with the PPE and other SPEs.

Memory Flow Controller

The MFC serves as an interface between the SPU and the rest of the Cell processor. It connects the SPE to the EIB and can communicate with other elements connected to the EIB, like the main memory, other SPEs and the PPE. The most important function of the MFC is to transfer data from the main memory to the connected LS. This is done through DMA transfers issued by a program running on either the PPE or an SPE. The address space of the main storage is called the Effective Address (EA) space, and because the PPE provides MFCs with virtual memory address-translation information from the operating system running on the PPE, SPEs can perform DMA transfers using EA pointers. When an SPE is initialised, the PPE creates an SPE context¹⁷ in the EA space and passes a pointer to the SPE. This pointer is used to DMA the SPE context into the LS, where it can be accessed by the SPE. Subsequent DMA transfers can now be initiated with EA pointers provided in the SPE context. A DMA transfer is initiated by writing an LS address, an EA and the size of the transfer to an MFC, which then takes over and completes the transfer autonomously.
The maximum size of a transfer is 16KB, and up to 16 DMA transfers can be enqueued in the MFC; by using a DMA list (a list of DMA commands in the LS), this is increased to 2048 DMA transfers. A DMA transfer command transfers data between the EA and LS address spaces. The block of data transferred must be of size 1, 2, 4, 8, 16 or a multiple of 16 bytes and obey the following memory alignment constraints:

- Source and destination addresses must have the same 4 least significant bits.
- For transfer sizes less than 16 bytes, addresses must be naturally aligned, i.e. the starting local store and main memory addresses must be divisible by the size of the transfer.

- For transfers of size 16 bytes or greater, addresses must be aligned to at least a 16-byte boundary.

Peak performance is achieved when the size of a DMA transfer is a multiple of 128 bytes and the data is 128-byte aligned. Each MFC has memory-mapped I/O (MMIO) registers, which allow both the PPE and other SPEs to issue DMA commands to any MFC. Access to the LS address space of a remote SPE is provided by memory-mapping the LS to the EA-space. The memory mapping of an LS is not cache coherent, so care must be taken if the PPE is used to read a memory-mapped LS. Using MMIO registers, DMA transfers can be initiated from the PPE, but only eight DMA commands can be enqueued from the PPE, and DMA lists are not available. SPE-to-SPE DMA transfers are also possible and very efficient, as data only passes through the EIB, which is up to 10 times faster than the MIC. SPE-to-SPE communication can be used to combat the memory wall, since no load is put on the main memory, as opposed to shared memory communication. Because MFCs can process DMA commands autonomously, the initiator of a DMA transfer can continue to work on other data while the transfer completes. This is a very important feature which helps maximise utilisation of bandwidths and minimise wasted clock cycles. More information on this technique can be found in section 4.3. In addition to DMA transfers, the MFC also provides two inter-process communication methods which enable lightweight communication between SPEs and the PPE. Signal notification allows the PPE or an SPE to send 32-bit messages (signals) to an SPE. Signals are designed to notify processes running on SPEs and cannot be sent to the PPE. Each MFC has two identical signal registers which can receive signals in two modes. OR-mode writes the message into the register using a logical OR operation, thus allowing messages to accumulate in the register. Overwrite-mode overwrites any message in the register.
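The alignment and size rules above can be condensed into a small validity check. The following sketch simply encodes the constraints listed in this section; the function name and interface are invented for illustration and are not part of the Cell SDK:

```c
#include <stdint.h>

/* Hypothetical validity check that encodes the DMA constraints
 * described above.  The name and interface are invented for
 * illustration; this is not part of the Cell SDK. */
static int dma_args_ok(uint64_t ea, uint32_t ls_addr, uint32_t size)
{
    /* Allowed sizes: 1, 2, 4, 8 or 16 bytes, or a multiple of 16 bytes
       up to the 16KB per-transfer maximum. */
    int size_ok = (size == 1 || size == 2 || size == 4 || size == 8 ||
                   (size >= 16 && size % 16 == 0 && size <= 16 * 1024));
    if (!size_ok)
        return 0;

    /* Source and destination must share the same 4 least significant bits. */
    if ((ea & 0xF) != (ls_addr & 0xF))
        return 0;

    if (size < 16)                      /* natural alignment */
        return (ea % size) == 0;
    return (ea & 0xF) == 0;             /* at least 16-byte aligned */
}
```

Since both addresses must share their 4 least significant bits, checking natural alignment on the EA alone also covers the LS address for the sub-16-byte sizes.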
Mailboxes allow an SPE to send 32-bit messages to both the PPE and other SPEs. Each MFC provides two outbound mailboxes and one inbound mailbox. Outbound mailboxes are used for sending messages to either the PPE or another SPE and can only contain one message. The inbound mailbox is used for receiving messages and can contain up to 4 messages in a FIFO queue. Writing to a full inbound mailbox results in the newest message being overwritten. Reading and writing to mailboxes can be either blocking or non-blocking. An SPU will block when reading an empty inbound mailbox or writing to a full

outbound mailbox, while the PPE will never block. The status of a mailbox can be checked before reading from or writing to it. Signal notification and mailboxes allow complicated communication patterns to be implemented. They are easy to use and much more efficient than shared memory communication.

3.5 Element Interconnect Bus

Figure 5: The topology of the EIB

The EIB is a data bus connecting all elements in the Cell processor. The topology of the EIB and all connected elements is illustrated in Fig. 5. It consists of four 16-byte wide data rings, where two run clockwise and two counter-clockwise. All connected elements have an on and off ramp, which allows data to be sent and received at the same time. A data arbiter controls the data flow on the EIB and is responsible for granting access to the EIB and making sure that data always takes the shortest route around the EIB. As seen in Fig. 5, the EIB is divided into segments with one element in each segment. The data arbiter allows multiple data transfers on the same ring as long as the segments through which the data travels do not overlap. This design gives the EIB a huge bandwidth if communication between the connected elements can be localised to the nearest neighbour.

3.6 Performance

With a clock speed of 3.2 GHz, the Cell processor in a PS3 can deliver over 200 GFLOPS in theory, but since only six SPEs can be used, it only delivers a bit more than 150 GFLOPS. In comparison, modern mainstream CPUs deliver under 40 GFLOPS [16], while GPUs can reach over 500 GFLOPS [6]. However, the theoretical FLOPS figure is not always a good estimate of performance in practice. Peak performance can only be reached if the processor is able to perform a calculation in every clock cycle. In most practical applications, many clock cycles are wasted on e.g. memory latencies and inter-core communication, so the ability to utilise clock cycles should also be taken into account when evaluating the performance of a processor. Due to asynchronous DMA transfers and fast inter-core communication, the Cell processor can achieve a very high utilisation of clock cycles on all nine processors. Experimental results show that a peak performance of over 100 GFLOPS [9, 26] is possible in practice, and an impressive peak performance of over 200 GFLOPS has been reported by a team in Germany who used a Cell processor to do matrix arithmetic [39]. Full utilisation of the Cell cannot be expected in general, but these results show that the Cell can be applied to real problems and achieve outstanding performance. The FLOPS performance is usually measured using only SPFP operations, which are important for some applications like multimedia, but scientific applications often require double precision floating point (DPFP), and here the Cell is not so strong. Because the Cell was designed with focus on multimedia applications, double precision performance was not a priority, and the SPEs can only achieve a peak performance of 14 GFLOPS in total and the PPE 7 GFLOPS using DPFP calculations. The SPEs are particularly bad at executing DPFP instructions, because unlike SPFP instructions these cannot be dual issued and stall the pipeline for several cycles. The PPE can handle DPFP instructions efficiently, but it does not support DPFP SIMD instructions.
The PS3 uses state-of-the-art XDR DRAM to reduce memory latencies. With a bandwidth of 25.6 GB/s, XDR RAM is several times faster than the DDR2 RAM most modern PCs are equipped with. DDR2 RAM modules have a peak bandwidth of less than 7 GB/s, but the next generation, DDR3 RAM, is now available, and with twice the bandwidth of DDR2 RAM it will allow better utilisation of multi-core processors. XDR RAM is very fast but also quite expensive, and with only 256 MB RAM the PS3 will be unable to solve some problems. Even the impressive bandwidth of XDR DRAM is not always enough to supply all nine processors in the Cell with data. The data flow to and from main memory needs to be kept at a minimum, and it is here the novel memory structure of the Cell comes into play. Traditionally, automated cache structures, like the one found in the PPE, have been used to reduce the need for main memory access. The hardware which controls such cache structures is power consuming and is not always very good at guessing which data is needed next. The LS of an SPE can be regarded as a manually

controlled cache, and with a bandwidth of 51.2 GB/s between the SPU and LS, the LS can match the speed of most conventional caches. Because the LS is controlled directly through software, developers can create data transfer schemes which always fetch the correct data and thereby minimise the data flow between main memory and SPEs. With its large peak bandwidth, the EIB provides very fast SPE-to-SPE transfers that can offload the main memory. It is, however, not an easy task to construct algorithms that utilise SPE-to-SPE communication, and because the topology of the SPEs has to be taken into account, it becomes even harder. The peak performance of SPE-to-SPE communication is only attained if each SPE is restricted to communicating with neighbouring SPEs on the EIB. Table 1 shows how much of the EIB bandwidth can be utilised with different communication patterns between SPEs. Use Fig. 5 as reference for the topology.

Table 1: Sustained EIB bandwidth achieved for some SPE-to-SPE DMA transfers

  SPE1-SPE3, SPE5-SPE7, SPE0-SPE2, SPE4-SPE6   186 GB/s
  SPE0-SPE4, SPE1-SPE5, SPE2-SPE6, SPE3-SPE7   197 GB/s
  SPE0-SPE1, SPE2-SPE3, SPE4-SPE5, SPE6-SPE7   197 GB/s
  SPE0-SPE3, SPE1-SPE2, SPE4-SPE7, SPE5-SPE6   197 GB/s
  SPE0-SPE7, SPE1-SPE6, SPE2-SPE5, SPE3-SPE4    78 GB/s
  SPE0-SPE5, SPE1-SPE4, SPE2-SPE7, SPE3-SPE6    95 GB/s
  SPE0-SPE6, SPE1-SPE7, SPE2-SPE4, SPE3-SPE5   197 GB/s

Power usage is also an important factor for today's high performance processors, both because electricity is expensive and because of the heat generated by high power consumption. With an estimated power consumption of 100W [3], the Cell has an impressive FLOPS/Watt performance, and new versions of the Cell will be made even more power efficient. In comparison, most mainstream processor models from Intel consume only around 65W, while their top models consume up to 150W [5].
The FLOPS/Watt performance makes the Cell attractive for the high performance server market, where cooling of the hardware is a problem, and IBM intends to prove the Cell's worth by using it in the world's first petaflop (10^6 GFLOPS) computer [15].

3.7 Discussion and Evaluation

So has the STI alliance accomplished their goals and overcome the three performance walls? First, let's take a look at how well the initial four design goals have been met (see section 3.1). The performance figures are impressive, and the Cell has many novel features that help utilise the power.

Responsiveness can be achieved by dedicating one or more SPEs to handle user input. The CBEA seems flexible and is not bound to any specific platform. Development was completed on schedule. So the goals seem to have been met, but what about the practical applications of the Cell? The CBEA calls for massive parallelisation of algorithms. The complexity of algorithms usually increases when they are parallelised, and the cost of developing such algorithms does not always match the benefits. If we look at commercial games developed for the PS3 since its introduction in 2006, the graphics are roughly identical to those developed for the competing Xbox 360 [17], which uses more conventional hardware. The PS3 has about twice as much CPU power as the Xbox, and one would expect games for the PS3 to be much more graphically advanced. But a combination of high-end standard processor technology and a good GPU gives the Xbox roughly the same performance as a PS3 in most cases [44]. Successful scientific applications of the Cell processor were initially scarce, but after 3 years on the market, it seems to have gained a strong foothold in the scientific community. Numerous problems which can utilise the CBEA have been found [31, 54], but because utilisation of the eight SPEs requires that problems are highly parallelisable, many problems cannot benefit from the performance of the Cell processor. IBM is using the Cell's ability to combat the three walls as a marketing punch line, but how well does the Cell actually scale the three walls? It does push the boundaries of what is possible with a single processor, but it seems the three walls are still an obstacle. Here is a summary of how the Cell scales the three walls as I see it.

The power wall: By using two types of processors, optimised for either control intensive code or computationally intensive code, it has been possible to simplify the design and thereby reduce power consumption.
Though the Cell delivers an impressive GFLOPS/Watt ratio, it still faces the same power related problems as any other processor. The power wall has been pushed back but not overcome.

The memory wall: The usage of XDR RAM provides fast access to data, but XDR RAM is only another memory technology, which any other processor could adopt. Newer versions of IBM's Cell blade servers use DDR2 RAM just like most other x86 systems. DDR2 RAM is cheaper and makes it possible for the Cell to have a decent amount of RAM and still be affordable. The manually controlled LS, combined with asynchronous DMA transfers, allows developers to create very efficient

data transfer schemes where memory latencies are hidden. However, data still needs to be loaded and stored in main memory, and if all nine processors need to access the main memory simultaneously, the Cell can still run out of memory bandwidth; but for all practical purposes, it seems the Cell scales quite well to the memory wall.

The frequency wall: By using simple pipelines and specialised processors, the Cell can be clocked to high frequencies. Especially the SPEs are well suited for high clock frequencies, as the large register file helps keep up a steady stream of SIMD instructions to the SPU. Asynchronous DMA transfers and a large bandwidth between the SPU and LS also help minimise latencies in SPEs. Information on the maximum clock frequency of the Cell is not available, but a 6GHz version has been tested with positive results [30], which means that the Cell can achieve even higher performance by increasing the clock frequency. However, some problems have to be solved before the clock frequency can be increased in commercial products [30], which means the Cell has not overcome the frequency wall yet.

Overall, STI has managed to create a novel architecture that has the potential to outperform contemporary mainstream processors, but it has not been the revolution many people expected. Software development seems to be going slowly, and there are still only a few examples where the Cell completely outperforms all other mainstream processors. The CBEA has pushed back all three performance walls but not removed them, and it seems some additional development is required before the CBEA will replace any existing architectures.

4 Cell Software Development

The CBEA's heterogeneous design requires special programming techniques to utilise the full potential of the Cell processor. Compilers and libraries have been developed to provide a somewhat standard programming environment, but Cell software development is still a complicated task.
Especially the SPEs are problematic, because they rely on software to manage data transfers, have to be manually load balanced, and require low-level optimisations to perform well. This section is an introduction to software development on the Cell processor. It contains an overview of the Cell Software Development Kit (SDK) [4] and introduces several common techniques which can be used to utilise the heterogeneous architecture of the Cell processor. A survey of selected software development tools and libraries for the Cell processor provided in the SDK is also included in this section. The main purpose of the tools and techniques described here is to help developers utilise the SPEs, and some of the tools have been used to develop the

software described in section 5. Information regarding the basics of Cell software development was found in [20, 22, 24].

4.1 The Cell Software Development Kit

The current SDK is version 3.0 and has been developed by IBM. It supports both Fedora 7 and Red Hat Enterprise Linux 5.1, and is publicly available from IBM's website. The SDK contains all the tools needed to develop fully functional software for the Cell processor, such as compilers, libraries for C/C++, code examples, optimisation tools and an IDE (Integrated Development Environment) plug-in. C and C++ are the two main languages supported by the SDK. Fortran and ADA are also supported, but will not be discussed in this thesis. Some of the central libraries in the SDK contain bugs that cause problems in C++. There are ways to circumvent these bugs, but to avoid any complications, it's preferable to use C, especially when developing code for the SPEs. All code developed in this thesis was written in C++, which gave rise to problems with both libraries and development tools.

4.2 Programming the Power Processor Element

From a software developer's point of view, the PPE doesn't differ significantly from other mainstream general purpose processors. Because the PPE's instruction set conforms with the PowerPC Architecture instruction set [20], it's able to execute existing PowerPC code. However, using the VXU and controlling SPEs does require the use of some special features provided by the Cell SDK. Even though the SPEs are better suited for SIMD operations than the PPE, it can sometimes be advantageous to use the PPE's VXU for SIMD calculations. The VXU is accessed through a separate instruction set, which can be interleaved with the PowerPC instruction set. The instructions follow Apple's AltiVec standard and are comparable to Intel's SSE standard, but DPFP and 64-bit integer SIMD operations are not supported.
C/C++ intrinsics for the VXU instruction set are provided in the SDK and give easy access to both SIMD calculations and vector data types by extending the C/C++ language. (An intrinsic maps an alias to one or more assembly-language instructions and is usually defined in a header file.) Some functionality provided by the VXU intrinsics, like simple arithmetic operations on vector data types, can also be accessed via normal C/C++ syntax. However, this is a feature provided by compilers and might not always be available.

Controlling SPEs from the PPE is done through a library called libspe2, which implements all the tools necessary for activating and controlling the SPEs. The PPE communicates with an SPE through an SPE-context, which can be seen as a logical representation of an SPE. Each SPE-context contains a thread (a POSIX thread if the operating system is Linux), which allows the operating system to schedule the context on an available SPE. It's advisable not to create more SPE-contexts than the number of SPEs available on the system, because context switches on SPEs are very expensive. SPE-contexts can be grouped together by placing them in a gang. SPE-contexts in the same gang will be scheduled on SPEs physically close to each other on the Cell chip, thus reducing the communication distance on the EIB; this can be used by developers to design efficient SPE-to-SPE communication patterns. Apart from controlling SPEs, the PPE is also responsible for allocating main memory which is to be shared with SPEs. The alignment constraints described in the Memory Flow Controller subsection must be followed, or the program will simply crash when an SPE tries to transfer a block of unaligned data. The constraints can easily be satisfied by aligning all shared data structures to a 128-byte boundary. Though this is not the most memory efficient solution, it maximises the DMA transfer speed and minimises the need for error prone address calculations. The SDK contains a simple function, called malloc_align, which can dynamically allocate blocks of aligned memory. Static memory allocations are aligned using the aligned attribute compiler directive.
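As a rough illustration of what such an aligned allocator does, here is a portable sketch built on POSIX posix_memalign. The function name alloc_dma_friendly is invented, the SDK's own malloc_align has a different interface, and rounding the size up to a multiple of 128 bytes is an extra convenience, not SDK behaviour:

```c
#include <stdlib.h>

/* Invented stand-in for an aligned allocator: returns a block aligned
 * to a 128-byte boundary, matching the DMA-friendly alignment
 * recommended above.  Not the SDK's malloc_align. */
static void *alloc_dma_friendly(size_t bytes)
{
    void *p = NULL;
    size_t padded = (bytes + 127) & ~(size_t)127;  /* keep whole 128-byte
                                                      transfers in bounds */
    if (posix_memalign(&p, 128, padded) != 0)
        return NULL;
    return p;
}
```

A statically allocated buffer would instead use the aligned attribute, e.g. static float buf[1024] __attribute__((aligned(128))).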
These instructions facilitate compilation of scalar code, but the result of such a compilation is not very efficient. To achieve maximum performance, developers have to hand-write performance critical parts of algorithms using real SIMD instructions, which is time-consuming. The Cell SDK provides intrinsics for the SPE instruction set, similar to those available for the PPE, which give easy access to all SIMD instructions. Usage of SIMD operations is not the only thing which can affect the performance of SPEs. 16-byte alignment of data in the LS is also important, as the SPU can only perform 16-byte aligned loads from the LS. Misaligned data is loaded by performing a misaligned load, which uses extra instructions and can significantly degrade the performance of a program. Alignment of data structures in the LS can be handled by using malloc_align and the aligned

attribute compiler directive. Hardware branch prediction is not available in the SPEs, so branch hints have to be inserted by either compilers or developers. A correct branch hint can remove expensive stalls in the SPEs, while incorrect branch hints can create stalls, and even with good compiler hints, SPEs will often perform poorly on code with many conditional branches. This has to be taken into consideration when developing code, as each branch could potentially become a major bottleneck, especially if it is located inside a loop. Loop unrolling and replacing branches with equivalent SIMD code can be used to reduce the number of branches in SPE code. Both techniques are described in section 5.1. The standard libraries included in the SDK also provide intrinsics for controlling DMA transfers. These intrinsics are low level and operate directly on memory pointers, making them complex to use but very versatile. DMA intrinsics often require cumbersome pointer arithmetic, and knowledge of the Cell architecture is essential. In addition, developers must handle all concurrency related issues, and with 8 SPEs accessing the main memory at the same time, this is not always an easy task. On the other hand, having low-level control of DMA transfers gives developers the freedom to explore clever new data flow schemes and thereby fully utilise all nine cores on the Cell. As mentioned in the Memory Flow Controller subsection, DMA transfer latencies can be hidden by utilising the MFC's ability to process DMA transfers autonomously. If an SPE program is data intensive, then hiding DMA transfer latencies can give a considerable performance boost. This can be done by using two or more buffers to create a circular data queue in the LS. If two buffers are used, the technique is called double buffering. Figure 6 illustrates a double buffering scheme, where one buffer is being filled with data while the other one is being processed by the SPU.
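The branch-replacement technique can be illustrated with a scalar analogue of the SPE's compare-and-select pattern: a comparison produces an all-ones or all-zeros mask, which then picks one of two values without any conditional branch. This is a generic C sketch, not SDK code:

```c
#include <stdint.h>

/* Branch-free maximum in the style of the SPE's compare/select SIMD
 * instructions.  The comparison yields an all-ones or all-zeros mask,
 * which selects one of the two operands without a branch. */
static int32_t select_max(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a > b);   /* all ones if a > b, else zero */
    return (a & mask) | (b & ~mask);
}
```

On an SPE the same pattern would operate on 128-bit vectors, removing the branch from four comparisons at once.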
1. Initiate DMA transfer from EA to LS buffer 1
2. Initiate DMA transfer from EA to LS buffer 2
3. Wait for DMA transfer to buffer 1 to complete
4. Process data in buffer 1
5. Initiate DMA transfer from EA to LS buffer 1
6. Wait for DMA transfer to buffer 2 to complete
7. Process data in buffer 2
8. Repeat from step 3 with fresh data

Figure 6: Double buffering scheme.

It can be necessary to use more than two buffers, in which case the
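In plain C, the loop structure of such a scheme looks roughly as follows. This is only a simulation of the idea: memcpy stands in for an asynchronous DMA transfer (so nothing actually overlaps here), the function names are invented, and the element count is assumed to be a positive multiple of the chunk size:

```c
#include <string.h>

/* Double-buffering sketch.  On an SPE the fetch would be a DMA get and
 * the wait a tag-status read; here memcpy is a synchronous stand-in.
 * Assumes n is a positive multiple of CHUNK. */
enum { CHUNK = 4 };

static void fetch(float *ls_buf, const float *ea, int count)
{
    memcpy(ls_buf, ea, count * sizeof(float));    /* DMA stand-in */
}

static float process(const float *buf, int count)
{
    float s = 0.0f;
    for (int i = 0; i < count; i++)
        s += buf[i];
    return s;
}

static float sum_stream(const float *ea, int n)
{
    float buf[2][CHUNK];
    float total = 0.0f;
    int cur = 0;

    fetch(buf[cur], ea, CHUNK);                   /* prime buffer 0 */
    for (int off = CHUNK; off < n; off += CHUNK) {
        fetch(buf[cur ^ 1], ea + off, CHUNK);     /* start next transfer */
        total += process(buf[cur], CHUNK);        /* meanwhile, process
                                                     the current buffer */
        cur ^= 1;
    }
    total += process(buf[cur], CHUNK);            /* drain last buffer */
    return total;
}
```

On real hardware the fetch of the next chunk and the processing of the current one run concurrently, which is where the performance gain comes from.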

scheme is just expanded as needed. The buffers must be independent, and the data partitioned in a way which can be implemented without a large overhead. While double buffering is one of the most important Cell programming techniques, it's also time consuming to implement and further complicates the tedious work of constructing efficient data flows. The complexity of DMA transfers is a problem which IBM has recognised, and several libraries have already been developed which handle many aspects of DMA transfers and let developers design data flow schemes using abstractions. Three such libraries are included in the Cell SDK, and a later section describes two of these in detail. A steady supply of fresh data is essential to keep each SPE busy, and with only 256KB available, it is often a challenge for developers to fit enough data into the LS. The LS has to contain both the SPU instructions and the data needed, so minimising the size of the SPE code can be necessary. Because of the size restrictions, standard library functions must be used carefully. For instance, using the C++ standard output stream cout requires library code that increases the size of binaries considerably. Using the standard printf function instead of cout will result in a smaller binary and leave more space in the LS for other data. Overlays can be used to overcome the restrictions on code size. By defining sections of SPE code as overlays, they can be stored in main memory and loaded/unloaded by SPEs at runtime as needed, thereby making it possible for SPE programs to exceed 256KB. Overlays are only useful if the algorithm can be divided into reasonably disjoint components which rarely need to be loaded/unloaded, because there's a large overhead associated with changing overlays.

4.4 Compilers

The Cell SDK contains several compilers for Cell processors, most of which are part of the well-known GNU tool chain.
Two different compilers are needed to compile Cell binaries, because the PPE and SPE instruction sets are very different. Compilation with the GNU tool chain compilers can be somewhat complicated, and Cell binaries can be compiled in several ways depending on which compilers are used and what kind of output is needed. One common way of compiling a program for the Cell processor goes as follows:

1. Compile the SPE code into a binary.
2. Wrap the SPE binary into a Cell Broadband Engine Embedded SPE Object Format (CESOF) linkable file.
3. Compile the PPE code into a binary and link it with the CESOF linkable file. This will embed the SPE binary into the PPE binary.

This compilation process results in a single binary file, which can be executed directly on a Cell system. The SDK contains a large make script which can be used to invoke all the necessary tools with the correct parameters, and which is very useful for learning how the compilation process works. To enable cross platform development, the GNU tool chain also includes cross-compilers, which can compile Cell binaries on an x86 system. It's preferable to develop code directly on a Cell system, but if such a system is not available, or the system is not suited for software development (the PS3 has only 256 MB RAM, which is not enough to run an IDE), the cross-compilers come in handy. IBM has developed its own compiler collection for Cell processors called XL C/C++. It is quite similar to the GNU tool chain but has some additional features like automatic overlays and improved auto-vectorisation. The XL C/C++ compilers are not available in the SDK, but an alpha version of their single source compiler [23], called IBM C/C++ Alpha Edition for Multi-core Acceleration single source compiler, is included in the SDK. This compiler is based on the XL C/C++ compilers and can compile both PPE and SPE code into a binary with a single command. Both GNU and IBM compilers are able to perform auto-vectorisation, i.e. transform scalar code into SIMD code. This technology has been the subject of extensive research [1, 45] in the last couple of years, and today's compilers are capable of auto-vectorising many types of scalar loops. It's still not possible to rely solely on this technology for vectorisation of code, but if the code has been written with auto-vectorisation in mind, this technology can relieve developers of some work and make the code more readable. Compiler technology is an interesting research area regarding Cell processors, because compilers might be able to relieve software developers of some of the most cumbersome tasks associated with Cell software development.
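As an example of writing with auto-vectorisation in mind, a loop with restrict-qualified pointers, a simple induction variable and a branch-free body gives the compiler everything it needs to emit SIMD code. This is a generic example rather than Cell-specific code:

```c
#include <stddef.h>

/* Auto-vectorisation-friendly loop: restrict rules out aliasing, the
 * trip count is a plain induction variable, and the body has no
 * branches, so an optimising compiler can emit SIMD code for it. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Pointer aliasing, early exits and function calls inside the loop body are the typical reasons a compiler gives up on vectorising a loop like this one.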
The aim is to develop compilers which make software development for the Cell as easy as writing code for a single core x86 system [29]. Researchers at IBM are currently trying to develop a compiler for the Cell that is capable of handling the following tasks:

- Convert single threaded code into multi-threaded code.
- Distribute workloads on SPEs.
- Advanced auto-vectorisation of code.
- Provide a high-level memory abstraction with only one address space.
- Handle data flow, i.e. hide DMA transfers.

The compiler is called Octopiler and has been under development for some years now. It will still take some time before the Octopiler is ready (that is, if it is ever going to be ready), so currently we have to make do with simpler compilers. If IBM succeeds at making a compiler which can handle all the tasks above, it would make the Cell available to a much broader audience and allow old code to be recompiled for the Cell architecture without rewriting everything. Manual optimisations will almost certainly still be needed to reach peak performance, but if a compiler could generate code utilising just half of the Cell's performance, it would still make the Cell much more attractive.

4.5 Programming Models

To take advantage of all nine cores on the Cell, an application has to be partitioned and the workload distributed. Several programming models have been developed which suggest different partitioning schemes and can be used as inspiration by software developers. It is important to consider several different ways of partitioning a problem before choosing one and implementing it. One model might offer an easy way of solving a particular problem, while another model offers better performance. In this thesis, several models were considered in the development of software for the experiments, but only the Parallel Stages Model seemed suitable for the NJ method. Programming models are either PPE-centric or SPE-centric. PPE-centric programming models use the PPE for the main application and offload tasks to the SPEs. The SPEs process data while the PPE acts as a coordinator. Many problems are easy to implement using PPE-centric models, which makes these models quite common. In SPE-centric programming models, the main part of the application is distributed across the SPEs. The SPEs manage the distribution of workloads themselves, while the PPE acts mainly as a resource manager, e.g. by running an operating system.
[22, 24, 41] describe several programming models, but most models are just specialisations of a more general model, so I will only present a few high level models which capture the main concepts in this thesis.

4.5.1 Service Model

The Service Model uses SPEs as service providers, to which the PPE can offload various tasks. This model has a very broad range of applications

and can be used to utilise SPEs in existing applications without rewriting the whole application. The PPE has complete control over the application, while SPEs are used as co-processors, which makes this model PPE-centric. The Service Model is simple and versatile, but it relies heavily on the PPE to control the SPEs, which can make the PPE a bottleneck. In addition, this model does not utilise SPE-to-SPE communication, which is one of the Cell processor's major strengths. An important subclass of this model is the Function-Offload Model, which uses SPEs to calculate one or more time consuming functions, like matrix multiplication, instead of using the PPE. It's very easy to use and well suited for accelerating existing applications. See Fig. 7 for an illustration of the Service Model.

Figure 7: The Service Model (the PPE runs the main application; SPE0 to SPE7 provide services A to G)

4.5.2 Multistage Pipeline Model

The Multistage Pipeline Model uses two or more SPEs to create a pipeline. The PPE supplies data to the pipeline entry, after which each SPE in the pipeline performs one or more operations on the data before forwarding it to the next SPE. The pipeline ends at the PPE, or the data is sent directly to another element on the EIB, like an I/O unit. This model is also PPE-centric, as the PPE is in control of the pipeline. Video decoders are a good example of how the Multistage Pipeline Model can be used. Video data is streamed through the pipeline, where each SPE performs functions like video/audio decoding, error correction or contrast/brightness processing, before the data stream is sent to a video or audio unit [25].

[Figure 8: The Multistage Pipeline Model. The PPE runs the main application and feeds SPE0; the data then passes through stages A to G on SPE0 to SPE7.]

As opposed to the Service Model, the Multistage Pipeline Model relies on SPE-to-SPE communication, thereby making good use of the huge bandwidth offered by the EIB. It is, however, difficult to distribute the workload evenly among the SPEs, and the total performance often depends on a few SPEs in the pipeline. The Multistage Pipeline Model can also be used to execute code that cannot fit into a single LS, by distributing the code over multiple SPEs. See Fig. 8 for an illustration of the Multistage Pipeline Model.

4.5.3 Parallel Stages Model

This model can be considered a specialisation of the Service Model, because it corresponds to a Service Model with the same service running on all SPEs. However, the Parallel Stages Model is very important and is therefore presented here explicitly. The model is perfect for problems which are easily parallelised and contain large amounts of data. The PPE sends data to the SPEs, which process the data in parallel and return the output to the PPE or some other element on the EIB. It is a PPE-centric model like the Multistage Pipeline Model, with which it also shares some features. This model captures the most basic way of utilising a Cell processor and will often give good performance, but it requires that the problem can be fully parallelised. Because the PPE is responsible for partitioning the data and reassembling the results returned from the SPEs, the PPE risks being overloaded if the partition scheme is too complex. The model could be expanded with SPE-to-SPE communication to relieve the PPE. See Fig. 9 for an illustration of the Parallel Stages Model.

[Figure 9: The Parallel Stages Model. The PPE runs the main application and distributes data to SPE0 to SPE7, which all run the same stage in parallel.]

4.5.4 Shared Memory Multiprocessor Model

This is an SPE-centric model where all SPEs function completely independently, thereby creating the illusion of a system with nine CPUs. The idea is to use the LS of each SPE as a cache and use the main memory of the Cell as the primary memory of the SPEs. Instructions as well as data are located in main memory, while a small kernel in each SPE uses DMA transfers to fetch the data and instructions needed by the SPE from main memory into the LS. The PPE should be used as a resource manager, but can also be included in the pool of available processors. It is a rather complicated model, which requires either a complex SPE kernel or a compiler that can translate SPE load and store instructions into DMA transfers and use part of the LS as a cache. Concurrent access to data could become a major hazard and must be handled by e.g. mutexes. The implementation must also maintain coherence between data in an LS and main memory. The model can be simplified if each SPE is assigned disjoint sections of main memory and only uses mailboxes or signal notification for inter-process communication. I do not know if this model would be usable for any practical applications, but it contains some very exciting ideas and could perhaps be usable in a simplified form. Figure 10 illustrates the Shared Memory Multiprocessor Model.

[Figure 10: The Shared Memory Multiprocessor Model. The PPE acts as resource manager (CPU A), while SPE0 to SPE7 act as independent CPUs sharing main memory.]

4.6 Software Development Tools

This section presents some of the key software development tools available in the Cell SDK. It also discusses how they can be applied in the development process and comments on their usefulness in normal software development. Several of these tools have been used to develop the software in this thesis, with varying success. Some tools still need development, some did not fit into the software development process used in this thesis, and some simply did not work because of hardware restrictions.

4.6.1 Libraries

Libraries which help utilise the SPEs in the Cell processor can speed up code development considerably, but it is hard to make such libraries generic while still fully utilising the Cell processor. Three such libraries are included in the Cell SDK. Two are described in this section, while the last one (DaCS) is beyond the scope of this thesis.

Basic Linear Algebra Subprograms: BLAS is a de facto application programming interface standard for linear algebra libraries, and libraries implementing BLAS have been developed for many different languages and platforms. These libraries perform the most basic linear algebra operations, like matrix multiplication, which are often time consuming and a critical part of an application. Some BLAS libraries, such as Intel MKL [11] and the CUDA SDK [12], are written for specific architectures and try to speed up calculations by using the available hardware features. The Cell SDK contains a BLAS library which is optimised for the Cell

architecture [19]. All functions in the BLAS API have been implemented for the PPE, and some have been optimised to take advantage of the SPEs. Matrix multiplication is a good example of a function using SPEs in the Cell BLAS library: large matrices are divided into smaller sub-matrices that fit into the LS of the SPEs. These sub-matrices are then multiplied in parallel using all available SPEs and reassembled to form the resulting matrix. In general, the Cell BLAS library can be used to efficiently solve standard linear algebra problems with minimal effort and little knowledge of the CBEA. However, because the BLAS library is designed to handle general problems, it is doubtful whether it can match the performance of specialised implementations. The BLAS library has not been used in this thesis, as none of the implementations used in the experiments make use of linear algebra.

Accelerated Library Framework: ALF [18] is a library that aims to help developers offload computationally intensive tasks to SPEs using the Service programming model (see section 4.5.1). In this thesis, ALF has been used to implement the NJ method; a description of both the implementation and the results of experiments with it can be found later in this thesis. In ALF, a main application is executed on the PPE as usual, but access to the SPEs goes through ALF using a task abstraction. Each task represents a small SPE program, which ALF is responsible for scheduling on the SPEs. Developers can choose to assign a fixed number of SPEs to each task or let ALF assign an appropriate number. Inside each task is a user-written kernel, which is responsible for processing units of work called work blocks. A work block is a piece of a larger problem (e.g. part of a matrix) which can be processed independently of other work blocks. Partitioning of a problem into work blocks is done by user-written code and can be performed on either an SPE or the PPE.
When a work block has been created and assigned data, it is enqueued in a task for processing. ALF then distributes the enqueued work blocks among the SPEs assigned to the task in which the work block is enqueued. Both tasks and work blocks can be executed in parallel, but while ALF can be configured to handle task dependencies, work blocks have to be independent, because developers have little control over the order in which work blocks are executed. All DMA transfers between the PPE and the SPEs are handled by ALF; developers need only define tasks, create work blocks and write the SPE kernel. Each work block is associated with an input buffer, which is mapped to a piece of main memory containing data, and it is this mapping that effectively defines the problem partition. When a work block is assigned to an SPE, ALF transfers the contents of the input buffer to the LS of the SPE before execution of the task kernel commences. An output buffer, which is also mapped to main memory, can be used for writing results back into main memory. ALF uses different buffering schemes (including

double buffering) to transfer the data in input or output buffers between the LS and main memory. The optimal buffering scheme is selected by ALF based on a number of factors, of which the size of the buffers and the amount of free space in the LS are the most important. Developers can indirectly affect which scheme is used by controlling the size of the buffers and the available LS memory.

The nice abstractions and hidden DMA transfers in ALF come at the cost of flexibility and speed. ALF does not allow SPEs to communicate with each other, since it only supports the Service Model. In addition, the lack of control over DMA transfers makes some data transfer strategies hard to implement and may force the developer to use suboptimal schemes. For example, if each work block in a task needs a chunk of static data, either a sufficiently large fixed-size data structure has to be created as part of the task context (a data structure which is transferred to the SPE when the task is initialised), or the input buffer has to contain the data, resulting in the data being transferred to the SPE each time a work block is processed. A better solution to this problem can be implemented with manual DMA transfers, as described elsewhere in this thesis.

ALF does not hide the heterogeneous architecture of the Cell completely. Knowledge of the Cell architecture is still important when writing high-performance applications, and some alignment constraints still need to be handled by developers. But despite some shortcomings, ALF does make software development for Cell processors more accessible. The task abstractions are easy to use and make utilisation of the SPEs more manageable.

4.6.2 The IBM Full-System Simulator

In parallel with the development of the Cell processor, IBM also developed a Cell simulator. The simulator provided a cost-efficient way of testing hardware specifications and enabled the development team to assess the performance of hardware features before the actual hardware was available.
Theoretical estimates of hardware performance were used to build the initial simulator, and as hardware components became available, the simulation was updated to reflect their actual performance, thereby improving the simulation quality. The simulator also enabled software development to commence before hardware development had completed, which minimised the need for hardware prototypes [46]. Today the simulator is available as part of the Cell SDK to allow software development without access to Cell hardware and to provide advanced debugging and optimisation features. The simulator is continuously improved with new features, better simulation quality and better performance to satisfy the needs of developers. The current version can simulate the individual components of a Cell processor very accurately and provides tools for detailed performance analysis and debugging.

The following description and discussion of the simulator is based on [40] and [21], plus experiences with the simulator gained during this thesis. The simulator was mainly used to verify the effect of optimisations, not as an integrated part of the development process, for reasons explained later in this section.

The simulator can be booted in two modes: a standalone mode, where the simulator handles all system calls, and a Linux mode, where a Fedora Linux kernel is booted. Only the Linux mode has been used in this thesis and is the basis of the following description. After booting in Linux mode, two console windows are available: one for Linux and another for the simulator. The Linux console allows normal interaction with the operating system and access to the simulator's file system. The simulator console lets the user interact with the simulator software and is also used for displaying runtime information from the simulator. Files can be transferred to and from the simulator's file system with the callthru command, and all files transferred to the simulator are stored in an image file, which can be reused in future simulation sessions.

Several levels of simulation are available. The level of simulation is controlled by selecting one of three simulation modes for each type of processor. The PPE modes are described in the following.

Fast mode simulates the effects of instructions and does not attempt to simulate any aspects of the actual execution time. This mode is mainly used to test the functionality of a program, and only a few performance statistics are available.

Simple mode assigns a fixed latency to each instruction and gives a loose estimate of the execution time. This mode is useful for debugging and provides more performance statistics than fast mode.

Cycle mode is the most precise mode.
Most aspects of program execution on the PPE are simulated, including timing policies and the internal mechanisms of components. The simulator can be stepped forward one cycle at a time in this mode, which can be useful for debugging. The execution time of a program is modelled fairly accurately, but can still differ significantly from the hardware execution time [13].

Each SPE can also be set to one of three modes: fast mode, instruction mode and pipeline mode, which roughly correspond to the PPE modes in terms of granularity. Fast mode corresponds to the PPE fast mode, instruction mode corresponds to simple mode, and pipeline mode corresponds to cycle mode. All modes provide accurate functional simulation, but performance statistics are only available in pipeline mode.

The performance statistics available in the simulator include an overview of the instructions responsible for pipeline stalls, the number of instructions executed organised by type, and the number of branch misses. These are all very useful for profiling code on both the PPE and the SPEs, but especially the SPE statistics seem useful, as they can be used to locate SPE code causing branch misses and pipeline stalls, which are a frequent cause of bad SPE performance. Most statistics are low level, and knowledge of how code is compiled into instructions is helpful when interpreting them. The simulator also provides some higher-level performance statistics for the SPEs that give a quick overview of SPE workloads.

The simulator is also useful for tracking down elusive bugs, because it can track the state of hardware components during the execution of a program. The contents of the registers in both the SPUs and the PPU are available, along with detailed information on the state of other components in the SPEs, such as the LS and the MFC. Generally, the debugging features of the simulator are low level and require a comprehensive knowledge of the CBEA to be of any use. They seem best suited for debugging OS kernels or very high performance code, but one debugging feature is useful for even the most basic applications: every time a runtime error occurs in an SPE, the Cell processor produces a standard error message, such as "bus error", which does not explain the cause of the error. A very common source of such errors is DMA requests that do not obey the alignment constraints. Such a DMA request causes immediate termination of the process with a standard error message on a PS3, while the simulator will give a detailed reason for the termination.

The simulator is also usable for software development, but it is preferable to use Cell hardware because of the simulation speed. Even when all processors are put in fast mode, executing a program takes far longer than on a PS3.
In addition, the simulator requires a large amount of RAM and CPU power, which can become a problem on older workstations. It is, however, possible to execute the simulator on a remote machine and access it from a local workstation. A remote simulator might speed things up a little, but the simulation speed will still be a problem and will probably slow down the development process. The simulator is only designed to execute code and does not provide any facilities for writing code, which must take place on another system. If a non-Cell system is used for software development, cross compilers can be used to create binaries, but it can be time consuming to move binaries and data between the development environment and the simulator. An Eclipse plug-in is available that integrates the simulator into the development environment, making it appear as if you are working on a Cell system. The simulator can be controlled through a GUI (see Fig. 11), which gives easy access to various performance statistics and debugging information.

Graphical representations of the performance statistics can be generated in the GUI and are a useful tool for identifying performance bottlenecks.

[Figure 11: The IBM Full-System Simulator GUI.]

In this thesis, a PS3 was used for all phases of the software development process, and the simulator was only used to verify the effect of optimisations in SPE code. The simulation speed was the main reason for this choice, but a lack of documentation for the simulator also made it hard to figure out how the different statistics related to the actual code. The simulator's background as a tool for hardware verification makes it difficult to understand and use for high-level software development. I think the simulator could be of some use for optimising SPE code, because the SPEs are very simple and depend on low-level code optimisations to achieve maximum performance, but I would personally prefer not to use it.

4.6.3 Debugging Tools

Two debuggers, ppu-gdb and spu-gdb, are supplied with the Cell SDK [22]. ppu-gdb can debug both PPE and SPE code, while spu-gdb can only debug SPE code. Both are based on the well-known GNU Project Debugger (gdb), and both look and feel like gdb. Only ppu-gdb is described here, because spu-gdb is only used for standalone SPE applications, which are beyond the scope of this thesis. Usage of ppu-gdb is similar to the standard C/C++ gdb on x86 systems.


More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Chapter Seven Morgan Kaufmann Publishers

Chapter Seven Morgan Kaufmann Publishers Chapter Seven Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Original PlayStation: no vector processing or floating point support. Photorealism at the core of design strategy

Original PlayStation: no vector processing or floating point support. Photorealism at the core of design strategy Competitors using generic parts Performance benefits to be had for custom design Original PlayStation: no vector processing or floating point support Geometry issues Photorealism at the core of design

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Cell SDK and Best Practices

Cell SDK and Best Practices Cell SDK and Best Practices Stefan Lutz Florian Braune Hardware-Software-Co-Design Universität Erlangen-Nürnberg siflbrau@mb.stud.uni-erlangen.de Stefan.b.lutz@mb.stud.uni-erlangen.de 1 Overview - Introduction

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

CONSOLE ARCHITECTURE

CONSOLE ARCHITECTURE CONSOLE ARCHITECTURE Introduction Part 1 What is a console? Console components Differences between consoles and PCs Benefits of console development The development environment Console game design What

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

Multicore Computing and Scientific Discovery

Multicore Computing and Scientific Discovery scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

High Performance Computing. University questions with solution

High Performance Computing. University questions with solution High Performance Computing University questions with solution Q1) Explain the basic working principle of VLIW processor. (6 marks) The following points are basic working principle of VLIW processor. The

More information

Roadrunner. By Diana Lleva Julissa Campos Justina Tandar

Roadrunner. By Diana Lleva Julissa Campos Justina Tandar Roadrunner By Diana Lleva Julissa Campos Justina Tandar Overview Roadrunner background On-Chip Interconnect Number of Cores Memory Hierarchy Pipeline Organization Multithreading Organization Roadrunner

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Neil Costigan School of Computing, Dublin City University PhD student / 2 nd year of research.

Neil Costigan School of Computing, Dublin City University PhD student / 2 nd year of research. Crypto On the Cell Neil Costigan School of Computing, Dublin City University. neil.costigan@computing.dcu.ie +353.1.700.6916 PhD student / 2 nd year of research. Supervisor : - Dr Michael Scott. IRCSET

More information

The Pennsylvania State University. The Graduate School. College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM

The Pennsylvania State University. The Graduate School. College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM The Pennsylvania State University The Graduate School College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM FOR THE IBM CELL BROADBAND ENGINE A Thesis in Computer Science and Engineering by

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Optimising for the p690 memory system

Optimising for the p690 memory system Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor

More information

Saman Amarasinghe and Rodric Rabbah Massachusetts Institute of Technology

Saman Amarasinghe and Rodric Rabbah Massachusetts Institute of Technology Saman Amarasinghe and Rodric Rabbah Massachusetts Institute of Technology http://cag.csail.mit.edu/ps3 6.189-chair@mit.edu A new processor design pattern emerges: The Arrival of Multicores MIT Raw 16 Cores

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Compilation for Heterogeneous Platforms

Compilation for Heterogeneous Platforms Compilation for Heterogeneous Platforms Grid in a Box and on a Chip Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/heterogeneous.pdf Senior Researchers Ken Kennedy John Mellor-Crummey

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Von Neumann architecture. The first computers used a single fixed program (like a numeric calculator).

Von Neumann architecture. The first computers used a single fixed program (like a numeric calculator). Microprocessors Von Neumann architecture The first computers used a single fixed program (like a numeric calculator). To change the program, one has to re-wire, re-structure, or re-design the computer.

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

What is a phylogenetic tree? Algorithms for Computational Biology. Phylogenetics Summary. Di erent types of phylogenetic trees

What is a phylogenetic tree? Algorithms for Computational Biology. Phylogenetics Summary. Di erent types of phylogenetic trees What is a phylogenetic tree? Algorithms for Computational Biology Zsuzsanna Lipták speciation events Masters in Molecular and Medical Biotechnology a.a. 25/6, fall term Phylogenetics Summary wolf cat lion

More information

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra Binary Representation Computer Systems Information is represented as a sequence of binary digits: Bits What the actual bits represent depends on the context: Seminar 3 Numerical value (integer, floating

More information

Parallelism and Concurrency. COS 326 David Walker Princeton University

Parallelism and Concurrency. COS 326 David Walker Princeton University Parallelism and Concurrency COS 326 David Walker Princeton University Parallelism What is it? Today's technology trends. How can we take advantage of it? Why is it so much harder to program? Some preliminary

More information

Parallel Programming Patterns Overview and Concepts

Parallel Programming Patterns Overview and Concepts Parallel Programming Patterns Overview and Concepts Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Top500 Supercomputer list

Top500 Supercomputer list Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved

More information

ASYNCHRONOUS SHADERS WHITE PAPER 0

ASYNCHRONOUS SHADERS WHITE PAPER 0 ASYNCHRONOUS SHADERS WHITE PAPER 0 INTRODUCTION GPU technology is constantly evolving to deliver more performance with lower cost and lower power consumption. Transistor scaling and Moore s Law have helped

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Experts in Application Acceleration Synective Labs AB

Experts in Application Acceleration Synective Labs AB Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg

More information

Cell Broadband Engine Architecture. Version 1.0

Cell Broadband Engine Architecture. Version 1.0 Copyright and Disclaimer Copyright International Business Machines Corporation, Sony Computer Entertainment Incorporated, Toshiba Corporation 2005 All Rights Reserved Printed in the United States of America

More information

New Advances in Micro-Processors and computer architectures

New Advances in Micro-Processors and computer architectures New Advances in Micro-Processors and computer architectures Prof. (Dr.) K.R. Chowdhary, Director SETG Email: kr.chowdhary@jietjodhpur.com Jodhpur Institute of Engineering and Technology, SETG August 27,

More information

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst Phylogenetics Introduction to Bioinformatics Dortmund, 16.-20.07.2007 Lectures: Sven Rahmann Exercises: Udo Feldkamp, Michael Wurst 1 Phylogenetics phylum = tree phylogenetics: reconstruction of evolutionary

More information

Amir Khorsandi Spring 2012

Amir Khorsandi Spring 2012 Introduction to Amir Khorsandi Spring 2012 History Motivation Architecture Software Environment Power of Parallel lprocessing Conclusion 5/7/2012 9:48 PM ٢ out of 37 5/7/2012 9:48 PM ٣ out of 37 IBM, SCEI/Sony,

More information

Ray Tracing on the Cell Processor

Ray Tracing on the Cell Processor Ray Tracing on the Cell Processor Carsten Benthin Ingo Wald Michael Scherbaum Heiko Friedrich intrace Realtime Ray Tracing GmbH SCI Institute, University of Utah Saarland University {benthin, scherbaum}@intrace.com,

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

The p196 mpi implementation of the reverse-and-add algorithm for the palindrome quest.

The p196 mpi implementation of the reverse-and-add algorithm for the palindrome quest. The p196 mpi implementation of the reverse-and-add algorithm for the palindrome quest. Romain Dolbeau March 24, 2014 1 Introduction To quote John Walker, the first person to brute-force the problem [1]:

More information

PDF created with pdffactory Pro trial version How Computer Memory Works by Jeff Tyson. Introduction to How Computer Memory Works

PDF created with pdffactory Pro trial version   How Computer Memory Works by Jeff Tyson. Introduction to How Computer Memory Works Main > Computer > Hardware How Computer Memory Works by Jeff Tyson Introduction to How Computer Memory Works When you think about it, it's amazing how many different types of electronic memory you encounter

More information

VIII. DSP Processors. Digital Signal Processing 8 December 24, 2009

VIII. DSP Processors. Digital Signal Processing 8 December 24, 2009 Digital Signal Processing 8 December 24, 2009 VIII. DSP Processors 2007 Syllabus: Introduction to programmable DSPs: Multiplier and Multiplier-Accumulator (MAC), Modified bus structures and memory access

More information

ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015

ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015 ML phylogenetic inference and GARLI Derrick Zwickl University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015 Outline Heuristics and tree searches ML phylogeny inference and

More information

Optimization of FEM solver for heterogeneous multicore processor Cell. Noriyuki Kushida 1

Optimization of FEM solver for heterogeneous multicore processor Cell. Noriyuki Kushida 1 Optimization of FEM solver for heterogeneous multicore processor Cell Noriyuki Kushida 1 1 Center for Computational Science and e-system Japan Atomic Energy Research Agency 6-9-3 Higashi-Ueno, Taito-ku,

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Main Memory and the CPU Cache

Main Memory and the CPU Cache Main Memory and the CPU Cache CPU cache Unrolled linked lists B Trees Our model of main memory and the cost of CPU operations has been intentionally simplistic The major focus has been on determining

More information