Speeding up Parsimony Scoring with Streaming SIMD Extensions 2


Jason Evans <jevans@uidaho.edu> and James Foster <foster@uidaho.edu>
Initiative for Bioinformatics and Evolutionary Studies
Department of Computer Science
University of Idaho
Moscow, ID

Abstract

The number of trees that can be evaluated in a given amount of time is critical to the effectiveness of heuristic phylogenetic tree searches. Due to its speed, maximum parsimony is often employed as the optimality criterion when analyzing large datasets. This paper presents a performance enhancement for parsimony scoring that is complementary to the highly refined techniques already discussed in the literature. Many modern microprocessors include single instruction, multiple data (SIMD) extensions. These instructions perform multiple operations in parallel, thus making substantial throughput improvements possible. The Intel IA32 implementation of SIMD is called Streaming SIMD Extensions 2 (SSE2), which is part of all Pentium 4 microprocessors. This paper presents a realistic performance comparison between an optimized C implementation and an SSE2 assembly language implementation of parsimony scoring. Empirical results show a factor of 2.1 to 2.5 increase in performance.

Introduction

Systematists strive to determine the evolutionary relationships among sets of taxa. Relationships are determined by obtaining character data such as DNA or amino acid sequences for each taxon, then using the character differences among the taxa to infer phylogenetic relationships. One of the simplest inference approaches is to use maximum parsimony as the optimality criterion that determines the relative fitness of candidate phylogenetic trees. The principle of maximum parsimony rests on the idea that the simplest explanation is preferable. As the number of taxa in a data set increases linearly, the number of possible trees that describe the relationships among those taxa grows factorially.
This makes exhaustive searching of all possible trees intractable for more than perhaps 30 taxa, which forces the use of heuristic searches for many interesting datasets. Effective heuristic searches must 1) carefully choose which candidate trees to consider, and 2) evaluate as many candidate trees as possible in the allotted time. The SSE2 optimizations presented in this paper directly address the second point.

The simplest algorithm for calculating parsimony scores was proposed by Fitch (1975). A more sophisticated algorithm was suggested by Sankoff (1975), but the Fitch algorithm is faster and more commonly used. The majority of research into optimization techniques for parsimony scoring has focused on the Fitch algorithm, which is also the focus of this paper. Fast Fitch parsimony scoring approaches have been presented in detail (Gladstein, 1997; Goloboff, 1993; Goloboff, 1996; Ronquist, 1998), and are used by some of the fastest publicly available implementations (Goloboff, 1999; Swofford, 2002). Some of the optimization techniques are quite involved; only the aspects that directly impact SSE2 optimization are mentioned in this paper.

Fitch parsimony scoring

The Fitch algorithm for parsimony scoring assumes that all character states are the same evolutionary distance from each other. This means that for DNA character data, no distinction is made between transitions (A↔G, C↔T) and transversions (A/G↔C/T). Since character states are not classified, they can be dealt with as uniform sets. The Fitch algorithm is a dynamic programming algorithm; the partial score of each node in a tree depends only on the states of its children. Therefore, the Fitch algorithm can be implemented as a post-order tree traversal, wherein the following steps are performed at each node:

1) If a leaf node:
   a. Initialize the character state set to contain the characters associated with the taxon. In the case of DNA, ambiguity codes are decomposed into their constituent bases. For example, V translates to {A,C,G}, and '-' (gap) translates to {A,C,G,T}.
2) If an internal node:
   a. Create the intersection of the child nodes' state sets, and associate the result with the node.
   b. If the set created in (a) is empty:
      i. Create the union of the child nodes' state sets, and associate the result with the node.
      ii. Increment the parsimony score.

Figure 1 shows a tree with five taxa. The state sets are shown for each node, and a + denotes nodes at which the parsimony score was incremented. A more realistic example would have multiple characters. Figure 2 shows the partial scoring results for an internal node of a tree with four characters. Each character is calculated independently of the others, but the total score for the tree is the sum of the scores for all characters.

[Figure 1: tree diagram. Node state sets as labeled: {A} {A} {C} {T} {C} {A} {C,T}+ {A,C}+ {C}]
Figure 1. A 5-taxon tree with Fitch parsimony state sets and scores at nodes. The tree has a total score of 2.
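The post-order steps listed above can be sketched in C for a single character. This is a minimal illustration, not the authors' implementation: the `struct node` layout, the function name `fitch_score`, and the bit assignment for the four bases are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* One bit per DNA base, so a state set is a 4-bit mask.
 * The bit assignment is an assumption made for this sketch.
 * An ambiguity code such as V would be ST_A | ST_C | ST_G. */
enum { ST_A = 8, ST_C = 4, ST_G = 2, ST_T = 1 };

/* Hypothetical tree node for a rooted binary tree. */
struct node {
    struct node *left;   /* NULL for a leaf */
    struct node *right;  /* NULL for a leaf */
    uint8_t states;      /* leaf: set by the caller; internal: computed */
};

/* Post-order Fitch pass for a single character.
 * Fills in the state set of every internal node below n and returns the
 * number of changes (the parsimony score) in the subtree rooted at n. */
unsigned fitch_score(struct node *n)
{
    if (n->left == NULL || n->right == NULL)
        return 0;                       /* step 1: leaf, already initialized */

    unsigned score = fitch_score(n->left) + fitch_score(n->right);
    uint8_t both = n->left->states & n->right->states;   /* step 2a */

    if (both != 0) {
        n->states = both;
    } else {                            /* step 2b: empty intersection */
        n->states = n->left->states | n->right->states;
        score++;
    }
    return score;
}
```

Applied to the topology ((A,A),((C,T),C)), which is one topology consistent with the state sets listed for Figure 1 (the topology itself is an assumption), the sketch yields the internal sets {A}, {C,T}, {C}, and {A,C}, with a score of 2.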

Child:   {G,T}       {A,C}     {A,G,T}  {T}
Child:   {A,C}       {G}       {A,T}    {T}
Parent:  {A,C,G,T}+  {A,C,G}+  {A,T}    {T}

Figure 2. Character state sets and scores of an internal node and its children, in a tree with 4 characters. The parent node has a score of 2.

Tree bisection and reconnection (TBR) hill climbing

Heuristic tree searches typically employ some form of hill climbing. The tree bisection and reconnection (TBR) transform is most often used to create a network of trees that define the landscape in which hill climbing is conducted. Each step of a hill climb evaluates all the neighbors of the current tree, and holds one or more trees with better scores for later consideration. If all trees with better scores are held, the number of held trees can quickly become unmanageable. For large datasets it is commonly necessary to consider only the best of the neighboring trees, which results in following the steepest paths.

Limiting a search to steepest paths allows the early termination of scoring for trees in a neighborhood that are known to not have the best score. This optimization can provide a substantial speedup, and is a critical optimization for any high performance implementation of parsimony/TBR-based hill climbing. As will be seen later, this optimization confounds the SSE2 optimization, since the termination check must happen in the inner loop of the scoring function.

The TBR transform consists of bisecting a tree at some edge, and reconnecting the subtrees by picking one edge from each subtree and connecting those two edges together (Felsenstein, 2004). Figure 3 shows an example TBR transform. During hill climbing searches, trees are evaluated an entire neighborhood at a time. Therefore, pre-calculation and caching of partial parsimony scoring results have the potential to drastically reduce the total amount of calculation.
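The early termination idea described in this section can be illustrated with a scalar scoring loop that stops as soon as a candidate tree can no longer beat the best score seen so far. The sketch below is an assumption-laden illustration rather than the authors' code: the function name `fitch_root_bounded`, its parameters, and the one-state-set-per-byte layout are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Final scoring of a candidate tree from the two state-set vectors that
 * meet at its root. Each byte holds one 4-bit state set (a layout chosen
 * for clarity in this sketch). base_score is the score already accumulated
 * in the two subtrees; best_so_far is the maximum interesting score.
 * Scoring stops at the first change that pushes the score past the bound. */
unsigned fitch_root_bounded(const uint8_t *left, const uint8_t *right,
                            size_t nchars, unsigned base_score,
                            unsigned best_so_far)
{
    unsigned score = base_score;

    for (size_t i = 0; i < nchars; i++) {
        if ((left[i] & right[i]) == 0) {      /* empty intersection */
            score++;
            if (score > best_so_far)
                return score;                 /* cannot be the best tree */
        }
    }
    return score;
}
```

With the two child vectors from Figure 2 and a generous bound, the function returns 2; with a bound of 0 it returns after the first character, having read only one of the four state-set pairs.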

[Figure 3: two tree diagrams over taxa A through I, with the steps labeled Bisect and Reconnect]
Figure 3. The tree on the left can be transformed via TBR to the tree on the right.

A pre-calculation approach that was originally described by Goloboff (1993) reduces the total amount of calculation by approximately a factor of the number of internal nodes in the tree. This optimization is so large that it fundamentally changes the nature of parsimony/TBR-based neighborhood scoring. The key observation that motivates this optimization is that for each bisection of a tree, the trees in that portion of the TBR neighborhood are composed of the same two subtrees. This means that calculating the parsimony scores for an entire TBR neighborhood can be quickly done by performing the following steps for each edge in the tree:

1) Bisect the tree (create two subtrees).
2) For each subtree, calculate the state sets and score for each possible rooting.
3) Reconnect the two subtrees using every valid combination of edges (one combination would reverse the bisection) and calculate the final score for each resulting tree.

A caching optimization developed by Gladstein (1997) is applicable to step (2), although Gladstein originally presented the optimization as being orthogonal to Goloboff's approach. Both optimizations are employed in the SSE2 experiments.

SSE2 optimization

Speeding up Fitch parsimony scoring with SIMD was suggested by Ronquist (1998), though only theoretical results for a Motorola PowerPC 604 processor were provided. Ronquist provided alternate algorithms for horizontal and vertical packing of character data. The results in this paper were obtained using horizontal packing, which means that two DNA character state sets are stored in each byte of memory. For example, a byte with a value of 107 (binary 01101011) translates to {C,G},{A,G,T}. SSE2 provides eight 128-bit registers (xmm0 through xmm7), each of which is treated in this paper as a vector of sixteen independent bytes.
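The horizontal packing described above can be sketched in C. The bit assignment A=8, C=4, G=2, T=1 is inferred from the paper's example (107 translating to {C,G},{A,G,T}); the function names, the rounding of vectors up to a multiple of 16 bytes, and the choice to fill unused positions with all-ones nibbles are assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit assignment consistent with the example byte 107 = {C,G},{A,G,T}. */
enum { ST_A = 8, ST_C = 4, ST_G = 2, ST_T = 1 };

/* Pack two 4-bit state sets into one byte: first in the upper nibble,
 * second in the lower nibble. */
uint8_t pack_pair(uint8_t first, uint8_t second)
{
    return (uint8_t)(((first & 0x0F) << 4) | (second & 0x0F));
}

uint8_t unpack_first(uint8_t byte)  { return (uint8_t)(byte >> 4); }
uint8_t unpack_second(uint8_t byte) { return (uint8_t)(byte & 0x0F); }

/* Bytes needed for nchars state sets, rounded up to a whole number of
 * 16-byte blocks (assumed here so that full registers can be loaded). */
size_t packed_len(size_t nchars)
{
    size_t bytes = (nchars + 1) / 2;
    return (bytes + 15) / 16 * 16;
}

/* Pack nchars state sets into out, which must hold packed_len(nchars)
 * bytes. Unused nibbles are set to 0xF (assumption): two all-ones sets
 * always intersect, so padding never adds to the parsimony score. */
void pack_states(const uint8_t *states, size_t nchars, uint8_t *out)
{
    size_t nbytes = packed_len(nchars);

    for (size_t i = 0; i < nbytes; i++) {
        uint8_t hi = (2 * i < nchars) ? states[2 * i] : 0x0F;
        uint8_t lo = (2 * i + 1 < nchars) ? states[2 * i + 1] : 0x0F;
        out[i] = pack_pair(hi, lo);
    }
}
```

Under these assumptions, `packed_len(759)` evaluates to 384, which agrees with the vector length the paper reports for the 759-character test dataset.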
Since each byte contains two character state sets, it is possible to process 32 characters per iteration of the inner scoring loop, as compared to two characters per iteration for a non-vectorized implementation.

One of the biggest challenges to vectorizing code is avoiding data-dependent conditional branches. Since many data elements are being processed in parallel, the program cannot take different branches for each data element. SSE2's pcmpeqb instruction provides an elegant solution by performing a bytewise comparison of two SSE2 registers and storing result bitmasks in one of those registers (Figure 4). A general strategy for dealing with branches is to calculate the results for both code paths, mask out the bytes for which the results of each branch are invalid, and then merge the results. The SSE2 implementation of Fitch parsimony scoring uses this basic approach, but also uses bitmasks to calculate the score.

[Figure 4: the instruction pcmpeqb %%xmm1, %%xmm2 applied to example register contents]
Figure 4. The pcmpeqb instruction compares the bytes of two registers and sets a bitmask accordingly. If the bytes are equal, the bitmask is set to all 1 bits, otherwise the bitmask is set to all 0 bits.

Although SSE2 was primarily designed for streaming multimedia applications, all the necessary functionality for vectorization of parsimony scoring is present: unaligned and aligned memory load/store instructions, bitwise logical instructions, bitwise shifting instructions, and math instructions. Each iteration of the inner loop of SSE2-based parsimony scoring performs the following operations:

1) Read sixteen bytes of character data (32 characters) from the character state set vector of each child node.
2) Process the state sets stored in the upper four bits of each byte.
3) Process the state sets stored in the lower four bits of each byte.
4) Sum the total number of changes, and add them to the current parsimony score.
5) For an internal node, store the resulting state set vector in memory. For a root node (final tree scoring), do not bother storing the resulting state set vector, and terminate scoring if the maximum interesting score was exceeded.

Note that there are two alternatives for step (5). The implementation that is used for the experiments contains two separate functions that implement these alternatives, in order to maximize performance.

A short description of the C implementation is needed in order to understand the tradeoffs that the SSE2 implementation must make. The inner loop of the C version of the program is unrolled, so that 32 characters are processed per iteration. This measurably improves performance by reducing the overhead imposed by the loop conditional.
Unlike the SSE2 version, the C version is able to check whether the maximum score has been exceeded precisely when it actually increments the score, and then terminate scoring immediately after exceeding the threshold. By comparison, the SSE2 version only checks once per loop iteration (every 32 characters), which means that the C version typically reads and processes fewer characters.
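The five inner-loop operations and the compare, mask, and merge strategy can be sketched with SSE2 compiler intrinsics in C. The paper's implementation is written in assembly language and is not reproduced here, so everything below is an assumption for illustration: the function names, the use of a single function with an optional output pointer in place of the two separate functions the paper describes, the use of `_mm_sad_epu8` to sum the per-byte change counts, and the requirement that vectors be padded to a multiple of 16 bytes with 0xF nibbles. `_mm_cmpeq_epi8` is the intrinsic that corresponds to pcmpeqb.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Fitch step for one nibble position of 16 bytes. a and b must already be
 * masked to that nibble. Returns the parent nibbles and adds 1 to each
 * byte lane of *changes whose intersection was empty. */
static __m128i fitch_nibbles(__m128i a, __m128i b, __m128i *changes)
{
    __m128i inter = _mm_and_si128(a, b);
    __m128i uni   = _mm_or_si128(a, b);
    /* 0xFF in every lane where the intersection is empty, else 0x00 */
    __m128i empty = _mm_cmpeq_epi8(inter, _mm_setzero_si128());

    /* 0xFF is -1 as a signed byte, so subtracting the mask counts changes */
    *changes = _mm_sub_epi8(*changes, empty);
    /* inter is zero exactly where empty is set, so OR merges both paths */
    return _mm_or_si128(inter, _mm_and_si128(uni, empty));
}

/* Scores nbytes (a multiple of 16) of packed state sets, two per byte.
 * parent may be NULL for final tree scoring, in which case nothing is
 * stored. The bound is checked once per 16-byte block (32 characters). */
unsigned fitch_sse2(const uint8_t *left, const uint8_t *right,
                    uint8_t *parent, size_t nbytes, unsigned max_score)
{
    const __m128i zero    = _mm_setzero_si128();
    const __m128i hi_mask = _mm_set1_epi8((char)0xF0);
    const __m128i lo_mask = _mm_set1_epi8(0x0F);
    unsigned score = 0;

    for (size_t i = 0; i < nbytes; i += 16) {
        /* 1) read sixteen bytes from each child */
        __m128i a = _mm_loadu_si128((const __m128i *)(left + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(right + i));
        __m128i changes = zero;

        /* 2) upper nibbles, 3) lower nibbles */
        __m128i hi = fitch_nibbles(_mm_and_si128(a, hi_mask),
                                   _mm_and_si128(b, hi_mask), &changes);
        __m128i lo = fitch_nibbles(_mm_and_si128(a, lo_mask),
                                   _mm_and_si128(b, lo_mask), &changes);

        /* 4) sum the sixteen byte counters (two 8-byte partial sums) */
        __m128i sums = _mm_sad_epu8(changes, zero);
        score += (unsigned)_mm_cvtsi128_si32(sums)
               + (unsigned)_mm_extract_epi16(sums, 4);

        /* 5) store for internal nodes; check the bound once per block */
        if (parent != NULL)
            _mm_storeu_si128((__m128i *)(parent + i), _mm_or_si128(hi, lo));
        if (score > max_score)
            break;
    }
    return score;
}
```

Because the bound is tested only after a whole block has been processed, a call with a bound of 0 on the packed Figure 2 data still reports both changes in the first block, whereas a per-character scalar check would stop after the first one. This is the granularity difference discussed above.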

Experiment

The SSE2-optimized implementation and a C-only implementation of Fitch parsimony scoring were used to analyze an aligned dataset consisting of 759 informative characters of rbcL data for 500 taxa (Chase, 1993). The dataset has become a rather standard benchmark for the effectiveness of heuristic search techniques on large datasets (Nixon, 1999; Rice, 1997; Snell, 2000). Real data are preferable to simulated data for this experiment, since early termination behavior is data-dependent. The performance of the two implementations was compared for three different experimental configurations:

1. Starting at the locally optimal tree reported by Rice (1997) and published in electronic form (Rice), the parsimony scores for all 9,266,156 trees in the immediate neighborhood were calculated. This was repeated 100 times in a single program run, for a total of 926 million tree evaluations.
2. Starting at the same tree as in configuration (1), the best trees in the immediate neighborhood were found. This was repeated 100 times in a single program run, for a total of 926 million tree evaluations.
3. 100 pseudo-random trees were generated. For each tree, the best neighbors were found 100 times. All of this was done during a single program run, for a total of 74.3 billion tree evaluations. The same 100 trees were used for the C and SSE2 program runs.

The tree in configurations (1) and (2) is a local optimum, which reduces the effectiveness of the early termination optimization. The trees in configuration (3) tend to each have a very diverse neighborhood, which allows early termination to happen more often. The configurations focus on the early termination optimization due to its variable negative impact on the effectiveness of the SSE2 optimization.

The 100 pseudo-random trees in configuration (3) were drawn from approximately 10^1280 possible trees. Therefore, the mean parsimony score for the trees is near the population mean, with very high probability. The main point of this configuration is simply to measure performance for trees with diverse neighborhoods, so there is little benefit to separately measuring the results for each of the neighborhoods. With that in mind, the results for this experimental configuration are summarized by dividing the total number of trees considered by the total time taken, just as in configurations (1) and (2).

All neighborhoods were iterated over repeatedly, in order to reduce the stochastic effects of data caching and to increase the total program runtimes to a point where time measurement error (typically ± 10 ms) did not significantly impact the accuracy of the results.

All experiments were run on a four-processor 2.8 GHz Pentium 4-based computer. The operating system is a Linux variant with an SMP kernel. No multi-threading was used in the experiments, so the multiple processors had no positive impact on the experiments. The test program was compiled using gcc 3.3.3, with the -O2 optimization flag specified.

Figure 5 summarizes the results for these experiments. The speedup ranges from a factor of 2.1 to 2.5. We interpret the results as indicating that the speedup differences depend on two factors: 1) the relative overhead of the early termination check for the C and SSE2 versions, and 2) the effectiveness of the early termination check. The speedup difference between configurations (1) and (2) is attributed to the first factor, and the speedup difference between configurations (2) and (3) is attributed to the second factor.

[Figure 5: bar chart of millions of trees/sec for the C and SSE2 implementations, with a speedup label for each of three configurations: 1. Peak all, 2. Peak best, 3. Random best]
Figure 5. Millions of tree evaluations per second of C and SSE2 implementations, for three different experimental configurations. The SSE2 implementation is 2.1 to 2.5 times faster than the C implementation for these experiments.

Discussion

A speedup of 2.1X to 2.5X is a substantial, consistent performance improvement that is certainly worth the programming effort, especially if a program spends more than a few days performing data analysis. One might expect an order of magnitude performance improvement, but in fact, the theoretical maximum improvement is approximately in the 3-5X range. There are two contributing factors to this discrepancy. First, the Pentium 4 processor is three-way superscalar, which allows it to retire up to three instructions per cycle. However, there is only one floating point unit, which means that only one SSE2 instruction can run at a time. Second, the Pentium 4 implements the x86 instruction set, but internally, these instructions are translated to a RISC-like set of micro-ops. In the case of many SSE2 instructions, at least two micro-ops are needed, since the floating point unit only handles 64 bits of data at a time. Therefore, if conditions were ideal for non-SSE2 code, six instructions could be retired in the same amount of time that one SSE2 instruction takes to run. This would reduce the theoretical maximum speedup from 16X to approximately 2.7X, in the worst case.

The experiments presented in this paper are meant to be a reasonably realistic comparison of the expected performance difference of replacing an optimized C implementation of parsimony scoring with an SSE2 implementation. A less realistic experiment would leave out the early termination optimization. Earlier versions of the test program did not implement early termination, and the observed speedup was approximately 2.9X.
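The worst-case bound of approximately 2.7X quoted in the Discussion follows from a short calculation. The symbols are introduced only for this illustration and do not appear in the paper: $w$ is the number of bytes processed per SSE2 instruction (16), $r$ is the peak number of ordinary instructions retired per cycle (3), and $m$ is the number of micro-ops per SSE2 instruction (2), assuming the single unit issues one micro-op per cycle.

```latex
\[
S_{\text{worst}} \approx \frac{w}{r \cdot m} = \frac{16}{3 \times 2} \approx 2.7
\]
```

In words: while one SSE2 instruction handles 16 bytes, ideally scheduled scalar code could retire six instructions, so the sixteen-fold data parallelism shrinks to roughly 16/6.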

A substantial increase of the number of characters would benefit the SSE2 implementation in benchmarks, especially if early termination were omitted. Microprocessors often spend time waiting for data to be read from memory, but linear memory access performs well due to predictive data prefetch. The character state set vectors for the test dataset are only 384 bytes long, so the startup cost of reading each vector probably has a significant impact on overall throughput. The SSE2 implementation is more vulnerable to this issue because it is faster, and because it reads more data on average.

SSE2 optimizations are only of benefit when using certain Intel and AMD microprocessors. Many researchers (authors included) also use PowerPC-based systems, so future work will likely include the implementation of similar optimizations using the AltiVec instruction set, which is present in Apple G4- and G5-based systems.

Availability

Source code for the program that was used in the experiments is available upon request.

Acknowledgements

This work was partially funded by NIH NCRR 1P20 RR. Experiments were run on the IBEST Beowulf cluster, which is funded in part by NSF EPS, NIH NCRR 1P20 RR16448, and NIH NCRR 1P20 RR. The authors are grateful for Robert Beers's help in understanding the internals of the Pentium 4 processor.

References

Chase, M. W., D.E. Soltis, R.G. Olmstead, D. Morgan, D.H. Les, B.D. Mishler, M.R. Duvall, R.A. Price, H.G. Hills, Y.-L. Qiu, K.A. Kron, J.H. Rettig, E. Conti, J.D. Palmer, J.R. Manhart, K.J. Sytsma, H.J. Michaels, W.J. Kress, K.G. Karol, W.D. Clark, M. H. Hédren, B.S. Gaut, R.K. Jansen, K.-J. Kim, C.F. Wimpee, J.F. Smith, G.R. Furnier, S.H. Strauss, Q.-Y. Xiang, G.M. Plunkett, P.S. Soltis, S.M. Swensen, S.E. Williams, P.A. Gadek, C.J. Quinn, L.E. Equiarte, E. Dolenberg, G.H. Learn, Jr., S.W. Graham, S.C.H. Barrett, S. Dayandan, and V.A. Albert. 1993. Phylogenetics of seed plants: An analysis of nucleotide sequences from the plastid gene rbcL. Ann. Mo. Bot. 80:
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Inc., Sunderland, MA.
Fitch, W. M. 1975. Toward finding the tree of maximum parsimony, in Proceedings of the Eighth International Conference on Numerical Taxonomy. W.H. Freeman, San Francisco.
Gladstein, D. S. 1997. Efficient Incremental Character Optimization. Cladistics 13:
Goloboff, P. A. 1993. Character Optimization and Calculation of Tree Lengths. Cladistics 9:
Goloboff, P. A. 1996. Methods for Faster Parsimony Analysis. Cladistics 12:
Goloboff, P. A. 1999. NONA (NO NAme) version 2. Published by author.
Nixon, K. C. 1999. The Parsimony Ratchet, a New Method for Rapid Parsimony Analysis. Cladistics 15:
Rice, K. A. Treezilla Data Sets, http://
Rice, K. A., M.J. Donoghue, and R.G. Olmstead. 1997. Analyzing Large Data Sets: rbcL 500 Revisited. Sys. Biol. 46:
Ronquist, F. 1998. Fast Fitch-Parsimony Algorithms for Large Data Sets. Cladistics 14:
Sankoff, D. 1975. Minimal mutation trees of sequences. SIAM Journal of Applied Mathematics 28:
Snell, Q., Whiting, M., Clement, M., and McLaughlin, D. 2000. Parallel Phylogenetic Inference, in Proceedings of the 2000 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, Dallas, TX.
Swofford, D. L. 2002. PAUP* v4.0b10: Phylogenetic Analysis Using Parsimony * (and other Methods). Sinauer Associates, Inc.
