Resolving Load Balancing Issues in BWA on NUMA Multicore Architectures
Charlotte Herzeel 1,4, Thomas J. Ashby 1,4, Pascal Costanza 3,4, and Wolfgang De Meuter 2

1 imec, Kapeldreef 75, B-3001 Leuven, Belgium, charlotte.herzeel ashby@imec.be
2 Software Languages Lab, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussel, Belgium, wdmeuter@vub.ac.be
3 Intel, Veldkant 31, B-2550 Kontich, Belgium, pascal.costanza@intel.com
4 ExaScience Life Lab, Kapeldreef 75, B-3001 Leuven, Belgium

Abstract. Running BWA in multithreaded mode on a multi-socket server results in poor scaling behaviour. This is because the current parallelisation strategy does not take into account the load imbalance that is inherent to the properties of the data being aligned, e.g. varying read lengths and numbers of mutations. Additional load imbalance is caused by the BWA code not anticipating certain hardware characteristics of multi-socket multicores, such as the non-uniform memory access times of the different cores. We show that rewriting the parallel section using Cilk removes the load imbalance, resulting in a factor two performance improvement over the original BWA.

Keywords: BWA, multithreading, NUMA, load balancing, Cilk

1 Introduction

Burrows-Wheeler Aligner (BWA) [1] by Li and Durbin is a widely used short read alignment tool. It uses the Burrows-Wheeler transformation of the reference genome, which not only minimises the memory needed to store the reference, but also enables a matching strategy that operates in time proportional to the read length. The technique was originally proposed in the context of text compression [5], and the matching process had to be adapted for short read alignment to handle mismatches due to mutations (such as SNPs) and indels [1]. There are different options for handling mismatches, and BWA implements one of them.
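The transform at the heart of this approach is simple to state: append a sentinel character, sort all rotations of the text, and keep the last column. The following sketch is a naive construction for illustration only; BWA builds its index far more efficiently via suffix arrays.

```python
# Naive Burrows-Wheeler transform: sort all rotations of text + "$"
# and take the last character of each rotation. Illustration only;
# real aligners construct the index without materialising rotations.
def bwt(text):
    text += "$"  # unique, lexicographically smallest sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("abracadabra"))  # ard$rcaaaabb
```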
Other short read aligners based on the Burrows-Wheeler transformation, such as Bowtie and SOAP2, use different strategies for mismatches, which are considered to produce faster albeit less accurate results than BWA [1, 6, 7].

In order to reduce the overall execution time, BWA supports multithreaded execution when appropriate hardware resources are available. In this mode, the
reads are evenly distributed over the available cores of a multicore processor so that they can be aligned in parallel. In theory, this should give a linear speedup compared to sequential, single-core execution. To evaluate the effectiveness of BWA's multithreaded mode, we set up a scaling experiment on both a 12-core and a 40-core multi-socket server. The workload we use is a read set from the 1000 Genomes Project [8] (NA20589) with approximately 15 million soft-clipped reads with lengths between bp. Fig. 1 shows the scaling of our workload on an Intel Xeon X5660 processor with 12 cores (2 sockets × 6 cores). BWA does not achieve linear speedup (red line). At 12 threads, BWA achieves a 9x speedup (blue line), 73% of the potential linear speedup. The scaling behaviour of BWA gets worse as the number of cores and sockets of the target machine increases. Fig. 2 shows the scaling behaviour of our workload on a 40-core Intel Xeon E processor (4 sockets × 10 cores). Again, BWA does not achieve linear speedup (red line). At 40 threads, the measured speedup is 13.9x, only 35% of the potential (blue line).

Fig. 1. Scaling of BWA on 12 cores. Red: linear speedup. Blue: measured speedup.

Our hypothesis is that the bad scaling behaviour of BWA is due to the fact that its parallelisation does not take load balancing into account. BWA evenly distributes the reads over the available cores at the beginning of the execution, but this causes load imbalance when different reads require different amounts of time to process. In the worst case, an unlucky core gets all the difficult reads, so that it is still working while the other cores sit idle, having finished aligning their cheap reads. We claim that the cost of aligning reads varies because both the read length and the number of mutations vary across the reads in a workload.
The memory layout and the NUMA architecture of our multi-socket processors also have an impact on the alignment cost of individual reads.
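A toy model makes the static-distribution problem concrete. The per-read costs below are hypothetical, purely for illustration: when each thread receives an equal number of reads, the thread that happens to draw the expensive reads dictates the overall runtime.

```python
# Toy model of static read partitioning with hypothetical per-read costs.
# Each thread gets an equal *count* of reads; the makespan is set by the
# most heavily loaded thread, not by the average load.
def static_makespan(costs, n_threads):
    chunk = len(costs) // n_threads
    loads = [sum(costs[i * chunk:(i + 1) * chunk]) for i in range(n_threads)]
    return max(loads)

costs = [1] * 30 + [10] * 10   # 30 cheap reads followed by 10 expensive ones
print(static_makespan(costs, 4))  # 100, versus a balanced ideal of 130/4 = 32.5
```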
Fig. 2. Scaling of BWA on 40 cores. Red: linear speedup. Blue: measured speedup.

In the rest of this paper, we analyse the cause of the load imbalance in BWA in more detail. We also present a Cilk-based parallelisation strategy that removes the load imbalance, allowing us to achieve more than a factor two speedup compared to the original code.

2 BWA Implementation

Short read alignment is a massively data parallel problem. In a typical workload, millions of reads, up to 100bp (200 characters) long, need to be aligned. Reads are aligned independently of one another. Hence read alignment can be parallelised as a data parallel loop over the read set. Concretely, the BWA code (we always refer to the latest version of BWA, i.e. the bwa download on [2]) sets up a number of pthreads equal to the number of cores on the target processor (configured via the -t parameter). Each pthread executes a sequential alignment loop for an equal share of the reads. Linear speedup for such an implementation is only possible if the work is roughly equal for each pthread; otherwise there is load imbalance. To see whether load imbalance is possible, we inspect the algorithm executed to align reads.

2.1 Burrows-Wheeler Alignment Algorithm

The algorithm underlying the BWA code is well documented [1], but we summarise it briefly to discuss the challenges it presents for efficient multithreaded execution. The Burrows-Wheeler alignment algorithm relies on two auxiliary data structures. These are defined in terms of a compressible version of the reference, which is created via the so-called Burrows-Wheeler
transformation. E.g., BWT(abracadabra) is ard$rcaaaabb. Given the Burrows-Wheeler transformation of the reference, the table c_tab stores for each character c in the (genetic) alphabet how many characters occur in the transformation that are lexicographically smaller than c. A second table, occ_tab, is defined so that a function occ(occ_tab, c, k) returns the number of occurrences of the character c in the prefix BWT(ref)[1..k]. In principle, occ_tab has as many entries per character as the length of the reference, but BWA only stores the information for every 32 characters. For the human reference, occ_tab is around 3GB in size [1].

Given the tables c_tab and occ_tab, finding out where (or whether) a read matches the reference is a simple calculation. Fig. 3 shows pseudo code for matching a read. The code consists of a loop that iterates over the characters of the read (r). Each iteration references occ_tab and c_tab to compute a new starting point (sp) and end point (ep), which represent a range from which the indexes where the read matches the reference can be calculated. The code in Fig. 3 only works for reads that match the reference exactly. For reads with mutations or indels, additional work is needed: multiple alternative matches are checked and explored, with a priority queue directing the order in which the alternatives are considered. It is not important at this point to understand all the details; the structure of the code remains roughly the same as in Fig. 3. What is important to note is that inexact matches require additional work. This is also observed by Li et al. in their GPU implementation of SOAP2 [7].

The code in Fig. 3 embodies certain patterns that have consequences for multithreaded code:

1. The ratio of memory operations versus other operations is high: 28% (computed with Intel VTune Amplifier XE 2013).
Memory operations may have a high latency and stall processors.

2. In standard multicore servers, cores are clustered in sockets. Cores on different sockets have different access times to different regions in memory, cf. non-uniform memory access (NUMA) architectures. A core can access memory on its own socket faster than memory on other, remote sockets. By default, each pthread allocates memory on its own socket. In BWA, the tables c_tab and occ_tab are allocated at the beginning of the execution, before the alignment starts. This means that all pthreads that are not on the first socket have slower access to these tables.

3. Aligning a read that matches some part of the reference exactly is cheaper than aligning a read that has mutations or indels.

4. Reads have varying lengths when quality clipping is used; in our example workload between bp. Since each character of a read is processed by the loop in Fig. 3, longer reads take longer to match.

The above points are all sources of load imbalance amongst the pthreads: certain threads have slower access to the c_tab and occ_tab tables, and certain threads have to handle longer or more mutated reads than others.
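The role of c_tab and occ can be made concrete with a small Python sketch of exact backward search, using the example transform ard$rcaaaabb of abracadabra from above. The naive occ below rescans a prefix on every call, whereas BWA samples occ_tab every 32 characters; the sketch only shows the interval arithmetic.

```python
# Exact backward search over BWT(abracadabra) = "ard$rcaaaabb".
BW = "ard$rcaaaabb"

# c_tab[c]: number of characters in BW lexicographically smaller than c
c_tab = {c: sum(1 for x in BW if x < c) for c in set(BW)}

def occ(bw, c, k):
    # occurrences of c in the 1-based prefix bw[1..k] (naive scan)
    return bw[:k].count(c)

def exact_match_count(read):
    sp, ep = 1, len(BW)            # start from the full interval
    for ch in reversed(read):      # one step per read character
        sp = c_tab[ch] + occ(BW, ch, sp - 1) + 1
        ep = c_tab[ch] + occ(BW, ch, ep)
        if sp > ep:                # empty interval: no exact match
            return 0
    return ep - sp + 1             # number of exact occurrences

print(exact_match_count("abra"))   # 2: "abra" occurs twice in "abracadabra"
print(exact_match_count("cad"))    # 1
```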
def exact_bwc(r):
    n = len(r)
    i = n - 1
    c = r[i]
    sp = c_tab[c] + 1
    ep = c_tab[next_letter(c, abc)]
    j = i - 1
    while not (j < 0 or sp > ep):
        nc = r[j]
        sp = c_tab[nc] + occ(occ_tab, nc, sp - 1) + 1
        ep = c_tab[nc] + occ(occ_tab, nc, ep)
        j -= 1
    return (ep - sp + 1, sp, ep)

Fig. 3. Alignment of a read (exact)

2.2 Measuring Load Imbalance

We can confirm the predicted load imbalance by measuring the average time each pthread needs to align its share of the reads. Fig. 4 shows the time we measure per pthread on our 12-core processor (averages over 5 runs, with the same distribution of reads across cores for each run). Fig. 5 shows the same for the 40-core processor. In both cases, there is a clear load imbalance between the pthreads.

Fig. 4. BWA load imbalance on 12 cores.

3 Removing Load Imbalance with Cilk

Intel Cilk Plus [3, 4] is an extension of C/C++ for task-based parallel programming. It provides constructs for expressing fork/join patterns and parallel
Fig. 5. BWA load imbalance on 40 cores.

for loops. These constructs are mapped onto tasks that are executed by a dynamic work-stealing scheduler. With work stealing, a worker thread is created for each core. Every worker thread has its own task pool, but when a worker thread runs out of tasks, it steals tasks from worker threads that are still busy. This way faster threads take over work from slower threads, balancing the overall workload. The advantage of using Cilk is that load imbalance amongst threads is handled implicitly by the work-stealing scheduler. The programmer simply focuses on identifying and expressing the parallelism.

3.1 Cilk-based Parallelisation

We replace the pthread-based parallel loop in BWA by a Cilk for loop. There are some intricacies in making sure that each worker thread has its own priority queue for intermediate matches, to avoid contention on a shared queue. Our solution is to initialise the priority queues before executing the parallel loop, one for each worker thread. The priority queues are stored in a global array so that they are globally accessible by the worker threads. Inside the for loop, we use Cilk's introspective operator for querying the running worker thread's ID, which we then use to select the priority queue that worker thread accesses.

3.2 Improved Scaling Results

By using Cilk, the scaling behaviour of BWA improves drastically. Fig. 6 compares the Cilk-based scaling (green) with the original pthread code (blue) on a 12-core Intel Xeon X5660 processor (2 sockets × 6 cores). To allow direct comparison, our speedup graphs use the same baseline: 1-threaded execution of unmodified BWA. With the Cilk version, we achieve a 10x speedup, or 82%
of the potential linear speedup (red), compared to 9x or 73% for the pthread version. The results are even better for the 40-core Intel Xeon E processor (4 sockets × 10 cores), cf. Fig. 7. There the Cilk version achieves a 30x speedup, 75% of the potential, versus 13.86x or 35% for the pthread version. The difference in improvement between the 12-core and 40-core processors is due to the fact that the 12-core processor has 2 sockets, whereas the 40-core processor has 4. Hence in the case of the 40-core processor, cores are more distant from each other, and load imbalance due to remote memory access is more severe. Our companion technical report [10] offers a more detailed technical discussion of these findings, as well as additional experiments with different data sets.

Fig. 6. Scaling of BWA on 12 cores using Cilk. Red: linear speedup. Blue: speedup measured for the original pthreads implementation. Green: speedup measured for our Cilk solution.

4 Other Issues

Beyond load balancing, there are a number of other issues with BWA that hamper efficient multithreaded execution.

4.1 Memory Latency and Hyperthreading

When discussing the BWA algorithm in Section 2.1, we saw that the ratio of memory operations versus other operations is high (28%). Memory operations have high latency and stall the processor. This is worsened by the fact that the data access pattern is random, so that both caching and speculation often fail. Hyperthreading can help in multithreaded applications where the threads have
Fig. 7. Scaling of BWA on 40 cores using Cilk. Red: linear speedup. Blue: speedup measured for the original pthreads implementation. Green: speedup measured for our Cilk solution.

bad latencies. We see a positive effect of hyperthreading with the Cilk-based version of BWA, achieving super-linear speedup with regard to the number of cores. In contrast, activating hyperthreading has almost no effect for the original pthread-based BWA (graphs omitted due to space restrictions, see our technical report [10]). Further improvements using prefetching may be possible.

4.2 Parallel versus Sequential Section

The graphs we showed so far only take into account the time spent in the parallel section of BWA. However, the parallel section only comprises the alignment of the different reads; before the reads can be aligned, data structures need to be initialised, e.g. loading the reads from a file into memory. This part of the code makes up the sequential section of BWA, as it is not parallelised. Amdahl's law states that the speedup to expect from parallelising a program is limited by the time spent in the sequential section. Fig. 8 shows the timings for running BWA on 1 to 40 threads on our sample workload. The red part of a timing shows the time spent on sequential execution, whereas the blue part shows the time spent in the parallel section. As the number of threads increases, the time spent in the sequential section becomes a dominating factor in the overall execution time.

Using Amdahl's law, we can predict the scaling behaviour of a program. Fig. 9 shows this for BWA: the red line is the ideal scaling behaviour to expect when the parallel section scales linearly but the sequential section stays constant. The blue and green lines show the speedups we actually measure for both the original BWA code and our Cilk version. The black line shows linear speedup with regard
Fig. 8. Time spent in the parallel (blue) versus the sequential section (red).

to the available cores. We see that the red line is little more than half of this. If we want BWA to get closer to linear speedup and reach the potential of our processor, we need to parallelise or substantially reduce the sequential section.

5 Related Work

Parallel BWA (pBWA) [9] is an MPI-based implementation of BWA for cluster-based alignment, focusing on inter-node parallelism. The improvements its authors claim for the multithreaded mode of BWA on a single node are already integrated in the (latest) version of BWA (bwa-0.6.2) that we adapted. Hence that work is complementary to ours.

6 Conclusions

The multithreaded mode of BWA scales poorly on multi-socket multicore processors because the parallelisation strategy, which evenly distributes the reads amongst the available cores, suffers from load imbalance. We remove the load imbalance by rewriting the parallel section of BWA in Cilk, a task-parallel extension of C/C++ based on a work-stealing scheduler that is capable of dynamically load balancing running programs. Using Cilk, we improve the scaling behaviour of BWA by more than a factor two, as shown by our experiments on both a 12-core and a 40-core processor. We refer the reader to our technical report for experiments with additional data sets and a more detailed discussion [10]. Other issues to investigate in the future include the possible latency and bandwidth problems caused by the high number of memory operations, strategies for further reducing the NUMA penalties, such as replication of data structures, as well as reducing the proportionally large sequential section.
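The Amdahl bound appealed to in Section 4.2 is straightforward to evaluate; the parallel fraction used below is illustrative, not a measured value for BWA.

```python
# Amdahl's law: if a fraction p of the single-thread runtime is
# parallelisable, n cores yield at most 1 / ((1 - p) + p / n) speedup.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even a 10% sequential section caps a 40-core machine below 8.2x.
print(round(amdahl_speedup(0.9, 40), 2))  # 8.16
```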
Fig. 9. BWA overall scaling versus potential scaling. Black: linear speedup. Red: theoretical speedup via Amdahl's law, when the parallel section would scale linearly. Blue: speedup measured for the original BWA parallelisation using pthreads. Green: speedup measured for our Cilk solution.

Acknowledgments. This work is funded by Intel, Janssen Pharmaceutica and by the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT).

References

1. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14) (2009)
2. Burrows-Wheeler Aligner
3. Leiserson, C.E.: The Cilk++ concurrency platform. The Journal of Supercomputing 51(3), Kluwer Academic Publishers (2010)
4. Intel Cilk Plus
5. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: 41st IEEE Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, USA (2000)
6. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25 (2009)
7. Li, R., Yu, C., et al.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics (15) (2009)
8. 1000 Genomes Project
9. Peters, D., Luo, X., Qiu, K., Liang, P.: Speeding Up Large-Scale Next Generation Sequencing Data Analysis with pBWA. J. Appl. Bioinform. Comput. Biol. (2012)
10. Herzeel, C., Costanza, P., Ashby, T., Wuyts, R.: Performance Analysis of BWA Alignment. Technical report, ExaScience Life Lab (2013)
More informationChapter 7. Multicores, Multiprocessors, and Clusters. Goal: connecting multiple computers to get higher performance
Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)
More informationA Nearest Neighbors Algorithm for Strings. R. Lederman Technical Report YALEU/DCS/TR-1453 April 5, 2012
A randomized algorithm is presented for fast nearest neighbors search in libraries of strings. The algorithm is discussed in the context of one of the practical applications: aligning DNA reads to a reference
More informationNEXT Generation sequencers have a very high demand
1358 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 5, MAY 2016 Hardware-Acceleration of Short-Read Alignment Based on the Burrows-Wheeler Transform Hasitha Muthumala Waidyasooriya,
More informationOptimize Data Structures and Memory Access Patterns to Improve Data Locality
Optimize Data Structures and Memory Access Patterns to Improve Data Locality Abstract Cache is one of the most important resources
More informationIntroduction to OpenMP
Introduction to OpenMP Lecture 9: Performance tuning Sources of overhead There are 6 main causes of poor performance in shared memory parallel programs: sequential code communication load imbalance synchronisation
More informationApplication Programming
Multicore Application Programming For Windows, Linux, and Oracle Solaris Darryl Gove AAddison-Wesley Upper Saddle River, NJ Boston Indianapolis San Francisco New York Toronto Montreal London Munich Paris
More informationMapping Reads to Reference Genome
Mapping Reads to Reference Genome DNA carries genetic information DNA is a double helix of two complementary strands formed by four nucleotides (bases): Adenine, Cytosine, Guanine and Thymine 2 of 31 Gene
More informationUSING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)
USING BRAT-BW-2.0.1 BRAT-bw is a tool for BS-seq reads mapping, i.e. mapping of bisulfite-treated sequenced reads. BRAT-bw is a part of BRAT s suit. Therefore, input and output formats for BRAT-bw are
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationAdvanced Topics UNIT 2 PERFORMANCE EVALUATIONS
Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors
More informationIntroduction to Parallel Computing
Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen
More informationParallel Programming Patterns Overview and Concepts
Parallel Programming Patterns Overview and Concepts Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.
More informationOn the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters
1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk
More informationAccurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing
Proposal for diploma thesis Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing Astrid Rheinländer 01-09-2010 Supervisor: Prof. Dr. Ulf Leser Motivation
More informationVLPL-S Optimization on Knights Landing
VLPL-S Optimization on Knights Landing 英特尔软件与服务事业部 周姗 2016.5 Agenda VLPL-S 性能分析 VLPL-S 性能优化 总结 2 VLPL-S Workload Descriptions VLPL-S is the in-house code from SJTU, paralleled with MPI and written in C++.
More informationComputer Architecture
Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors
More informationSEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi
SEASHORE SARUMAN Summary 1 / 24 SEASHORE / SARUMAN Short Read Matching using GPU Programming Tobias Jakobi Center for Biotechnology (CeBiTec) Bioinformatics Resource Facility (BRF) Bielefeld University
More informationEnabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report
Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Ameya Velingker and Dougal J. Sutherland {avelingk, dsutherl}@cs.cmu.edu http://www.cs.cmu.edu/~avelingk/compilers/
More informationIntel VTune Amplifier XE
Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance
More informationIntroduction to parallel computing
Introduction to parallel computing 3. Parallel Software Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Last time Parallel hardware Multi-core
More informationOVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI
CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationShort Read Alignment Algorithms
Short Read Alignment Algorithms Raluca Gordân Department of Biostatistics and Bioinformatics Department of Computer Science Department of Molecular Genetics and Microbiology Center for Genomic and Computational
More informationLam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM.
Title High throughput short read alignment via bi-directional BWT Author(s) Lam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM Citation The IEEE International Conference on Bioinformatics and Biomedicine
More informationImproving Virtual Machine Scheduling in NUMA Multicore Systems
Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore
More informationMultiprocessor Systems. Chapter 8, 8.1
Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor
More information27. Parallel Programming I
771 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling
More informationOptimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*
Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating
More informationShared Memory vs. Message Passing: the COMOPS Benchmark Experiment
Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment Yong Luo Scientific Computing Group CIC-19 Los Alamos National Laboratory Los Alamos, NM 87545, U.S.A. Email: yongl@lanl.gov, Fax: (505)
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationSlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching
SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching Ilya Y. Zhbannikov 1, Samuel S. Hunter 1,2, Matthew L. Settles 1,2, and James
More informationSoftware and Tools for HPE s The Machine Project
Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric
More informationHISAT2. Fast and sensi0ve alignment against general human popula0on. Daehwan Kim
HISA2 Fast and sensi0ve alignment against general human popula0on Daehwan Kim infphilo@gmail.com History about BW, FM, XBW, GBW, and GFM BW (1994) BW for Linear path Burrows M, Wheeler DJ: A Block Sor0ng
More informationAdaptive Matrix Transpose Algorithms for Distributed Multicore Processors
Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures
More informationArchitecture without explicit locks for logic simulation on SIMD machines
Architecture without explicit locks for logic on machines M. Chimeh Department of Computer Science University of Glasgow UKMAC, 2016 Contents 1 2 3 4 5 6 The Using models to replicate the behaviour of
More informationHigh-throughput Sequence Alignment using Graphics Processing Units
High-throughput Sequence Alignment using Graphics Processing Units Michael Schatz & Cole Trapnell May 21, 2009 UMD NVIDIA CUDA Center Of Excellence Presentation Searching Wikipedia How do you find all
More informationIntel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage
Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Evaluation of Lustre File System software enhancements for improved Metadata performance Wojciech Turek, Paul Calleja,John
More informationIssues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance
More informationParallel Programming Multicore systems
FYS3240 PC-based instrumentation and microcontrollers Parallel Programming Multicore systems Spring 2011 Lecture #9 Bekkeng, 4.4.2011 Introduction Until recently, innovations in processor technology have
More informationChapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls
More informationTDDD56 Multicore and GPU computing Lab 2: Non-blocking data structures
TDDD56 Multicore and GPU computing Lab 2: Non-blocking data structures August Ernstsson, Nicolas Melot august.ernstsson@liu.se November 2, 2017 1 Introduction The protection of shared data structures against
More informationDesign of Parallel Algorithms. Course Introduction
+ Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:
More informationMapping NGS reads for genomics studies
Mapping NGS reads for genomics studies Valencia, 28-30 Sep 2015 BIER Alejandro Alemán aaleman@cipf.es Genomics Data Analysis CIBERER Where are we? Fastq Sequence preprocessing Fastq Alignment BAM Visualization
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationAUTOMATIC SMT THREADING
AUTOMATIC SMT THREADING FOR OPENMP APPLICATIONS ON THE INTEL XEON PHI CO-PROCESSOR WIM HEIRMAN 1,2 TREVOR E. CARLSON 1 KENZO VAN CRAEYNEST 1 IBRAHIM HUR 2 AAMER JALEEL 2 LIEVEN EECKHOUT 1 1 GHENT UNIVERSITY
More informationDISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA
DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA M. GAUS, G. R. JOUBERT, O. KAO, S. RIEDEL AND S. STAPEL Technical University of Clausthal, Department of Computer Science Julius-Albert-Str. 4, 38678
More information