Resolving Load Balancing Issues in BWA on NUMA Multicore Architectures
Charlotte Herzeel 1,4, Thomas J. Ashby 1,4, Pascal Costanza 3,4, and Wolfgang De Meuter 2

1 imec, Kapeldreef 75, B-3001 Leuven, Belgium, charlotte.herzeel ashby@imec.be
2 Software Languages Lab, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussel, Belgium, wdmeuter@vub.ac.be
3 Intel, Veldkant 31, B-2550 Kontich, Belgium, pascal.costanza@intel.com
4 ExaScience Life Lab, Kapeldreef 75, B-3001 Leuven, Belgium

Abstract. Running BWA in multithreaded mode on a multi-socket server results in poor scaling behaviour. This is because the current parallelisation strategy does not take into account the load imbalance that is inherent to the properties of the data being aligned, e.g. varying read lengths and numbers of mutations. Additional load imbalance is caused by the BWA code not anticipating certain hardware characteristics of multi-socket multicores, such as the non-uniform memory access times of the different cores. We show that rewriting the parallel section using Cilk removes the load imbalance, resulting in a factor two performance improvement over the original BWA.

Keywords: BWA, multithreading, NUMA, load balancing, Cilk

1 Introduction

Burrows-Wheeler Aligner (BWA) [1] by Li and Durbin is a widely used short read alignment tool. It uses the Burrows-Wheeler transformation of the reference genome, which not only minimises the memory needed to store the reference, but also enables a matching strategy that operates in time proportional to the read length. The technique was originally proposed in the context of text compression [5], and the matching process had to be adapted for short read alignment to handle mismatches due to mutations (such as SNPs) and indels [1]. There are different options for handling mismatches, and BWA implements one of them.
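The transform at the heart of this approach is simple to state: append a sentinel character, sort all rotations of the text, and keep the last column. The following sketch is a naive construction for illustration only; BWA builds its index far more efficiently via suffix arrays.

```python
# Naive Burrows-Wheeler transform: sort all rotations of text + "$"
# and take the last character of each rotation. Illustration only;
# real aligners construct the index without materialising rotations.
def bwt(text):
    text += "$"  # unique, lexicographically smallest sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("abracadabra"))  # ard$rcaaaabb
```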
Other short read aligners based on the Burrows-Wheeler transformation, such as Bowtie and SOAP2, use different strategies for mismatches, which are considered to produce faster albeit less accurate results than BWA [1, 6, 7].

In order to reduce the overall execution time, BWA supports multithreaded execution when appropriate hardware resources are available. In this mode, the
reads are evenly distributed over the available cores of a multicore processor so that they can be aligned in parallel. In theory, this should give a linear speedup compared to sequential, single-core execution. To evaluate the effectiveness of BWA's multithreaded mode, we set up a scaling experiment on both a 12-core and a 40-core multi-socket server. The workload we use is a read set from the 1000 Genomes Project [8] (NA20589) with approximately 15 million soft-clipped reads with lengths between bp. Fig. 1 shows the scaling of our workload on an Intel Xeon X5660 processor with 12 cores (2 sockets × 6 cores). BWA does not achieve linear speedup (red line). At 12 threads, BWA achieves a 9x speedup (blue line), 73% of the potential linear speedup. The scaling behaviour of BWA gets worse as the number of cores and sockets of the target machine increases. Fig. 2 shows the scaling behaviour of our workload on a 40-core Intel Xeon E processor (4 sockets × 10 cores). Again, BWA does not achieve linear speedup (red line). At 40 threads, the measured speedup is 13.9x, only 35% of the potential (blue line).

Fig. 1. Scaling of BWA on 12 cores. Red: linear speedup. Blue: measured speedup.

Our hypothesis is that the bad scaling behaviour of BWA is due to the fact that its parallelisation does not take load balancing into account. BWA evenly distributes the reads over the available cores at the beginning of the execution, but this causes load imbalance when different reads require different amounts of time to process. In the worst case, an unlucky core gets all the difficult reads, so that it is still working while the other cores sit idle, having finished aligning their cheap reads. We claim that the cost of aligning reads varies because both the read length and the number of mutations vary across the reads in a workload.
The memory layout and the NUMA architecture of our multi-socket processors also have an impact on the alignment cost of individual reads.
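A toy model makes the static-distribution problem concrete. The per-read costs below are hypothetical, purely for illustration: when each thread receives an equal number of reads, the thread that happens to draw the expensive reads dictates the overall runtime.

```python
# Toy model of static read partitioning with hypothetical per-read costs.
# Each thread gets an equal *count* of reads; the makespan is set by the
# most heavily loaded thread, not by the average load.
def static_makespan(costs, n_threads):
    chunk = len(costs) // n_threads
    loads = [sum(costs[i * chunk:(i + 1) * chunk]) for i in range(n_threads)]
    return max(loads)

costs = [1] * 30 + [10] * 10   # 30 cheap reads followed by 10 expensive ones
print(static_makespan(costs, 4))  # 100, versus a balanced ideal of 130/4 = 32.5
```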
Fig. 2. Scaling of BWA on 40 cores. Red: linear speedup. Blue: measured speedup.

In the rest of this paper, we analyse the cause of the load imbalance in BWA in more detail. We also present a Cilk-based parallelisation strategy that removes the load imbalance, allowing us to achieve more than a factor two speedup compared to the original code.

2 BWA Implementation

Short read alignment is a massively data parallel problem. In a typical workload, millions of reads, up to 100bp (200 characters) long, need to be aligned. Reads are aligned independently of one another. Hence read alignment can be parallelised as a data parallel loop over the read set. Concretely, the BWA code (we always refer to the latest version of BWA, i.e. the bwa download on [2]) sets up a number of pthreads equal to the number of cores on the target processor (configured via the -t parameter). Each pthread executes a sequential alignment loop for an equal share of the reads. Linear speedup for such an implementation is only possible if the work is roughly equal for each pthread; otherwise there is load imbalance. To see whether load imbalance is possible, we inspect the algorithm executed to align reads.

2.1 Burrows-Wheeler Alignment Algorithm

The algorithm underlying the BWA code is well documented [1], but we summarise it briefly to discuss the challenges it presents for efficient multithreaded execution. The Burrows-Wheeler alignment algorithm relies on two auxiliary data structures. These are defined in terms of a compressible version of the reference, which is created via the so-called Burrows-Wheeler
transformation. E.g., BWT(abracadabra) is ard$rcaaaabb. Given the Burrows-Wheeler transformation of the reference, the table c_tab stores for each character c in the (genetic) alphabet how many characters occur in the transformation that are lexicographically smaller than c. A second table, occ_tab, is defined so that a function occ(occ_tab, c, k) returns the number of occurrences of the character c in the prefix BWT(ref)[1..k]. In principle, occ_tab has as many entries per character as the length of the reference, but BWA only stores the information for every 32 characters. For the human reference, occ_tab is around 3GB in size [1].

Given the tables c_tab and occ_tab, finding out where (or whether) a read matches the reference is a simple calculation. Fig. 3 shows pseudo code for matching a read. The code consists of a loop that iterates over the characters of the read (r). Each iteration references occ_tab and c_tab to compute a new starting point (sp) and end point (ep), which represent a range from which the indexes where the read matches the reference can be calculated. The code in Fig. 3 only works for reads that match the reference exactly. For reads with mutations or indels, additional work is needed: multiple alternative matches are checked and explored, with a priority queue directing the order in which the alternatives are considered. It is not important at this point to understand all the details; the structure of the code remains roughly the same as in Fig. 3. What is important to note is that inexact matches require additional work. This is also observed by Li et al. in their GPU implementation of SOAP2 [7].

The code in Fig. 3 embodies certain patterns that have consequences for multithreaded code:

1. The ratio of memory operations versus other operations is high: 28% (computed with Intel VTune Amplifier XE 2013).
Memory operations may have a high latency and stall processors.

2. In standard multicore servers, cores are clustered in sockets. Cores on different sockets have different access times to different regions in memory, cf. non-uniform memory access (NUMA) architectures. A core can access memory on its own socket faster than memory on other, remote sockets. By default, each pthread allocates memory on its own socket. In BWA, the tables c_tab and occ_tab are allocated at the beginning of the execution, before the alignment starts. This means that all pthreads that are not on the first socket have slower access to these tables.

3. Aligning a read that matches some part of the reference exactly is cheaper than aligning a read that has mutations or indels.

4. Reads have varying lengths when quality clipping is used; in our example workload between bp. Since each character of a read is processed by the loop in Fig. 3, longer reads take longer to match.

The above points are all sources of load imbalance amongst the pthreads: certain threads have slower access to the c_tab and occ_tab tables, and certain threads have to handle longer or more mutated reads than others.
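The role of c_tab and occ can be made concrete with a small Python sketch of exact backward search, using the example transform ard$rcaaaabb of abracadabra from above. The naive occ below rescans a prefix on every call, whereas BWA samples occ_tab every 32 characters; the sketch only shows the interval arithmetic.

```python
# Exact backward search over BWT(abracadabra) = "ard$rcaaaabb".
BW = "ard$rcaaaabb"

# c_tab[c]: number of characters in BW lexicographically smaller than c
c_tab = {c: sum(1 for x in BW if x < c) for c in set(BW)}

def occ(bw, c, k):
    # occurrences of c in the 1-based prefix bw[1..k] (naive scan)
    return bw[:k].count(c)

def exact_match_count(read):
    sp, ep = 1, len(BW)            # start from the full interval
    for ch in reversed(read):      # one step per read character
        sp = c_tab[ch] + occ(BW, ch, sp - 1) + 1
        ep = c_tab[ch] + occ(BW, ch, ep)
        if sp > ep:                # empty interval: no exact match
            return 0
    return ep - sp + 1             # number of exact occurrences

print(exact_match_count("abra"))   # 2: "abra" occurs twice in "abracadabra"
print(exact_match_count("cad"))    # 1
```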
def exact_bwc(r):
    n = len(r)
    i = n - 1
    c = r[i]
    sp = c_tab[c] + 1
    ep = c_tab[next_letter(c, abc)]
    j = i - 1
    while not (j < 0 or sp > ep):
        nc = r[j]
        sp = c_tab[nc] + occ(occ_tab, nc, sp - 1) + 1
        ep = c_tab[nc] + occ(occ_tab, nc, ep)
        j -= 1
    return (ep - sp + 1, sp, ep)

Fig. 3. Alignment of a read (exact)

2.2 Measuring Load Imbalance

We can confirm the predicted load imbalance by measuring the average time each pthread needs to align its share of the reads. Fig. 4 shows the time we measure per pthread on our 12-core processor (averages over 5 runs, with the same distribution of reads across cores for each run). Fig. 5 shows the same for the 40-core processor. In both cases, there is a clear load imbalance between the pthreads.

Fig. 4. BWA load imbalance on 12 cores.

3 Removing Load Imbalance with Cilk

Intel Cilk Plus [3, 4] is an extension of C/C++ for task-based parallel programming. It provides constructs for expressing fork/join patterns and parallel
Fig. 5. BWA load imbalance on 40 cores.

for loops. These constructs are mapped onto tasks that are executed by a dynamic work-stealing scheduler. With work stealing, a worker thread is created for each core. Every worker thread has its own task pool, but when a worker thread runs out of tasks, it steals tasks from worker threads that are still busy. This way faster threads take over work from slower threads, balancing the overall workload. The advantage of using Cilk is that load imbalance amongst threads is handled implicitly by the work-stealing scheduler. The programmer simply focuses on identifying and expressing the parallelism.

3.1 Cilk-based Parallelisation

We replace the pthread-based parallel loop in BWA by a Cilk for loop. There are some intricacies in making sure that each worker thread has its own priority queue for intermediate matches, to avoid contention on a shared queue. Our solution is to initialise the priority queues before executing the parallel loop, one for each worker thread. The priority queues are stored in a global array so that they are globally accessible by the worker threads. Inside the for loop, we use Cilk's introspective operator for querying the running worker thread's ID, which we then use to select the priority queue that worker thread accesses.

3.2 Improved Scaling Results

By using Cilk, the scaling behaviour of BWA improves drastically. Fig. 6 compares the Cilk-based scaling (green) with the original pthread code (blue) on a 12-core Intel Xeon X5660 processor (2 sockets × 6 cores). To allow direct comparison, our speedup graphs use the same baseline: 1-threaded execution of unmodified BWA. With the Cilk version, we achieve a 10x speedup, or 82%
of the potential linear speedup (red), compared to 9x or 73% for the pthread version. The results are even better for the 40-core Intel Xeon E processor (4 sockets × 10 cores), cf. Fig. 7. There the Cilk version achieves a 30x speedup, 75% of the potential, versus 13.86x or 35% for the pthread version. The difference in improvement between the 12-core and 40-core processors is due to the fact that the 12-core processor has 2 sockets, whereas the 40-core processor has 4. Hence in the case of the 40-core processor, cores are more distant from each other, and load imbalance due to remote memory access is more severe. Our companion technical report [10] offers a more detailed technical discussion of these findings, as well as additional experiments with different data sets.

Fig. 6. Scaling of BWA on 12 cores using Cilk. Red: linear speedup. Blue: speedup measured for the original pthreads implementation. Green: speedup measured for our Cilk solution.

4 Other Issues

Beyond load balancing, there are a number of other issues with BWA that hamper efficient multithreaded execution.

4.1 Memory Latency and Hyperthreading

When discussing the BWA algorithm in Section 2.1, we saw that the ratio of memory operations versus other operations is high (28%). Memory operations have high latency and stall the processor. This is worsened by the fact that the data access pattern is random, so that both caching and speculation often fail. Hyperthreading can help in multithreaded applications where the threads have
Fig. 7. Scaling of BWA on 40 cores using Cilk. Red: linear speedup. Blue: speedup measured for the original pthreads implementation. Green: speedup measured for our Cilk solution.

bad latencies. We see a positive effect of hyperthreading with the Cilk-based version of BWA, achieving super-linear speedup with regard to the number of cores. In contrast, activating hyperthreading has almost no effect for the original pthread-based BWA (graphs omitted due to space restrictions, see our technical report [10]). Further improvements using prefetching may be possible.

4.2 Parallel versus Sequential Section

The graphs we showed so far only take into account the time spent in the parallel section of BWA. However, the parallel section only comprises the alignment of the different reads; before the reads can be aligned, data structures need to be initialised, e.g. loading the reads from a file into memory. This part of the code makes up the sequential section of BWA, as it is not parallelised. Amdahl's law states that the speedup to expect from parallelising a program is limited by the time spent in the sequential section. Fig. 8 shows the timings for running BWA on 1 to 40 threads on our sample workload. The red part of a timing shows the time spent on sequential execution, whereas the blue part shows the time spent in the parallel section. As the number of threads increases, the time spent in the sequential section becomes a dominating factor in the overall execution time.

Using Amdahl's law, we can predict the scaling behaviour of a program. Fig. 9 shows this for BWA: the red line is the ideal scaling behaviour to expect when the parallel section scales linearly but the sequential section stays constant. The blue and green lines show the speedups we actually measure for both the original BWA code and our Cilk version. The black line shows linear speedup with regard
Fig. 8. Time spent in the parallel (blue) versus the sequential section (red).

to the available cores. We see that the red line is little more than half of this. If we want BWA to get closer to linear speedup and reach the potential of our processor, we need to parallelise or substantially reduce the sequential section.

5 Related Work

Parallel BWA (pBWA) [9] is an MPI-based implementation of BWA for cluster-based alignment, focusing on inter-node parallelism. The improvements its authors claim for the multithreaded mode of BWA on a single node are already integrated in the (latest) version of BWA (bwa-0.6.2) that we adapted. Hence that work is complementary to ours.

6 Conclusions

The multithreaded mode of BWA scales poorly on multi-socket multicore processors because the parallelisation strategy, which evenly distributes the reads amongst the available cores, suffers from load imbalance. We remove the load imbalance by rewriting the parallel section of BWA in Cilk, a task-parallel extension of C/C++ based on a work-stealing scheduler that is capable of dynamically load balancing running programs. Using Cilk, we improve the scaling behaviour of BWA by more than a factor two, as shown by our experiments on both a 12-core and a 40-core processor. We refer the reader to our technical report for experiments with additional data sets and a more detailed discussion [10]. Other issues to investigate in the future include the possible latency and bandwidth problems caused by the high number of memory operations, strategies for further reducing the NUMA penalties, such as replication of data structures, as well as reducing the proportionally large sequential section.
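The Amdahl bound appealed to in Section 4.2 is straightforward to evaluate; the parallel fraction used below is illustrative, not a measured value for BWA.

```python
# Amdahl's law: if a fraction p of the single-thread runtime is
# parallelisable, n cores yield at most 1 / ((1 - p) + p / n) speedup.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even a 10% sequential section caps a 40-core machine below 8.2x.
print(round(amdahl_speedup(0.9, 40), 2))  # 8.16
```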
Fig. 9. BWA overall scaling versus potential scaling. Black: linear speedup. Red: theoretical speedup via Amdahl's law, when the parallel section would scale linearly. Blue: speedup measured for the original BWA parallelisation using pthreads. Green: speedup measured for our Cilk solution.

Acknowledgments. This work is funded by Intel, Janssen Pharmaceutica and by the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT).

References

1. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14) (2009)
2. Burrows-Wheeler Aligner
3. Leiserson, C.E.: The Cilk++ concurrency platform. The Journal of Supercomputing 51(3), Kluwer Academic Publishers (2010)
4. Intel Cilk Plus
5. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: 41st IEEE Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, USA (2000)
6. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25 (2009)
7. Li, R., Yu, C., et al.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics (15) (2009)
8. 1000 Genomes Project
9. Peters, D., Luo, X., Qiu, K., Liang, P.: Speeding Up Large-Scale Next Generation Sequencing Data Analysis with pBWA. J. Appl. Bioinform. Comput. Biol. (2012)
10. Herzeel, C., Costanza, P., Ashby, T., Wuyts, R.: Performance Analysis of BWA Alignment. Technical report, ExaScience Life Lab (2013)
More informationChapter 7. Multicores, Multiprocessors, and Clusters. Goal: connecting multiple computers to get higher performance
Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)
More informationA Nearest Neighbors Algorithm for Strings. R. Lederman Technical Report YALEU/DCS/TR-1453 April 5, 2012
A randomized algorithm is presented for fast nearest neighbors search in libraries of strings. The algorithm is discussed in the context of one of the practical applications: aligning DNA reads to a reference
More informationNEXT Generation sequencers have a very high demand
1358 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 5, MAY 2016 Hardware-Acceleration of Short-Read Alignment Based on the Burrows-Wheeler Transform Hasitha Muthumala Waidyasooriya,
More informationOptimize Data Structures and Memory Access Patterns to Improve Data Locality
Optimize Data Structures and Memory Access Patterns to Improve Data Locality Abstract Cache is one of the most important resources
More informationIntroduction to OpenMP
Introduction to OpenMP Lecture 9: Performance tuning Sources of overhead There are 6 main causes of poor performance in shared memory parallel programs: sequential code communication load imbalance synchronisation
More informationApplication Programming
Multicore Application Programming For Windows, Linux, and Oracle Solaris Darryl Gove AAddison-Wesley Upper Saddle River, NJ Boston Indianapolis San Francisco New York Toronto Montreal London Munich Paris
More informationMapping Reads to Reference Genome
Mapping Reads to Reference Genome DNA carries genetic information DNA is a double helix of two complementary strands formed by four nucleotides (bases): Adenine, Cytosine, Guanine and Thymine 2 of 31 Gene
More informationUSING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)
USING BRAT-BW-2.0.1 BRAT-bw is a tool for BS-seq reads mapping, i.e. mapping of bisulfite-treated sequenced reads. BRAT-bw is a part of BRAT s suit. Therefore, input and output formats for BRAT-bw are
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationAdvanced Topics UNIT 2 PERFORMANCE EVALUATIONS
Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors
More informationIntroduction to Parallel Computing
Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen
More informationParallel Programming Patterns Overview and Concepts
Parallel Programming Patterns Overview and Concepts Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.
More informationOn the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters
1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk
More informationAccurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing
Proposal for diploma thesis Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing Astrid Rheinländer 01-09-2010 Supervisor: Prof. Dr. Ulf Leser Motivation
More informationVLPL-S Optimization on Knights Landing
VLPL-S Optimization on Knights Landing 英特尔软件与服务事业部 周姗 2016.5 Agenda VLPL-S 性能分析 VLPL-S 性能优化 总结 2 VLPL-S Workload Descriptions VLPL-S is the in-house code from SJTU, paralleled with MPI and written in C++.
More informationComputer Architecture
Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors
More informationSEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi
SEASHORE SARUMAN Summary 1 / 24 SEASHORE / SARUMAN Short Read Matching using GPU Programming Tobias Jakobi Center for Biotechnology (CeBiTec) Bioinformatics Resource Facility (BRF) Bielefeld University
More informationEnabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report
Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Ameya Velingker and Dougal J. Sutherland {avelingk, dsutherl}@cs.cmu.edu http://www.cs.cmu.edu/~avelingk/compilers/
More informationIntel VTune Amplifier XE
Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance
More informationIntroduction to parallel computing
Introduction to parallel computing 3. Parallel Software Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Last time Parallel hardware Multi-core
More informationOVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI
CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationShort Read Alignment Algorithms
Short Read Alignment Algorithms Raluca Gordân Department of Biostatistics and Bioinformatics Department of Computer Science Department of Molecular Genetics and Microbiology Center for Genomic and Computational
More informationLam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM.
Title High throughput short read alignment via bi-directional BWT Author(s) Lam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM Citation The IEEE International Conference on Bioinformatics and Biomedicine
More informationImproving Virtual Machine Scheduling in NUMA Multicore Systems
Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore
More informationMultiprocessor Systems. Chapter 8, 8.1
Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor
More information27. Parallel Programming I
771 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling
More informationOptimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*
Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating
More informationShared Memory vs. Message Passing: the COMOPS Benchmark Experiment
Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment Yong Luo Scientific Computing Group CIC-19 Los Alamos National Laboratory Los Alamos, NM 87545, U.S.A. Email: yongl@lanl.gov, Fax: (505)
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationSlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching
SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching Ilya Y. Zhbannikov 1, Samuel S. Hunter 1,2, Matthew L. Settles 1,2, and James
More informationSoftware and Tools for HPE s The Machine Project
Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric
More informationHISAT2. Fast and sensi0ve alignment against general human popula0on. Daehwan Kim
HISA2 Fast and sensi0ve alignment against general human popula0on Daehwan Kim infphilo@gmail.com History about BW, FM, XBW, GBW, and GFM BW (1994) BW for Linear path Burrows M, Wheeler DJ: A Block Sor0ng
More informationAdaptive Matrix Transpose Algorithms for Distributed Multicore Processors
Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures
More informationArchitecture without explicit locks for logic simulation on SIMD machines
Architecture without explicit locks for logic on machines M. Chimeh Department of Computer Science University of Glasgow UKMAC, 2016 Contents 1 2 3 4 5 6 The Using models to replicate the behaviour of
More informationHigh-throughput Sequence Alignment using Graphics Processing Units
High-throughput Sequence Alignment using Graphics Processing Units Michael Schatz & Cole Trapnell May 21, 2009 UMD NVIDIA CUDA Center Of Excellence Presentation Searching Wikipedia How do you find all
More informationIntel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage
Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Evaluation of Lustre File System software enhancements for improved Metadata performance Wojciech Turek, Paul Calleja,John
More informationIssues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance
More informationParallel Programming Multicore systems
FYS3240 PC-based instrumentation and microcontrollers Parallel Programming Multicore systems Spring 2011 Lecture #9 Bekkeng, 4.4.2011 Introduction Until recently, innovations in processor technology have
More informationChapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls
More informationTDDD56 Multicore and GPU computing Lab 2: Non-blocking data structures
TDDD56 Multicore and GPU computing Lab 2: Non-blocking data structures August Ernstsson, Nicolas Melot august.ernstsson@liu.se November 2, 2017 1 Introduction The protection of shared data structures against
More informationDesign of Parallel Algorithms. Course Introduction
+ Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:
More informationMapping NGS reads for genomics studies
Mapping NGS reads for genomics studies Valencia, 28-30 Sep 2015 BIER Alejandro Alemán aaleman@cipf.es Genomics Data Analysis CIBERER Where are we? Fastq Sequence preprocessing Fastq Alignment BAM Visualization
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationAUTOMATIC SMT THREADING
AUTOMATIC SMT THREADING FOR OPENMP APPLICATIONS ON THE INTEL XEON PHI CO-PROCESSOR WIM HEIRMAN 1,2 TREVOR E. CARLSON 1 KENZO VAN CRAEYNEST 1 IBRAHIM HUR 2 AAMER JALEEL 2 LIEVEN EECKHOUT 1 1 GHENT UNIVERSITY
More informationDISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA
DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA M. GAUS, G. R. JOUBERT, O. KAO, S. RIEDEL AND S. STAPEL Technical University of Clausthal, Department of Computer Science Julius-Albert-Str. 4, 38678
More information