Resolving Load Balancing Issues in BWA on NUMA Multicore Architectures


Charlotte Herzeel 1,4, Thomas J. Ashby 1,4, Pascal Costanza 3,4, and Wolfgang De Meuter 2

1 imec, Kapeldreef 75, B-3001 Leuven, Belgium, charlotte.herzeel ashby@imec.be
2 Software Languages Lab, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussel, Belgium, wdmeuter@vub.ac.be
3 Intel, Veldkant 31, B-2550 Kontich, Belgium, pascal.costanza@intel.com
4 ExaScience Life Lab, Kapeldreef 75, B-3001 Leuven, Belgium

Abstract. Running BWA in multithreaded mode on a multi-socket server results in poor scaling behaviour. This is because the current parallelisation strategy does not take into account the load imbalance that is inherent to the properties of the data being aligned, e.g. varying read lengths and numbers of mutations. Additional load imbalance is caused by the BWA code not anticipating certain hardware characteristics of multi-socket multicores, such as the non-uniform memory access time of the different cores. We show that rewriting the parallel section using Cilk removes the load imbalance, resulting in a factor-two performance improvement over the original BWA.

Keywords: BWA, multithreading, NUMA, load balancing, Cilk

1 Introduction

Burrows-Wheeler Aligner (BWA) [1] by Li and Durbin is a widely used short read alignment tool. It uses the Burrows-Wheeler transformation of the reference genome, which not only minimises the memory needed to store the reference, but also allows for a matching strategy whose running time is on the order of the read length. The technique was originally proposed in the context of text compression [5], and the matching process had to be adapted for short read alignment to handle mismatches due to mutations (such as SNPs) and indels [1]. There are different options for handling mismatches, and BWA presents one solution.
Other short read aligners based on the Burrows-Wheeler transformation, such as Bowtie and SOAP2, use different strategies for mismatches, which are considered to produce faster albeit less accurate results than BWA [1, 6, 7].

In order to reduce the overall execution time, BWA supports multithreaded execution when appropriate hardware resources are available. In this mode, the

reads are evenly distributed over the available cores of a multicore processor so that they can be aligned in parallel. In theory, this should give a linear speedup compared to sequential or single-core execution.

To evaluate the effectiveness of BWA's multithreaded mode, we set up a scaling experiment on both a 12-core and a 40-core multi-socket processor. The workload we use is a read set from the 1000 Genomes Project [8] (NA20589) with approximately 15 million soft-clipped reads between bp.

Fig. 1 shows the scaling of our workload on an Intel Xeon X5660 processor with 12 cores (2 sockets × 6 cores). BWA (blue line) does not achieve linear speedup (red line). At 12 threads, we measure a 9x speedup, 73% of the potential linear speedup. The scaling behaviour of BWA gets worse as the number of cores and sockets of the target processor increases. Fig. 2 shows the scaling behaviour of our workload on a 40-core Intel Xeon E processor (4 sockets × 10 cores). Again, BWA (blue line) does not achieve linear speedup (red line). At 40 cores, we measure a speedup of 13.9x, only 35% of the potential.

Fig. 1. Scaling of BWA on 12 cores. Red: linear speedup. Blue: measured speedup.

Our hypothesis is that the bad scaling behaviour of BWA is due to the fact that the parallelisation of BWA does not take load balancing into account. BWA evenly distributes the reads over the available cores at the beginning of the execution, but this may cause load imbalance when different reads require different times to process. In the worst case, an unlucky core gets all the difficult reads, so that it still has to work while other cores are idle because they finished aligning their cheap reads. We claim that the cost of aligning reads varies because both the read length and the number of mutations vary for the different reads in a workload.
Also, the memory layout and the NUMA architecture of our multi-socket processors have an impact on the alignment cost of individual reads.
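
The effect of the hypothesis above can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not BWA code: the per-read costs are made-up numbers, `static_makespan` models an even, up-front split of the reads, and `dynamic_makespan` models an idealised scheduler in which whichever worker becomes idle first takes the next read (a simpler stand-in for dynamic load balancing in general).

```python
import heapq


def static_makespan(costs, workers):
    """Finish time when the reads are split into equal-sized contiguous shares."""
    share = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + share]) for i in range(0, len(costs), share))


def dynamic_makespan(costs, workers):
    """Finish time when the worker that becomes idle first takes the next read."""
    finish_times = [0] * workers  # min-heap of per-worker finish times
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical workload: nine cheap reads followed by three expensive ones.
example_costs = [1] * 9 + [10] * 3

if __name__ == "__main__":
    print("static :", static_makespan(example_costs, 3))
    print("dynamic:", dynamic_makespan(example_costs, 3))
```

With these hypothetical costs the even split finishes at time 31, because one worker receives all three expensive reads, whereas the dynamic assignment finishes at time 13, which equals the total work (39) divided by the three workers.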

Fig. 2. Scaling of BWA on 40 cores. Red: linear speedup. Blue: measured speedup.

In the rest of this paper, we analyse the cause of the load imbalance in BWA in more detail. We also present a Cilk-based parallelisation strategy that removes the load imbalance, allowing us to achieve more than a factor two speedup compared to the original code.

2 BWA Implementation

Short read alignment is a massively data parallel problem. In a typical workload, millions of reads, up to 100bp (200 characters) long, need to be aligned. Reads are aligned independently from one another. Hence read alignment can be parallelised as a data parallel loop over the read set. Concretely, the BWA code (we always refer to the latest version of BWA, i.e. the bwa download on [2]) sets up as many pthreads as there are cores on the target processor (this is actually configured via the -t parameter). Each pthread executes a sequential alignment loop for an equal share of the reads.

Linear speedup for such an implementation is only guaranteed if the work to be done is roughly equal for each pthread, in order to avoid load imbalance. To determine whether load imbalance is possible, we inspect the algorithm that is executed to align reads.

2.1 Burrows-Wheeler Alignment Algorithm

The algorithm underlying the BWA code is well-documented [1], but we repeat it briefly to discuss the challenges it presents for efficient multithreaded execution. The Burrows-Wheeler alignment algorithm relies on the definition of two auxiliary data structures. These data structures are defined in terms of a compressible version of the reference, which is created via the so-called Burrows-Wheeler

transformation. E.g. BWT(abracadabra) would be ard$rcaaaabb. Given the Burrows-Wheeler transformation of the reference, the table c_tab stores for each character c in the (genetic) alphabet how many characters occur in the transformation that are lexicographically smaller than c. A second table, occ_tab, is defined so that a function occ(occ_tab, c, k) returns the number of occurrences of the character c in the prefix BWT(ref)[1...k]. In principle, the table occ_tab has for each character as many entries as the length of the reference, but BWA only stores the information for every 32 characters. For the human reference, occ_tab is around 3GB large [1].

Given the tables c_tab and occ_tab, finding out where (or whether) a read matches the reference is a simple calculation. Fig. 3 shows pseudo code for matching a read. The code consists of a loop that iterates over the characters of the read (r). Each iteration references occ_tab and c_tab to compute a new starting point (sp) and end point (ep), which represent a range from which the indexes where the read matches the reference can be calculated.

The code in Fig. 3 actually only works for reads that match the reference exactly. For reads with mutations or indels, additional work is needed. For inexact matches, multiple alternative matches are checked and explored, using a priority queue to direct the order in which the alternatives are explored. It is not important at this point to understand all the details; the structure of the code remains roughly the same as in Fig. 3. What is important to note is that additional work is necessary for inexact matches. This is also observed by Li et al. in their GPU implementation of SOAP2 [7].

The code in Fig. 3 embodies certain patterns that have consequences for multithreaded code:

1. The ratio of memory operations versus other operations is high: 28% (computed with Intel® VTune™ Amplifier XE 2013).
   Memory operations may have a high latency and stall processors.
2. In standard multicore servers, cores are clustered in sockets. Cores on different sockets have different access times to different regions in memory, cf. non-uniform memory access (NUMA) architecture. A core can access memory on its own socket faster than memory on other, remote sockets. By default, each pthread allocates memory on its own socket. In BWA, the tables c_tab and occ_tab are allocated at the beginning of the execution, before the alignment starts. This means that all pthreads which are not on the first socket will have slower access to these tables.
3. Aligning a read that matches some part of the reference exactly is cheaper than aligning a read that has mutations or indels.
4. Reads have varying lengths when quality clipping is used: in our example workload, between bp. Since each character of a read needs to be processed by the loop in Fig. 3, longer reads will take longer to match.

The above points are all sources of load imbalance amongst the pthreads: there is load imbalance because certain threads will have slower access to the c_tab and occ_tab tables, and there is load imbalance because certain threads will have to handle longer or more mutated reads than others.
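
The exact-matching loop described in this section can be tried out on the abracadabra example. The following is a minimal, self-contained sketch under several assumptions that are not taken from the BWA sources: the transform is built naively from sorted rotations, occ is computed by scanning the prefix instead of using a sampled occ_tab, ranges are 1-based as in the prefix notation BWT(ref)[1...k], and the helper names `build_index`, `occ` and `exact_match` are hypothetical.

```python
def build_index(ref):
    """Return (bwt, c_tab) for ref, using '$' as the end marker (small inputs only)."""
    text = ref + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    bwt = "".join(rotation[-1] for rotation in rotations)
    c_tab, smaller = {}, 0
    for ch in sorted(set(text)):
        c_tab[ch] = smaller  # characters in the transform strictly smaller than ch
        smaller += text.count(ch)
    return bwt, c_tab


def occ(bwt, c, k):
    """Occurrences of c among the first k characters of bwt (naive scan)."""
    return bwt[:k].count(c)


def exact_match(read, bwt, c_tab):
    """Return (match count, sp, ep) for a non-empty read over the indexed alphabet."""
    letters = sorted(c_tab)

    def upper_bound(c):
        # c_tab of the next letter, or the transform length for the last letter.
        nxt = letters.index(c) + 1
        return c_tab[letters[nxt]] if nxt < len(letters) else len(bwt)

    c = read[-1]
    sp, ep = c_tab[c] + 1, upper_bound(c)
    j = len(read) - 2
    while j >= 0 and sp <= ep:  # walk the read from right to left
        nc = read[j]
        sp = c_tab[nc] + occ(bwt, nc, sp - 1) + 1
        ep = c_tab[nc] + occ(bwt, nc, ep)
        j -= 1
    return ep - sp + 1, sp, ep


if __name__ == "__main__":
    bwt, c_tab = build_index("abracadabra")
    print(bwt)
    print(exact_match("bra", bwt, c_tab))
```

For the reference abracadabra the sketch reproduces the transform ard$rcaaaabb, and the read bra yields two matches. Each loop iteration performs two occ lookups at essentially unpredictable positions in the table, which is the memory-bound access pattern that item 1 refers to.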

    def exact_bwc(r):
        n = len(r)
        i = n - 1
        c = r[i]                             # start from the last character of the read
        sp = c_tab[c] + 1                    # initial range for character c
        ep = c_tab[next_letter(c, abc)]
        j = i - 1
        while not (j < 0 or sp > ep):        # stop at the read start or on an empty range
            nc = r[j]
            sp = c_tab[nc] + occ(occ_tab, nc, sp - 1) + 1
            ep = c_tab[nc] + occ(occ_tab, nc, ep)
            j -= 1
        return (ep - sp + 1, sp, ep)         # number of matches and the matching range

Fig. 3. Alignment of a read (exact)

2.2 Measuring Load Imbalance

We can confirm the predicted load imbalance by measuring the average time each pthread needs to align its read set. Fig. 4 shows the time we measure per pthread on our 12-core processor (averages for 5 runs, with the same distribution of reads across cores for each run). Fig. 5 shows the same for the 40-core processor. In both cases, there is a clear load imbalance between the pthreads.

Fig. 4. BWA load imbalance on 12 cores.

3 Removing Load Imbalance with Cilk

Intel® Cilk™ Plus [3, 4] is an extension of C/C++ for task-based parallel programming. It provides constructs for expressing fork/join patterns and parallel

for loops. These constructs are mapped onto tasks that are executed by a dynamic work-stealing scheduler. With work stealing, a worker thread is created for each core. Every worker thread has its own task pool, but when a worker thread runs out of tasks, it steals tasks from worker threads that are still busy. This way faster threads take over work from slower threads, balancing the overall workload.

The advantage of using Cilk is that load imbalance amongst threads is handled implicitly by the work-stealing scheduler. The programmer simply focuses on identifying and creating the parallelism.

Fig. 5. BWA load imbalance on 40 cores.

3.1 Cilk-based Parallelisation

We replace the pthread-based parallel loop in BWA by a Cilk for loop. There are some intricacies with regard to making sure that each worker thread has its own priority queue for intermediate matches, to avoid contention on a shared queue. Our solution is to initialise the priority queues before executing the parallel loop, one for each worker thread. The priority queues are stored in a global array so that they are accessible by all worker threads. Inside the for loop, we use Cilk's introspective operator for querying the running worker thread's ID, which we then use to identify the priority queue the worker thread accesses.

3.2 Improved Scaling Results

By using Cilk, the scaling behaviour of BWA improves drastically. Fig. 6 compares the Cilk-based scaling (green) with the original pthread code (blue) on a 12-core Intel Xeon X5660 processor (2 sockets × 6 cores). To allow direct comparison, our speedup graphs use the same baseline: 1-threaded execution of unmodified BWA. With the Cilk version, we achieve a 10x speedup, or 82%

of the potential linear speedup (red), compared to 9x speedup or 73% for the pthread version.

The results are even better for the 40-core Intel Xeon E processor (4 sockets × 10 cores), cf. Fig. 7. There the Cilk version achieves 30x speedup, 75% of the potential, versus 13.86x speedup or 35% for the pthread version. The difference in improvement between the 12-core and 40-core processors is due to the fact that the 12-core processor has 2 sockets, whereas the 40-core processor has 4. Hence in the case of the 40-core processor, cores are more distant from each other and load imbalance due to remote memory access is more severe.

Our companion technical report [10] offers a more detailed technical discussion of these findings, as well as additional experiments with different data sets.

Fig. 6. Scaling of BWA on 12 cores using Cilk. Red: linear speedup. Blue: speedup measured for original pthreads implementation. Green: speedup measured for our Cilk solution.

4 Other Issues

Beyond load balancing, there are a number of other issues with BWA that hamper efficient multithreaded execution.

4.1 Memory Latency and Hyperthreading

When discussing the BWA algorithm in section 2.1, we saw that the ratio of memory operations versus other operations is high (28%). Memory operations have high latency and stall the processor. This is worsened by the fact that the data access pattern is random, so that both caching and speculation often fail. Hyperthreading can help in multithreaded applications where the threads have

bad latencies. We see a positive effect of hyperthreading with the Cilk-based version of BWA, achieving super-linear speedup with regard to the number of cores. In contrast, activating hyperthreading has almost no effect for the original pthread-based BWA (graphs omitted due to space restrictions, see our technical report [10]). Further improvements using prefetching may be possible.

Fig. 7. Scaling of BWA on 40 cores using Cilk. Red: linear speedup. Blue: speedup measured for original pthreads implementation. Green: speedup measured for our Cilk solution.

4.2 Parallel versus Sequential Section

The graphs we showed so far only take into account the time spent in the parallel section of BWA. However, the parallel section only comprises the alignment of the different reads. Before the reads can be aligned, data structures need to be initialised, e.g. by loading the reads from a file into memory. This part of the code makes up the sequential section of BWA, as it is not parallelised.

Amdahl's law states that the speedup to expect by parallelising a program is limited by the time spent in the sequential section. Fig. 8 shows the timings for running BWA on 1 to 40 threads on our sample workload. The red part of a timing shows the time spent on sequential execution, whereas the blue part shows the time spent in the parallel section. As the number of threads increases, the time spent in the sequential section becomes a dominating factor in the overall execution time.

Using Amdahl's law, we can predict the scaling behaviour of a program. Fig. 9 shows this for BWA: the red line is the ideal scaling behaviour to expect when the parallel section scales linearly but the sequential section stays constant. The blue and green lines show the speedups we actually measure for both the original BWA code and our Cilk version. The black line shows linear speedup with regard

to the available cores. We see that the red line is little more than half of this. If we want BWA to get closer to linear speedup to reach the potential of our processor, we need to parallelise or reduce the sequential section substantially.

Fig. 8. Time spent on the parallel (blue) versus the sequential section (red).

5 Related Work

Parallel BWA (pbwa) [9] is an MPI-based implementation of BWA for cluster-based alignment, focusing on inter-node level parallelism. The improvements they claim for the multithreaded mode of BWA on a single node are already integrated with the (latest) version of BWA (bwa-0.6.2) that we adapted. Hence that work is complementary to ours.

6 Conclusions

The multithreaded mode of BWA scales poorly on multi-socket multicore processors because the parallelisation strategy, which evenly distributes the reads amongst available cores, suffers from load imbalance. We remove the load imbalance by rewriting the parallel section of BWA in Cilk, a task parallel extension for C/C++ based on a work-stealing scheduler that is capable of dynamically load balancing running programs. Using Cilk, we improve the scaling behaviour of BWA by more than a factor two, as shown by our experiments on both a 12-core and a 40-core processor. We refer the reader to our technical report for experiments with additional data sets and a more detailed discussion [10].

Other issues to investigate in the future include the possible latency and bandwidth problems caused by the high number of memory operations, strategies for further reducing the NUMA penalties such as replication of data structures, as well as reducing the proportionally large sequential section.
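
To make the limit imposed by the sequential section (Section 4.2) more tangible, Amdahl's law can be written as speedup = 1 / (s + (1 - s) / p), where s is the fraction of the single-threaded run time spent in the sequential section and p is the number of workers. The sketch below is illustrative only: the function names are hypothetical and the 2.5% sequential fraction is a made-up value, not a measurement from this paper. The efficiency helper simply divides a measured speedup by the core count, as done for the percentages quoted in Section 3.2.

```python
def amdahl_speedup(sequential_fraction, workers):
    """Upper bound on speedup when only the parallel part scales with `workers`."""
    if not 0.0 <= sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must lie in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / workers)


def parallel_efficiency(speedup, cores):
    """Measured speedup as a fraction of the linear speedup on `cores` cores."""
    return speedup / cores


if __name__ == "__main__":
    # Hypothetical sequential fraction of 2.5%, evaluated for 40 workers.
    print(amdahl_speedup(0.025, 40))
    # Speedups reported for the 40-core machine: pthreads 13.9x, Cilk 30x.
    print(parallel_efficiency(13.9, 40), parallel_efficiency(30.0, 40))
```

Under this assumed fraction the bound at 40 workers is about 20x, i.e. roughly half of linear speedup, which shows how even a small sequential section caps the achievable scaling.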

Fig. 9. BWA overall scaling versus potential scaling. Black: linear speedup. Red: theoretical speedup via Amdahl's law, when the parallel section would scale linearly. Blue: speedup measured for original BWA parallelisation using pthreads. Green: speedup measured for our Cilk solution.

Acknowledgments. This work is funded by Intel, Janssen Pharmaceutica and by the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT).

References

1. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics 2009, 25(14) (2009)
2. Burrows-Wheeler Aligner
3. Leiserson, Ch. E.: The Cilk++ concurrency platform, The Journal of Supercomputing, vol. 51, iss. 3, March 2010, Kluwer Academic Publishers (2010)
4. Intel Cilk Plus
5. Ferragina, P., Manzini, G.: Opportunistic data structures with applications, In: 41st IEEE Annual Symposium on Foundations of Computer Science, IEEE Computer Society, Los Alamitos, CA, USA (2000)
6. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biology 2009, 10:R25 (2009)
7. Li, R., Yu, Ch. et al.: SOAP2: an improved ultrafast tool for short read alignment, Bioinformatics (15) (2009)
8. 1000 Genomes Project
9. Peters, D., Luo, X., Qiu, K., Liang, P.: Speeding Up Large-Scale Next Generation Sequencing Data Analysis with pbwa, J. Appl. Bioinform. Comput. Biol. (2012)
10. Herzeel, C., Costanza, P., Ashby, T., Wuyts, R.: Performance Analysis of BWA Alignment, Technical report, ExaScience Life Lab (2013)


I519 Introduction to Bioinformatics. Indexing techniques. Yuzhen Ye School of Informatics & Computing, IUB I519 Introduction to Bioinformatics Indexing techniques Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Contents We have seen indexing technique used in BLAST Applications that rely

More information

Tutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE

Tutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE Tutorial: Analyzing MPI Applications Intel Trace Analyzer and Collector Intel VTune Amplifier XE Contents Legal Information... 3 1. Overview... 4 1.1. Prerequisites... 5 1.1.1. Required Software... 5 1.1.2.

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Hardware Acceleration of Genetic Sequence Alignment

Hardware Acceleration of Genetic Sequence Alignment Hardware Acceleration of Genetic Sequence Alignment J. Arram 1,K.H.Tsoi 1, Wayne Luk 1,andP.Jiang 2 1 Department of Computing, Imperial College London, United Kingdom 2 Department of Chemical Pathology,

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

A Comprehensive Study on the Performance of Implicit LS-DYNA

A Comprehensive Study on the Performance of Implicit LS-DYNA 12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four

More information

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java s Devrim Akgün Computer Engineering of Technology Faculty, Duzce University, Duzce,Turkey ABSTRACT Developing multi

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

CMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: Suffix trees Suffix arrays

CMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: Suffix trees Suffix arrays CMSC423: Bioinformatic Algorithms, Databases and Tools Exact string matching: Suffix trees Suffix arrays Searching multiple strings Can we search multiple strings at the same time? Would it help if we

More information

Chapter 7. Multicores, Multiprocessors, and Clusters. Goal: connecting multiple computers to get higher performance

Chapter 7. Multicores, Multiprocessors, and Clusters. Goal: connecting multiple computers to get higher performance Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)

More information

A Nearest Neighbors Algorithm for Strings. R. Lederman Technical Report YALEU/DCS/TR-1453 April 5, 2012

A Nearest Neighbors Algorithm for Strings. R. Lederman Technical Report YALEU/DCS/TR-1453 April 5, 2012 A randomized algorithm is presented for fast nearest neighbors search in libraries of strings. The algorithm is discussed in the context of one of the practical applications: aligning DNA reads to a reference

More information

NEXT Generation sequencers have a very high demand

NEXT Generation sequencers have a very high demand 1358 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 5, MAY 2016 Hardware-Acceleration of Short-Read Alignment Based on the Burrows-Wheeler Transform Hasitha Muthumala Waidyasooriya,

More information

Optimize Data Structures and Memory Access Patterns to Improve Data Locality

Optimize Data Structures and Memory Access Patterns to Improve Data Locality Optimize Data Structures and Memory Access Patterns to Improve Data Locality Abstract Cache is one of the most important resources

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Lecture 9: Performance tuning Sources of overhead There are 6 main causes of poor performance in shared memory parallel programs: sequential code communication load imbalance synchronisation

More information

Application Programming

Application Programming Multicore Application Programming For Windows, Linux, and Oracle Solaris Darryl Gove AAddison-Wesley Upper Saddle River, NJ Boston Indianapolis San Francisco New York Toronto Montreal London Munich Paris

More information

Mapping Reads to Reference Genome

Mapping Reads to Reference Genome Mapping Reads to Reference Genome DNA carries genetic information DNA is a double helix of two complementary strands formed by four nucleotides (bases): Adenine, Cytosine, Guanine and Thymine 2 of 31 Gene

More information

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012) USING BRAT-BW-2.0.1 BRAT-bw is a tool for BS-seq reads mapping, i.e. mapping of bisulfite-treated sequenced reads. BRAT-bw is a part of BRAT s suit. Therefore, input and output formats for BRAT-bw are

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

Parallel Programming Patterns Overview and Concepts

Parallel Programming Patterns Overview and Concepts Parallel Programming Patterns Overview and Concepts Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing

Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing Proposal for diploma thesis Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing Astrid Rheinländer 01-09-2010 Supervisor: Prof. Dr. Ulf Leser Motivation

More information

VLPL-S Optimization on Knights Landing

VLPL-S Optimization on Knights Landing VLPL-S Optimization on Knights Landing 英特尔软件与服务事业部 周姗 2016.5 Agenda VLPL-S 性能分析 VLPL-S 性能优化 总结 2 VLPL-S Workload Descriptions VLPL-S is the in-house code from SJTU, paralleled with MPI and written in C++.

More information

Computer Architecture

Computer Architecture Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors

More information

SEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi

SEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi SEASHORE SARUMAN Summary 1 / 24 SEASHORE / SARUMAN Short Read Matching using GPU Programming Tobias Jakobi Center for Biotechnology (CeBiTec) Bioinformatics Resource Facility (BRF) Bielefeld University

More information

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Ameya Velingker and Dougal J. Sutherland {avelingk, dsutherl}@cs.cmu.edu http://www.cs.cmu.edu/~avelingk/compilers/

More information

Intel VTune Amplifier XE

Intel VTune Amplifier XE Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance

More information

Introduction to parallel computing

Introduction to parallel computing Introduction to parallel computing 3. Parallel Software Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Last time Parallel hardware Multi-core

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

Short Read Alignment Algorithms

Short Read Alignment Algorithms Short Read Alignment Algorithms Raluca Gordân Department of Biostatistics and Bioinformatics Department of Computer Science Department of Molecular Genetics and Microbiology Center for Genomic and Computational

More information

Lam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM.

Lam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM. Title High throughput short read alignment via bi-directional BWT Author(s) Lam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM Citation The IEEE International Conference on Bioinformatics and Biomedicine

More information

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

27. Parallel Programming I

27. Parallel Programming I 771 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment

Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment Yong Luo Scientific Computing Group CIC-19 Los Alamos National Laboratory Los Alamos, NM 87545, U.S.A. Email: yongl@lanl.gov, Fax: (505)

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching Ilya Y. Zhbannikov 1, Samuel S. Hunter 1,2, Matthew L. Settles 1,2, and James

More information

Software and Tools for HPE s The Machine Project

Software and Tools for HPE s The Machine Project Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric

More information

HISAT2. Fast and sensi0ve alignment against general human popula0on. Daehwan Kim

HISAT2. Fast and sensi0ve alignment against general human popula0on. Daehwan Kim HISA2 Fast and sensi0ve alignment against general human popula0on Daehwan Kim infphilo@gmail.com History about BW, FM, XBW, GBW, and GFM BW (1994) BW for Linear path Burrows M, Wheeler DJ: A Block Sor0ng

More information

Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors

Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures

More information

Architecture without explicit locks for logic simulation on SIMD machines

Architecture without explicit locks for logic simulation on SIMD machines Architecture without explicit locks for logic on machines M. Chimeh Department of Computer Science University of Glasgow UKMAC, 2016 Contents 1 2 3 4 5 6 The Using models to replicate the behaviour of

More information

High-throughput Sequence Alignment using Graphics Processing Units

High-throughput Sequence Alignment using Graphics Processing Units High-throughput Sequence Alignment using Graphics Processing Units Michael Schatz & Cole Trapnell May 21, 2009 UMD NVIDIA CUDA Center Of Excellence Presentation Searching Wikipedia How do you find all

More information

Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage

Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Evaluation of Lustre File System software enhancements for improved Metadata performance Wojciech Turek, Paul Calleja,John

More information

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance

More information

Parallel Programming Multicore systems

Parallel Programming Multicore systems FYS3240 PC-based instrumentation and microcontrollers Parallel Programming Multicore systems Spring 2011 Lecture #9 Bekkeng, 4.4.2011 Introduction Until recently, innovations in processor technology have

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

TDDD56 Multicore and GPU computing Lab 2: Non-blocking data structures

TDDD56 Multicore and GPU computing Lab 2: Non-blocking data structures TDDD56 Multicore and GPU computing Lab 2: Non-blocking data structures August Ernstsson, Nicolas Melot august.ernstsson@liu.se November 2, 2017 1 Introduction The protection of shared data structures against

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Mapping NGS reads for genomics studies

Mapping NGS reads for genomics studies Mapping NGS reads for genomics studies Valencia, 28-30 Sep 2015 BIER Alejandro Alemán aaleman@cipf.es Genomics Data Analysis CIBERER Where are we? Fastq Sequence preprocessing Fastq Alignment BAM Visualization

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

AUTOMATIC SMT THREADING

AUTOMATIC SMT THREADING AUTOMATIC SMT THREADING FOR OPENMP APPLICATIONS ON THE INTEL XEON PHI CO-PROCESSOR WIM HEIRMAN 1,2 TREVOR E. CARLSON 1 KENZO VAN CRAEYNEST 1 IBRAHIM HUR 2 AAMER JALEEL 2 LIEVEN EECKHOUT 1 1 GHENT UNIVERSITY

More information

DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA

DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA M. GAUS, G. R. JOUBERT, O. KAO, S. RIEDEL AND S. STAPEL Technical University of Clausthal, Department of Computer Science Julius-Albert-Str. 4, 38678

More information