Achieving High Throughput Sequencing with Graphics Processing Units
Su Chen 1, Chaochao Zhang 1, Feng Shen 1, Ling Bai 1, Hai Jiang 1, and Damir Herman 2

1 Department of Computer Science, Arkansas State University, Jonesboro, AR 72467, USA
2 Department of Internal Medicine, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA

Abstract

High throughput sequencing has become a powerful technique for genome analysis since the concept emerged in recent years. Currently, there is a huge demand from patients with genetic diseases that cannot be satisfied, owing to limited computational power. Although several software packages implement the most efficient current algorithms for various types of sequencing problems, CPUs are too expensive to process the endless stream of data economically, because they are not designed for data-parallel problems. The latest Fermi architecture released by NVIDIA provides a considerable number of streaming processors, a larger register file, and 1 MB of cache, which makes it very competitive for data-parallel processing. This paper implements a simple sequence alignment method on the GPU and compares real-world performance between CPU and GPU. Experiments show that the GPU has good potential for problems of this kind.

Keywords: High Throughput Sequencing, Graphics Processing Unit

1. Introduction

Nowadays, people pay more and more attention to health care, and advanced devices are designed to analyze samples from patients. At the molecular level, the amount of data becomes extremely large, which demands more computational power. The emerging High Throughput Sequencing (HTS) technology [6], [7] offers bioinformaticians a better way to deal with this problem, and many multithreaded programs, such as Bowtie [3], BWA [4], and SOAP2 [7], have been developed for practical use.
However, sequence alignment is in some sense too easy a job for sequential CPUs, which makes smart chips like CPUs too expensive for it. Since NVIDIA released its new Fermi architecture, which provides 512 cores on one chip and gigabytes of memory, the GPU appears to have great potential to take over this job and do it faster and more economically. In this paper, a simple way is proposed to do exact matching between massive numbers of DNA target fragments and mRNA reference sequences, and performance comparisons between its CPU and GPU versions are discussed.

The paper is organized as follows: Section 2 gives our method, including the indexing and searching phases. Section 3 discusses the detailed designs with respect to architecture. In Section 4, we discuss the experimental results. Section 5 covers related work, and conclusions are drawn in Section 6.

2. Algorithm Design

The algorithm idea used in this paper comes from the Burrows-Wheeler Transformation (BWT), which was first proposed for data compression and was later developed into an efficient index for sequence alignment. Fig. 1 illustrates how the original transformation works.

Fig. 1: Burrows-Wheeler Transformation. The seven rotations of "acaacg$" (caacg$a, aacg$ac, acg$aca, cg$acaa, g$acaac, $acaacg, acaacg$) are sorted into $acaacg, aacg$ac, acaacg$, acg$aca, caacg$a, cg$acaa, g$acaac; the last column, "gc$aaac", is the transform.

The concept of BWT is to index the reference sequence by hashing its elements into a special order, which benefits the later searching phase and reduces the search time complexity from the O(n lg(n)) of the brute-force method to O(lg(n)). A concrete implementation of BWT can be described as follows:

1) Put a $ at the end of the reference sequence.
2) Copy the current sequence, shift the new sequence right by one position, and put it below the previous one; repeat n times, given that the original sequence length is n.
3) Sort the rows of the generated block lexicographically, using the alphabet order $, a, c, g, t.
4) Take the last column of the sorted matrix.
2.1 A New Indexing Method

Inspired by BWT, we designed another way to build the index. The procedure of the new method is shown in Fig. 2 and explained in more detail below.

1) We still add a $ at the end of the reference sequence.
2) Generate the same block as BWT does, but this time assign order numbers to a, c, g, and t separately in the first column.
3) In this approach, we only sort the first column of the matrix, making sure the smaller order numbers of a, c, g, and t stay above the larger ones.
4) We take the last column as the new index.

Fig. 2: New indexing method. The rows of the block are reordered by first-column letter and order number; the last column is the new index.

2.2 Searching Algorithm

The newly proposed method has a brute-force searching nature, but by using the index well, several improvements can be achieved. The searching procedure, illustrated in Fig. 3 for the query "aac", is very straightforward.

Fig. 3: Searching with the new index (the query "aac" is matched character by character against the indexed rows).

2.3 Making a Secondary Index

Now we turn to improving the performance of our searching algorithm. We can build a secondary index on top of the first-level index generated by the method above. For the first column, since we will refer to the beginning and end of the a, c, g, and t runs many times, we can save space and record only these boundary positions for the four letter types. This saves not only searching time but also a lot of space in the index file. For the last column, since the a, c, g, and t entries there are not clustered, we can create four arrays, one per letter, and record the positions at which each letter occurs in the last column. This prevents the searching algorithm from visiting positions of wrong letters; for example, if we want an a, we go only to the positions recorded for a in the last column and skip letters of other types.

Fig. 4: Secondary index generation (boundary ranges per letter for the first column; occurrence-position lists per letter for the last column).

Generally, although it does not fundamentally reduce the time complexity of the searching algorithm, this indexing method avoids much unnecessary work while generating a simple index in O(n) time. The CPU and GPU performances discussed in the experiment section are based on this algorithm.
3. I/O Involved Program Design

3.1 Single-threaded Code Design for CPU

Since the indexing phase of our algorithm costs little time compared with the searching phase, in which an unpredictable number of target sequences arrive as input, we include the indexing time in the total searching time in this paper. Another important advantage of this is that we avoid the I/O cost of loading indices from the hard disk, which takes much more time than the indexing phase when the reference sequence file is very large. When everything is done in memory and the hard disk is never touched, searching usually becomes faster. Fig. 5 illustrates how the data pertaining to our program flows between memory and the hard disk.

1) Load the reference sequence file from the hard disk.
2) Generate the index for the reference sequence in memory.
3) Remove the original sequence file from memory, leaving only the index.
4) Load the next target sequence file from the hard disk into memory.
5) Do the searching for the current batch of target sequences and save the results.
6) Remove the current batch of target sequences.
7) Repeat 4) to 6) for all target files.

3.2 CUDA C Code Design for a Single GPU

Fig. 6 shows the procedure for a machine with a GPU dealing with our problem. There are altogether thirteen steps of
execution and data transfer for the indexing and searching phases, which are explained below.

Fig. 5: Data locality control for the CPU implementation.

Fig. 6: Work and data scheduling for the GPU implementation.

1) Load the reference sequence file from the hard disk to CPU memory.
2) Copy the reference sequences from CPU memory to GPU memory.
3) Remove the reference sequences from CPU memory.
4) Generate the index for the reference sequence using the GPU.
5) Remove the original sequence file from GPU memory, leaving only the index.
6) Load a target sequence file from the hard disk to CPU memory.
7) Copy the current batch of target sequences from CPU memory to the GPU.
8) Remove the present target sequences from CPU memory and load the next batch of target sequences.
9) The GPU does the searching and saves the results in its memory.
10) Remove the current batch of target sequences from the GPU.
11) Repeat 6) to 10) for all target files.
12) Copy the results back to CPU memory and save them to disk.
13) Remove the results from GPU and CPU memory.

3.3 Noteworthy Differences between CPU and GPU Implementations

1) The GPU version has an initialization time for the first booting of the device, usually up to 2-3 seconds, which the CPU version does not. So for small cases that run very fast on CPUs, GPUs have no advantage.
2) Data transfer time between host and device memory must be considered, since the data volume in our case is usually very large.
3) GPUs can do simple calculations very fast if the programs are well designed, so the indexing phase, like the searching phase, can also be considered for the GPU if the data transfer time can be ignored. If indexing takes only a little time, there is not much need to do it on the GPU. The searching phase is usually handled well by GPUs, since the number of target sequences is always very large. Speedups of dozens to hundreds of times can be expected for the searching phase if GPUs are adopted.
4. Experimental Results

The sequential code was written in C and tested on a machine with two Intel Xeon E5504 quad-core CPUs (2.0 GHz, 4 MB cache); the GPU code was written in CUDA C and tested on the same machine with two NVIDIA Tesla 20-Series C2050 GPUs. In the following, a performance comparison between the two is given and the GPU speedup rate is calculated. The time proportion of each part of the whole algorithm on CPUs and GPUs is also illustrated and discussed separately.

4.1 CPU vs. GPU Searching Time

Block sorting is the most time-consuming part of building the index for reference strings. Fig. 7 gives the curves relating time cost to the number of reference string combinations (one reference string length = 3,). From Fig. 7 we can see that, for the algorithm proposed in this paper, searching takes a large portion of total execution time on the CPU side, while on the GPU side it takes a relatively smaller portion. This is because the GPU runs the searching part much faster than the CPU; given that I/O and data transfer time grow proportionally as the number of target sequences increases, the GPU saves more absolute time as the problem scale becomes larger.
Fig. 7: CPU & GPU timing with and without I/O (execution time in seconds versus number of target sequences, length = 87).

Fig. 9: GPU speedup rate with I/O.

4.2 Speedup on GPU

Fig. 8 and Fig. 9 illustrate two speedup curves: pure searching time, and searching time including I/O and data transfer. We can see that for the pure searching algorithm, the GPU version beats the CPU version by up to 14 times, while about 5 times speedup is achieved when I/O and data transfer are taken into consideration. Since the algorithm is not fully optimized, there should still be potential for GPUs to speed this problem up further.

Fig. 8: GPU speedup rate without I/O.

4.3 Overhead Breakdown for the CPU & GPU Approaches

1) I/O from the hard disk. For both the CPU and GPU implementations, this part takes the same time and is unavoidable. The bandwidth from hard disk to memory has always been a bottleneck for similar problems. However, if instead of the local hard disk we use InfiniBand to load data from a remote database in parallel, the performance of both the CPU and GPU versions can be improved; the GPU version might benefit more, because it processes data much faster and needs more data in a given time to feed its greater computational power.

2) Data transfer between host and device memory. Currently, NVIDIA GPUs use the PCIe bus to transfer data between host and device memory, with a capacity of up to 4 GB/s for one-way transmission and 8 GB/s for two-way. This speed can usually keep up with the GPU's computational power and is not a bottleneck for now.
A noteworthy point here is that the asynchronous memory copy technique should be used when the target sequence data is too large for the GPU to load at once. Asynchronous copies between host and device memory can overlap with GPU computation, so either the copy time or the computing time can be hidden by this overlapping; which portion is hidden depends on their respective costs.

3) Time for indexing. For the algorithm presented in this paper, the indexing time can almost be ignored, since I/O and searching time dominate. However, in real applications, such as BWT, the index is usually made more efficient to use, but building it also takes more time, and that overhead cannot be ignored. In such cases, indexing time should be treated as an important portion of the whole system.

4) Time for searching. This portion of time depends on many factors, including indexing efficiency, I/O speed, choice of device, and task partitioning design. Basically, more efficient indexing reduces searching time, and higher I/O speed positively influences performance. For the choice of device, the GPU is better than the CPU from an economic standpoint, since it provides more powerful tools for searching. However, whether a partitioning design is good is hard to tell from the surface of a specific problem; calculations should be done carefully to find the optimal selection.
5. Related Work

RNA sequencing was one of the earliest forms of nucleotide sequencing. The major landmarks of RNA sequencing are the sequence of the first complete gene and the complete genome of bacteriophage MS2, identified and published by Walter Fiers et al. in 1972 [8] and 1976 [2]. In the late 2000s, high-throughput sequencing (HTS) emerged. Li R. (2008, 2009) published several papers on BWT applications for short read alignment [6], [7]. Li H. (2008, 2009) [5], [4] and Langmead (2009) [3] also published several works on memory-efficient alignment. In recent years, several alignment programs, such as Bowtie [3], BWA [4], and SOAP2 [7], have been released. In 2009, Sinnott-Armstrong et al. presented a paper on accelerating epistasis analysis in human genetics with the NVIDIA GeForce GTX 280 and the PyCUDA programming tool [9]. Davis et al. (2011) made a real-world performance comparison of SNPrank across programming platforms (Python, Java, and Matlab) and hardware environments (single-threaded, multithreaded, and GPU), where the GPU languages were restricted to Matlab and Python [1] and the GPU was an NVIDIA Tesla. They reported that for small cases the CPU always performs better, because of the data transfer to and from the device.

6. Conclusions and Future Work

This paper proposes a way to implement fast sequence alignment on the latest generation of NVIDIA GPUs. From the experimental results, we can see that the GPU speeds up the searching phase considerably compared with the CPU, but incurs a constant delay for its necessary data transfer phase. This shows that the GPU has good potential for high throughput sequencing. If the bandwidth bottleneck of loading data from the hard disk can be relieved, the performance still has great potential to keep growing, whereas for a single-threaded CPU the computational power cannot guarantee that.
In the future, we will try to parallelize the most advanced sequence alignment algorithms on GPUs and keep investigating the GPU's capability in more applications of urgent concern to the medical and biological fields.

References

[1] Nicolas A. Davis, Ahwan Pandey, and B. A. McKinney. Real-world comparison of CPU and GPU implementations of SNPrank: a network analysis tool for GWAS. Bioinformatics, 27, 2011.
[2] W. Fiers, R. Contreras, and F. Duerinck. Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature, 260:500–507, 1976.
[3] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25, 2009.
[4] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[5] H. Li, J. Ruan, and R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18:1851–1858, 2008.
[6] R. Li et al. SOAP: short oligonucleotide alignment program. Bioinformatics, 24(5):713–714, 2008.
[7] R. Li et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, 2009.
[8] W. Min Jou, G. Haegeman, M. Ysebaert, and W. Fiers. Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein. Nature, 237:82–88, 1972.
[9] Nicolas A. Sinnott-Armstrong, Casey S. Greene, Fabio Cancare, and Jason H. Moore. Accelerating epistasis analysis in human genetics with consumer graphics hardware. Technical report, Dartmouth Medical School, NH, USA and Politecnico di Milano, Milano, Italy, 2009.
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationGPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27
1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution
More informationAccelerated Machine Learning Algorithms in Python
Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationA TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE
A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA
More informationSlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching
SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching Ilya Y. Zhbannikov 1, Samuel S. Hunter 1,2, Matthew L. Settles 1,2, and James
More informationHowdah. a flexible pipeline framework and applications to analyzing genomic data. Steven Lewis PhD
Howdah a flexible pipeline framework and applications to analyzing genomic data Steven Lewis PhD slewis@systemsbiology.org What is a Howdah? A howdah is a carrier for an elephant The idea is that multiple
More informationEfficient Computation of Radial Distribution Function on GPUs
Efficient Computation of Radial Distribution Function on GPUs Yi-Cheng Tu * and Anand Kumar Department of Computer Science and Engineering University of South Florida, Tampa, Florida 2 Overview Introduction
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More informationThe Optimal CPU and Interconnect for an HPC Cluster
5. LS-DYNA Anwenderforum, Ulm 2006 Cluster / High Performance Computing I The Optimal CPU and Interconnect for an HPC Cluster Andreas Koch Transtec AG, Tübingen, Deutschland F - I - 15 Cluster / High Performance
More informationParallelism. Parallel Hardware. Introduction to Computer Systems
Parallelism We have been discussing the abstractions and implementations that make up an individual computer system in considerable detail up to this point. Our model has been a largely sequential one,
More informationHigh-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs
High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea
More informationBlueDBM: An Appliance for Big Data Analytics*
BlueDBM: An Appliance for Big Data Analytics* Arvind *[ISCA, 2015] Sang-Woo Jun, Ming Liu, Sungjin Lee, Shuotao Xu, Arvind (MIT) and Jamey Hicks, John Ankcorn, Myron King(Quanta) BigData@CSAIL Annual Meeting
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationBuilding NVLink for Developers
Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized
More informationINTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA
INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA SEQUENCING AND MOORE S LAW Slide courtesy Illumina DRAM I/F
More informationAccelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture
Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Da Zhang Collaborators: Hao Wang, Kaixi Hou, Jing Zhang Advisor: Wu-chun Feng Evolution of Genome Sequencing1 In 20032: 1 human genome
More informationMathematical computations with GPUs
Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing
More informationGPU for HPC. October 2010
GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,
More informationDELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE
WHITEPAPER DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE A Detailed Review ABSTRACT While tape has been the dominant storage medium for data protection for decades because of its low cost, it is steadily
More informationOn the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters
1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk
More informationLam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM.
Title High throughput short read alignment via bi-directional BWT Author(s) Lam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM Citation The IEEE International Conference on Bioinformatics and Biomedicine
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationAccelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors
Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte
More informationAccelerating the Prediction of Protein Interactions
Accelerating the Prediction of Protein Interactions Alex Rodionov, Jonathan Rose, Elisabeth R.M. Tillier, Alexandr Bezginov October 21 21 Motivation The human genome is sequenced, but we don't know what
More informationArchitectures for Scalable Media Object Search
Architectures for Scalable Media Object Search Dennis Sng Deputy Director & Principal Scientist NVIDIA GPU Technology Workshop 10 July 2014 ROSE LAB OVERVIEW 2 Large Database of Media Objects Next- Generation
More informationIntroduction to Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015
Introduction to Read Alignment UCD Genome Center Bioinformatics Core Tuesday 15 September 2015 From reads to molecules Why align? Individual A Individual B ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGGTGTTATAATCGCCGCATGACATGATCAATGG
More informationSchool of Computer and Information Science
School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast
More informationParalization on GPU using CUDA An Introduction
Paralization on GPU using CUDA An Introduction Ehsan Nedaaee Oskoee 1 1 Department of Physics IASBS IPM Grid and HPC workshop IV, 2011 Outline 1 Introduction to GPU 2 Introduction to CUDA Graphics Processing
More informationOPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT
OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT Asif Ali Khan*, Laiq Hassan*, Salim Ullah* ABSTRACT: In bioinformatics, sequence alignment is a common and insistent task. Biologists align
More informationA Comprehensive Study on the Performance of Implicit LS-DYNA
12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four
More informationN-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo
N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational
More informationProcessing Technology of Massive Human Health Data Based on Hadoop
6th International Conference on Machinery, Materials, Environment, Biotechnology and Computer (MMEBC 2016) Processing Technology of Massive Human Health Data Based on Hadoop Miao Liu1, a, Junsheng Yu1,
More informationHalvade: scalable sequence analysis with MapReduce
Bioinformatics Advance Access published March 26, 2015 Halvade: scalable sequence analysis with MapReduce Dries Decap 1,5, Joke Reumers 2,5, Charlotte Herzeel 3,5, Pascal Costanza, 4,5 and Jan Fostier
More informationBetter Security Tool Designs: Brainpower, Massive Threading, and Languages
Better Security Tool Designs: Brainpower, Massive Threading, and Languages Golden G. Richard III Professor and University Research Professor Department of Computer Science University of New Orleans Founder
More informationShort Read Alignment Algorithms
Short Read Alignment Algorithms Raluca Gordân Department of Biostatistics and Bioinformatics Department of Computer Science Department of Molecular Genetics and Microbiology Center for Genomic and Computational
More informationComparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA
Comparative Analysis of Protein Alignment Algorithms in Parallel environment using BLAST versus Smith-Waterman Shadman Fahim shadmanbracu09@gmail.com Shehabul Hossain rudrozzal@gmail.com Gulshan Jubaed
More informationGPGPU introduction and network applications. PacketShaders, SSLShader
GPGPU introduction and network applications PacketShaders, SSLShader Agenda GPGPU Introduction Computer graphics background GPGPUs past, present and future PacketShader A GPU-Accelerated Software Router
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationJULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING
JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING Larson Hogstrom, Mukarram Tahir, Andres Hasfura Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 18.337/6.338
More information