SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform

MTP End-Sem Report submitted to Indian Institute of Technology, Mandi for partial fulfillment of the degree of B. Tech.

by Shivam Satija (B12020)

under the guidance of Dr. Arti Kashyap, Associate Professor

SCHOOL OF COMPUTING AND ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY MANDI
28th MAY 2016

CERTIFICATE OF APPROVAL

Certified that the End-Sem Report entitled "SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform", submitted by Shivam Satija (B12020) to the Indian Institute of Technology Mandi for the partial fulfilment of the degree of B. Tech., has been accepted after examination held today.

Date: 28th MAY 2016
Place: KAMAND, H. P., INDIA
Dr. Varun Dutt, Faculty Advisor

CERTIFICATE

This is to certify that the End-Sem Report titled "SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform", submitted by Shivam Satija (B12020) to the Indian Institute of Technology, Mandi, is a record of bonafide work under my (our) supervision and is worthy of consideration for partial fulfilment of the degree of B. Tech. of the Institute.

Date: 28th MAY 2016
Place: KAMAND, H. P., INDIA
Dr. Arti Kashyap (Guide), Faculty Supervisor(s)

DECLARATION BY THE STUDENT

This is to certify that the End-Sem Report titled "SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform", submitted by me to the Indian Institute of Technology Mandi for partial fulfillment of the degree of B. Tech., is a bonafide record of work carried out by me under the supervision of Dr. Arti Kashyap. The contents of this MTP, in full or in parts, have not been submitted to any other Institute or University for partial fulfillment of any degree or diploma.

Date: 28th MAY 2016
Place: KAMAND, H. P., INDIA
Shivam Satija, B12020

Acknowledgments

I would like to express my special thanks and gratitude to my guide, Dr. Arti Kashyap, who gave me the golden opportunity to do this project on the topic "SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform" under her supervision. I would also like to thank Mr. Sanjay Rathee (Ph.D. Scholar, IIT Mandi) for helping me in this project.

Shivam Satija

Abstract

Next generation sequencing (NGS) technologies are generating such a huge amount of genetic data that conventional single-processor sequence alignment tools are unable to keep pace with them. Therefore, cloud computing and MapReduce frameworks, which use thousands of commodity machines to store and process huge datasets, have emerged as the best solution to this problem of growing data. In this project, we propose a MapReduce-based sequence alignment technique implemented on Apache Spark, called SparkBurst. It is a reference genome indexing tool that generates an index (such as a suffix array or a BWT-based FM-index) for the reference genome by running the computation on a number of commodity machines in parallel. Reads are then distributed over the machines, which use the index to discover the location of each read in the reference genome in parallel.

Keywords: DNA Sequence, BWT, Spark, String Matching, MapReduce

Table of Contents

Acknowledgement
Abstract
1. Introduction
2. Objectives
3. Background and Related Work
4. Methodology
5. Results
6. Discussion and conclusions of results
7. Deliverables
8. Timeline
References

Introduction

Genome sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or technology that is used to determine the order of the four bases (adenine, guanine, cytosine, and thymine) in a strand of DNA. An important goal of genomics is to determine the index of a particular sequence in the reference genome. This is an instance of the string matching problem, for which various algorithms can be used.

Computer technology is increasingly used to manage biological information (bioinformatics). Computers are used to collect, store, analyze and integrate genetic and biological information for genotyping, metagenomics, SNP (single nucleotide polymorphism) discovery and personal genomics. The rapid development of next generation sequencing technologies has dramatically reduced the time and cost of DNA sequencing, and has dramatically increased the size of the genetic data produced by these machines.

There are existing single-machine sequence alignment tools such as BLAST, SOAP [6], RMAP [1] and MAQ [2]. But next generation machines produce billions of short DNA sequences (reads) in a few days, and the size of sequence data is projected to keep increasing dramatically. Single-machine alignment tools will not be able to handle such huge datasets, so highly distributed computing will be required.

In recent years, a parallel computing framework called MapReduce, which can use thousands of commodity machines for distributed computing, has emerged. Many MapReduce-based platforms such as Hadoop [14], Apache Spark [15] and Apache Flink [16] have followed. These platforms provide a highly parallel distributed computing environment that uses thousands of commodity machines to store and analyze large datasets quickly and efficiently.
Data generated by next-generation sequencing machines can be analyzed efficiently on these platforms. Some initiatives to use platforms like Hadoop for sequence alignment have already been taken, such as CloudBurst [5], CloudAligner and BlastReduce, and their results were effective and promising. Most of these Hadoop-based sequence alignment algorithms are based on the RMAP and BLAST sequence alignment techniques. These techniques use hashing, which consumes a large chunk of memory: a hash table is created for the reads or for the reference genome and matched against the genome or the reads, respectively, to find the locations of reads in the reference genome.

In recent years, however, sequence alignment tools such as Bowtie and BWA [11], which are based on the Burrows-Wheeler Transform (BWT), have become highly popular due to their higher memory efficiency and their support for flexible read lengths. Creating the index of the reference genome is the most time-consuming part of BWT-based alignment tools. Therefore, we propose SparkBurst, a MapReduce-based alignment tool implemented on Apache Spark, which uses parallel distributed computing both to generate the index and to find reads in the reference genome using that index. Spark can be up to x [15] faster than Hadoop for many large-scale data analysis problems by exploiting its in-memory computing capabilities.

SparkBurst is a reference genome indexing tool. An index such as a suffix array or FM-index is generated for the reference genome by running hundreds of commodity machines in parallel. Reads are then distributed over the machines, which use the index to discover the location of each read in the reference genome in parallel.

Objectives

The main objective of the project is to design and implement the most widely used BWT-based genome sequencing technique efficiently, parallelized through the MapReduce framework, so that it can align reads of various lengths to a reference genome and gain further performance by exploiting the capabilities of the Spark platform.

Background and Related Work

Next generation sequencing machines typically produce reads on the scale of millions of bp in a single run. For existing sequence alignment techniques, mapping this large volume of data to a genome such as the human genome is a great challenge. Many new sequence alignment programs have been developed in the last decade to tackle the problem of accurate and efficient read mapping for such huge datasets.

Techniques like Eland (Cox, 2007, unpublished material), SHRiMP (cs.toronto.edu/shrimp), RMAP [1], MAQ [2], ZOOM [3], SeqMap [4] and CloudBurst [5] hash the read sequences and scan through the reference genome. Their drawback is the overhead of scanning the whole reference genome even when only a few reads are aligned. A second category of techniques, such as SOAPv1 [6], PASS [7], MOM [8] and ProbeMatch [9], hash the genome instead. These techniques can be parallelized easily, but they require a large volume of storage to hold the genome index.

Recently, string matching algorithms based on the BWT [10] have drawn the attention of many research groups. Tools like Bowtie [12] and BWA [11], which are based on the Burrows-Wheeler Transform, have become very popular because of their superior memory efficiency and their support for flexible seed lengths. These BWT-based sequence alignment tools provide fast mapping of short DNA reads against a reference genome sequence with a small memory footprint, using a data structure such as the FM-index built atop the BWT. Tools like Bowtie and BWA are very efficient as long as the reads and the reference genome are small.
But with the evolution of next-generation sequencing (NGS) machines, data sizes have grown beyond the capabilities of single-machine alignment techniques. Therefore, we need alignment techniques that run in parallel on a cluster of many machines to give faster and more efficient results. Many cluster-based alignment techniques have been proposed in recent years. Michael C. Schatz proposed CloudBurst [5], a highly sensitive read mapping algorithm. CloudBurst uses the Hadoop implementation of MapReduce to parallelize the task across multiple machines. More recently, Tung Nguyen et al. proposed a cloud-based sequence alignment algorithm called CloudAligner [13]. CloudAligner achieves a performance gain over CloudBurst through better partitioning and parallel processing of both the reference genome and the read data. It also has a web-based interface, which makes it more user friendly.

Methodology

Suffix Index Binary Search (Approach 1)

SparkBurst is a MapReduce-based sequence alignment tool implemented on Apache Spark. There are a few existing MapReduce-based alignment tools, such as CloudBurst and CloudAligner. Both have shown a great performance gain compared to single-machine alignment tools. Most of the time, CloudAligner outperforms CloudBurst in running time. CloudAligner takes reads and reference genome data as input and splits the reads over different mappers as keys, with the whole reference genome as the value. Every mapper finds the location of its key (read) in the value (reference genome) and produces a <key, value> pair (<read, GenomicRegion>) as the result. These alignment tools concentrate on splitting reads over mappers while keeping the reference genome the same, so every mapper must process the whole genome to find a read.

The Suffix Index Binary Search approach, in contrast, is a MapReduce-based alignment technique that creates an index (a suffix array) for the reference genome, so that every mapper needs far less computation to search for a read in the reference genome. It has two MapReduce phases.

In the first phase, mappers take the reference genome as input from HDFS and generate suffixes of length k as keys and their locations in the reference genome as values. A shuffle task then sorts these keys, and the reducers combine the shuffle output into partitions of values (suffix locations in the reference genome) sorted according to their keys (suffixes). The output of the first phase is therefore a set of suffix array partitions. Each partition holds the sorted suffixes that start with a particular character. For example, for the human reference genome, 5 partitions can be generated, holding all sorted suffixes starting with the characters A, C, G, T and N respectively, so that during the second phase every read is searched in roughly 1/5th of the reference genome index.

In the second phase, every mapper takes some reads as input, searches every read in the reference genome index partition that matches its starting character, and generates a <key, value> pair where the key is the read and the value is the location of the read in the reference genome.

[Figure: MapReduce Architecture for Phase I]

    Foreach Line T in ReferenceGenome
        flatmap(line offset, T)
            Foreach char C in T
                Yield(C, Index)
            End Foreach
        End flatmap
        store at RDD1
    End Foreach
    RDD2 = RDD1.sortByValue
    Foreach entry in RDD1
        flatmap(C, Index)
            Find string S starting at location Index using RDD2
            Yield(S, Index)
        End flatmap
        store in RDD3
    End Foreach
    Final_Index = RDD3.sortByKey.Values()

Figure 1. 1st Algorithm for Phase I.

    Foreach Read R in ReadData
        flatmap(line offset, R)
            Find R in RG using Final_Index, use Binary Search
            Yield(R, Index)
        End flatmap
    End Foreach
    CombineAllResults

Figure 2. 1st Algorithm for Phase II.
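The two phases of Approach 1 can be illustrated on a single machine. The following is only a minimal sketch in plain Python, not the report's Spark implementation: the function names `build_partitioned_index` and `find_read` are hypothetical, the partitions are ordinary in-memory lists rather than RDDs, and suffixes are truncated to an assumed length `k` as the Phase I description suggests.

```python
from bisect import bisect_left
from collections import defaultdict


def build_partitioned_index(genome, k):
    """Phase I sketch: one sorted partition per starting character.

    Each entry is (first k characters of the suffix, start position).
    """
    partitions = defaultdict(list)
    for pos in range(len(genome)):
        partitions[genome[pos]].append((genome[pos:pos + k], pos))
    # Sorting plays the role of the shuffle/sortByKey step.
    return {char: sorted(entries) for char, entries in partitions.items()}


def find_read(genome, index, read, k):
    """Phase II sketch: binary search inside the partition selected by the
    read's first character; returns all exact-match positions, sorted."""
    entries = index.get(read[:1], [])
    prefix = read[:k]
    # (prefix,) compares lower than any (prefix, pos), so this finds the
    # first entry whose key is >= prefix.
    i = bisect_left(entries, (prefix,))
    hits = []
    while i < len(entries) and entries[i][0].startswith(prefix):
        pos = entries[i][1]
        # Reads longer than k need a check against the genome itself.
        if genome.startswith(read, pos):
            hits.append(pos)
        i += 1
    return sorted(hits)
```

For example, with `genome = "ACGTACGTNACG"` and `k = 4`, the index has five partitions (A, C, G, T, N), and `find_read(genome, index, "ACG", 4)` only inspects the A partition. In a cluster setting each partition would live on a different worker, which is where the roughly 1/5th search space per read comes from.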

[Figure: MapReduce Architecture for Phase II]

SparkBurst BWT-FM (Approach 2)

After completing the first, binary-search-based algorithm, we moved on to a BWT-based algorithm. This BWT-FM based algorithm is built on top of the previous one. It builds the BWT by taking, for each sorted suffix, the character that precedes it (the last column of the sorted rotations). After this transformation, the counts of all characters are calculated and stored in an array; for each character, this Count array records how many characters in the text are smaller than it. Then, using the BWT, the occurrences of every character up to each index are counted, for example the count of G in the BWT up to the 30th index.

    Foreach Line T in ReferenceGenome
        flatmap(line offset, T)
            Foreach char C in T
                Yield(C, Index)
            End Foreach
        End flatmap
        store at RDD1
    End Foreach
    RDD2 = RDD1.sortByValue
    Foreach entry in RDD1
        flatmap(C, Index)
            Find string S starting at location Index using RDD2
            Yield(S, Index)
        End flatmap
        store in RDD3
    End Foreach
    Final_Index = RDD3.sortByKey.Values()
    Make Occurrence(O) array
    Now calculate Count(C) array

Figure 3. 2nd Algorithm for Phase I.

A query is searched in the BWT-FM based algorithm as follows. The algorithm starts from the last character of the query and calculates low and high. It then repeats this step for each preceding character, using the previously calculated low and high, and continues as long as low is less than high and the first character of the query has not yet been processed.

    Foreach Read R in ReadData
        flatmap(line offset, R)
            Find R in RG using Final_Index, use ExactMatch Algorithm
            Yield(R, Index)
        End flatmap
    End Foreach
    CombineAllResults

Figure 4. 2nd Algorithm for Phase II.
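The Count and Occurrence arrays and the low/high iteration described above correspond to the standard FM-index backward search. The sketch below is a single-machine illustration under stated assumptions, not the report's Spark code: `build_fm_index` and `exact_match` are hypothetical names, the suffix array is built by naive sorting, a `$` sentinel (assumed not to occur in the genome) is appended, and the Occurrence table is stored uncompressed.

```python
def build_fm_index(genome):
    """Build the suffix array, Count(C) and Occurrence(O) tables."""
    text = genome + "$"  # sentinel, sorts before A, C, G, N, T
    # Naive suffix sorting; fine for a sketch, too slow for real genomes.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$").
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # count[c]: number of characters in text strictly smaller than c.
    count, total = {}, 0
    for c in alphabet:
        count[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return suffix_array, count, occ


def exact_match(read, suffix_array, count, occ):
    """Backward search: process the read from its last character to its
    first, narrowing the half-open suffix-array interval [low, high)."""
    low, high = 0, len(suffix_array)
    for ch in reversed(read):
        if ch not in count:
            return []
        low = count[ch] + occ[ch][low]
        high = count[ch] + occ[ch][high]
        if low >= high:  # interval is empty: no occurrence
            return []
    return sorted(suffix_array[low:high])
```

Each step of the loop is two table lookups and no string comparison, which is the property the report relies on when it contrasts this approach with the binary search of Approach 1. The interval here is half-open, so the stopping condition is `low >= high`; an implementation with an inclusive upper bound would test `low > high` instead.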

Results

These are the datasets which were used for evaluation. The 100k.fa dataset was taken from [...]; AML and BRL were given by Agilent Technologies, and g_cat_set was taken from bioinformatics.bc.edu.

    Name of Dataset    BP length    No. of Queries
    100k.fa
    AML
    BRL
    g_cat_set

The results in the table below are from Suffix Index Binary Search (Approach 1).

    Reference Genome    Query Genome    Time Taken (in secs)
    s_suis              100k.fa         100
    CHR-21              AML Text        420
    CHR-21              BRL Text        480
    CHR-22              AML Text        433
    CHR-22              BRL Text        527

SparkBurst algorithm results are presented below. The SparkBurst algorithm uses the BWT-based FM-index.

Graph 1. Running Time for SparkBurst Algorithm. Y-axis: Time Taken. X-axis: Query Genome, Reference Genome (100k.fa, s_suis; AML Text, CH21; BRL Text, CH21; g_cat_set, CH21).

As we can see from the graph, the time taken by SparkBurst increases as the number of queries increases. We saw a dip for the last dataset because the query dataset is from cat while the reference is human, so fewer queries matched.

A comparison of SparkBurst with Suffix Index Binary Search is given below.

Graph 2. Comparison of SparkBurst (SP) vs Suffix Index Binary Search (SIBS). X-axis: Query Genome, Reference Genome (100k.fa, s_suis; AML Text, CH21; BRL Text, CH21).

A comparison of SparkBurst with CloudBurst is given below.

Graph 3. Running Time Comparison, SparkBurst (SP) v/s CloudBurst (CB). Y-axis: Time Taken. X-axis: Query Genome, Reference Genome, Algorithm (one s_suis reference and three CH21 references, each run with SP and CB).

Discussion and Conclusions

Several results can be discussed:

The BWT-FM based algorithm was previously implemented on a single machine; SparkBurst parallelizes this approach. As a result, the time to locate the index of a query in the reference genome decreases significantly.

As the number of nodes increases, computation time decreases. As the base pair length increases, computation time increases. As Graph 1 shows, even a length of 400 bp does not take too much time. The algorithm also works for variable lengths, even for a 400 bp long query.

The second approach performs better than the first (Graph 2) because suffix array binary search compares the suffix and the query at every step, whereas SparkBurst (BWT-FM), which is built on top of the first algorithm, only does lookups. This saves a lot of time.

From the comparison graph (Graph 3), SparkBurst appears to perform much better than CloudBurst. SparkBurst uses a BWT-FM based method which does not compare the query again and again; it only performs some lookups, which reduces the time significantly. CloudBurst, in contrast, compares the seeds and then extends them, which takes a lot of time. Spark also has the added advantage of in-memory computation.

Apart from this, we also looked into an inexact-search-based implementation. There were many challenges, because insertion and deletion cases are difficult to handle: if we consider all possible cases (i.e. Match, Mismatch, Insertion, Deletion), in total we would have 9 cases, which would enlarge the search space and would be difficult to handle memory-efficiently. To solve this problem we explored different techniques, which are explained below. There can be multiple approaches. One of them is to find all the exact matches first and separate out the inexact matches.
Then work on the inexact matches: discard those that show a high degree of inexactness and process the remainder. The remaining inexact sequences can be searched using the seed-and-extend algorithm and the Smith-Waterman dynamic programming algorithm.

Deliverables

An algorithm which can find the index of the query in the reference genome. SparkBurst will also have the following advantages:

Variable Length Support
SparkBurst can find the index of a query of any length rather than only a fixed length, which gives an added advantage.

Improved Load Balancing
SparkBurst has an edge over recent alignment tools in load balancing. During the first phase of reference genome indexing, the reference genome is split into lines and distributed to mappers line by line. In earlier genome indexing techniques on MapReduce, the reference data is split into equal partitions according to the number of machines, so the total time to index the reference is the time taken by the slowest machine in the cluster. In SparkBurst, we distribute data line by line: after giving n lines to n machines, the (n+1)th line is given to whichever machine is free. Faster machines therefore get more lines and slower machines get fewer, which optimizes execution time.

Fault-tolerant
Earlier reference genome indexing techniques split the data into as many equal partitions as there are machines available, so every machine holds a 1/n part of the reference genome, where n is the number of machines. Since tools like Hadoop and Spark run on cheap commodity hardware, the chance of a machine crashing is high. After such a crash, the whole partition allocated to the crashed machine has to be re-computed. SparkBurst tackles this problem efficiently. It distributes data to machines line by line instead of in partitions, so the crash of a machine causes only a single line to be re-computed, which does not cost much computation time. Thus, SparkBurst will be more fault-tolerant than existing genome indexing techniques.

Timeline

    Literature reading                                24/12/15    40
    Applied the simple sorting technique              15/02/16    30
    Applying the BWT for Exact Matching               20/03/16    30
    Explore the techniques for In-Exact Matching      24/04/16    15
    Analyse the results                               15/05/16    8

Chart: Timeline Before and During 8th Semester (date axis from 15/12/15 to 13/05/16; tasks as listed above).

We planned to work on the ExactMatch technique and came up with a BWT-FM based algorithm to solve the problem. After that, we compared its results with the CloudBurst algorithm, which is already implemented on a MapReduce platform. Apart from that, we had also planned to explore inexact techniques for the inexact matching problem and to present different ways to solve it.

References

[1] Andrew D. Smith, Zhenyu Xuan, and Michael Q. Zhang. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics, 2008.
[2] Li, H. et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 2008.
[3] Lin, H. et al. ZOOM! Zillions of oligos mapped. Bioinformatics, 2008.
[4] Jiang, H., Wong, W. H. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics, 2008.
[5] Michael C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 2009.
[6] Li, R. et al. SOAP: short oligonucleotide alignment program. Bioinformatics, 2008.
[7] Campagna et al. PASS: a program to align short sequences. Bioinformatics, 2009.
[8] Eaves, H. L., Gao, Y. MOM: maximum oligonucleotide mapping. Bioinformatics, 2009.
[9] Jung Kim. ProbeMatch: rapid alignment of oligonucleotides to genome allowing both gaps and mismatches. Bioinformatics, 2009.
[10] Burrows, M., Wheeler, D. J. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA, 1994.
[11] Heng Li, Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009.
[12] Langmead, B. et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 2009.
[13] Tung Nguyen. CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping. BMC Research Notes, 2011.
[14] Hadoop Map/Reduce tutorial. common/docs/r0.20.0/mapred tutorial.html.
[15] Apache Spark Map/Reduce tutorial.
[16] Apache Flink.


More information

SAPLING: Suffix Array Piecewise Linear INdex for Genomics Michael Kirsche

SAPLING: Suffix Array Piecewise Linear INdex for Genomics Michael Kirsche SAPLING: Suffix Array Piecewise Linear INdex for Genomics Michael Kirsche mkirsche@jhu.edu StringBio 2018 Outline Substring Search Problem Caching and Learned Data Structures Methods Results Ongoing work

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Integrating GPU-Accelerated Sequence Alignment and SNP Detection for Genome Resequencing Analysis

Integrating GPU-Accelerated Sequence Alignment and SNP Detection for Genome Resequencing Analysis Integrating GPU-Accelerated Sequence Alignment and SNP Detection for Genome Resequencing Analysis Mian Lu, Yuwei Tan, Jiuxin Zhao, Ge Bai, and Qiong Luo Hong Kong University of Science and Technology {lumian,ytan,zhaojx,gbai,luo}@cse.ust.hk

More information

Hardware Acceleration of Genetic Sequence Alignment

Hardware Acceleration of Genetic Sequence Alignment Hardware Acceleration of Genetic Sequence Alignment J. Arram 1,K.H.Tsoi 1, Wayne Luk 1,andP.Jiang 2 1 Department of Computing, Imperial College London, United Kingdom 2 Department of Chemical Pathology,

More information
