SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform
MTP End-Sem Report submitted to Indian Institute of Technology, Mandi for partial fulfillment of the degree of B. Tech. by Shivam Satija (B12020) under the guidance of Dr. Arti Kashyap, Associate Professor. School of Computing and Electrical Engineering, Indian Institute of Technology Mandi. 28th May 2016

CERTIFICATE OF APPROVAL

Certified that the End-Sem Report entitled "SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform", submitted by Shivam Satija (B12020) to the Indian Institute of Technology Mandi for the partial fulfilment of the degree of B. Tech., has been accepted after examination held today.

Date: 28th May 2016
Place: Kamand, H. P., India
Dr. Varun Dutt, Faculty Advisor

CERTIFICATE

This is to certify that the End-Sem Report titled "SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform", submitted by Shivam Satija (B12020) to the Indian Institute of Technology, Mandi, is a record of bonafide work under my (our) supervision and is worthy of consideration for partial fulfilment of the degree of B. Tech. of the Institute.

Date: 28th May 2016
Place: Kamand, H. P., India
Dr. Arti Kashyap (Guide), Faculty Supervisor(s)

DECLARATION BY THE STUDENT

This is to certify that the End-Sem Report titled "SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform", submitted by me to the Indian Institute of Technology Mandi for partial fulfillment of the degree of B. Tech., is a bonafide record of work carried out by me under the supervision of Dr. Arti Kashyap. The contents of this MTP, in full or in parts, have not been submitted to any other Institute or University for partial fulfillment of any degree or diploma.

Date: 28th May 2016
Place: Kamand, H. P., India
Shivam Satija, B12020
Acknowledgments

I would like to express my special thanks and gratitude to my guide, Dr. Arti Kashyap, who gave me the golden opportunity to do this project on the topic "SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform" under her supervision. I would also like to thank Mr. Sanjay Rathee (Ph.D. Scholar, IIT Mandi) for helping me in this project.

Shivam Satija
Abstract

Next generation sequencing (NGS) technologies are generating so much genetic data that conventional single-processor sequence alignment tools are unable to keep pace with them. Cloud computing and MapReduce frameworks, which use thousands of commodity machines to store and process huge datasets, have therefore emerged as the best solution to this problem of growing data. In this project, we propose a MapReduce based sequence alignment technique implemented on Apache Spark, called SparkBurst. It is a reference genome indexing tool that generates an index (such as a suffix array or a BWT-based FM-index) for the reference genome by running the computation on a number of commodity machines in parallel. Reads are then distributed over the machines, which use the index to discover the location of each read in the reference genome in parallel.

Keywords: DNA Sequence, BWT, Spark, String Matching, MapReduce
Table of Contents

Acknowledgments
Abstract
1. Introduction
2. Objectives
3. Background and Related Work
4. Methodology
5. Results
6. Discussion and Conclusions
7. Deliverables
8. Timeline
References
Introduction

Genome sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or technology that is used to determine the order of the four bases adenine, guanine, cytosine, and thymine in a strand of DNA. An important goal of genomics is to determine the position (index) of a particular sequence in the reference genome. This is an instance of the string matching problem, for which various algorithms exist.

Computer technology is increasingly used to manage biological information (bioinformatics). Computers are used to collect, store, analyze and integrate genetic and biological information for genotyping, metagenomics, SNP (single nucleotide polymorphism) discovery and personal genomics. The rapid development of next generation sequencing technologies has dramatically reduced the time and cost of DNA sequencing and, at the same time, dramatically increased the size of the genetic data produced by these machines.

There are existing single-machine sequence alignment tools such as BLAST, SOAP [6], RMAP [1] and MAQ [2]. However, next generation machines produce billions of short DNA sequences (reads) in a few days, and the size of sequence data is projected to keep increasing dramatically. Single-machine alignment tools will not be able to handle such huge datasets, so highly distributed computing will be required.

In recent years, a parallel computing framework called MapReduce, which can use thousands of commodity machines for distributed computing, has emerged. Many MapReduce based platforms such as Hadoop [14], Apache Spark [15] and Apache Flink [16] have followed. These platforms provide a highly parallel distributed computing environment that uses thousands of commodity machines to store and analyze large datasets quickly and efficiently.
Data generated by next-generation sequencing machines can be analyzed efficiently on these platforms. Some initiatives towards using platforms like Hadoop for sequence alignment have already been taken, such as CloudBurst [5], CloudAligner and BlastReduce, and their results were promising. Most of these Hadoop based sequence alignment algorithms are based on the RMAP and BLAST alignment techniques. These techniques rely on hashing, which consumes a large amount of memory: they build hash tables for the reads or for the reference genome and match them against the genome or the reads, respectively, to find the locations of reads in the reference genome.

In recent years, however, sequence alignment tools such as Bowtie and BWA [11], which are based on the Burrows-Wheeler Transform (BWT), have become highly popular due to their higher memory efficiency and their support for flexible read lengths. Creating the index of the reference genome is the most time consuming part of BWT based alignment tools. Therefore, we propose a MapReduce based alignment tool implemented on Apache Spark, called SparkBurst, which uses parallel distributed computing both to generate the index and to find reads in the reference genome using that index. Spark can be up to x [15] faster than Hadoop for many large scale data analysis problems by exploiting its in-memory computing capabilities.

SparkBurst is a reference genome indexing tool. An index such as a suffix array or FM-index is generated for the reference genome by running hundreds of commodity machines in parallel. Reads are then distributed over the machines, which use the index to discover the location of each read in the reference genome in parallel.
Objectives

The main objective of the project is to design and implement the most widely used BWT based sequence mapping technique efficiently, parallelized through the MapReduce framework, so that it can align reads of various lengths to a reference genome and gain further performance by exploiting the capabilities of the Spark platform.

Background and Related Work

The next generation sequencing machines typically produce million bp reads on a single run of the machine. For existing sequence alignment techniques, mapping this large volume of data to a genome like the human genome is a great challenge. Many new sequence alignment programs have been developed in the last decade to tackle the problem of accurate and efficient read mapping for such huge datasets.

Techniques like Eland (Cox, 2007, unpublished material), SHRiMP ( cs.toronto.edu/shrimp), RMAP [1], MAQ [2], ZOOM [3], SeqMap [4], and CloudBurst [5] hash the read sequences and scan through the reference genome. Their drawback is the overhead of scanning the whole reference genome even when only a few reads are aligned. A second category of techniques, such as SOAPv1 [6], PASS [7], MOM [8] and ProbeMatch [9], hash the genome instead. These techniques can be parallelized easily, but they require a large volume of storage for the genome index.

Recently, string matching algorithms based on the BWT [10] have drawn the attention of many research groups. Tools like Bowtie and BWA, which are based on the Burrows-Wheeler Transform, have become very popular because of their superior memory efficiency and their support for flexible seed lengths. These BWT-based sequence alignment tools provide fast mapping of short DNA reads against a reference genome sequence with a small memory footprint, using a data structure such as the FM-index built atop the BWT. Tools like Bowtie and BWA are very efficient as long as the size of the reads and the reference genome is small.
However, with the evolution of next-generation sequencing (NGS) machines, the size of the data has grown beyond the capabilities of single-machine alignment techniques. We therefore need alignment techniques that run in parallel on a cluster of many machines to give faster and more efficient results. Many cluster based alignment techniques have been proposed in recent years. Michael C. Schatz proposed CloudBurst, a highly sensitive read mapping algorithm that uses the Hadoop implementation of MapReduce to parallelize the task across multiple machines. More recently, Tung Nguyen et al. proposed a cloud based sequence alignment algorithm called CloudAligner. CloudAligner achieves a performance gain over CloudBurst through better partitioning and parallel processing of the reference genome as well as the read data. It also has a web-based interface, which makes it more user friendly.

Methodology

Suffix Index Binary Search (Approach 1)

SparkBurst is a MapReduce based sequence alignment tool implemented on Apache Spark. There are a few existing MapReduce based alignment tools, such as CloudBurst and CloudAligner. Both have shown a great performance gain compared to single-machine alignment tools, and most of the time CloudAligner outperforms CloudBurst in running time. CloudAligner takes reads and reference genome data as input
and splits the reads over different mappers as keys, with the whole reference genome as the value. Every mapper finds the location of its key (read) in the value (reference genome) and produces a <key, value> pair (<read, GenomicRegion>) as the result. These alignment tools concentrate on splitting the reads over mappers while keeping the reference genome the same, so every mapper has to process the whole genome to find a read.

The Suffix Index Binary Search approach, in contrast, is a MapReduce based alignment technique which creates an index (a suffix array) for the reference genome so that every mapper needs far fewer computations to search for a read in the reference genome. It has two MapReduce phases.

In the first phase, mappers take the reference genome as input from HDFS and generate suffixes of length k as keys and their locations in the reference genome as values. A shuffle task then sorts these keys, and the reducers combine the outputs of the shuffle and generate partitions of values (suffix locations in the reference genome) sorted according to their keys (suffixes). The output of the first phase is a set of partitions of the suffix array. Each partition mainly holds the sorted suffixes that start with a particular character. For example, for the human reference genome 5 partitions can be generated, holding all sorted suffixes starting with the characters A, C, G, T and N respectively, so that during the second phase every read needs to be searched in only about 1/5th of the reference genome index.

In the second phase, every mapper takes some reads as input, searches for every read in the reference genome index partition chosen by the read's starting character, and generates a <key, value> pair where the key is the read and the value is the location of the read in the reference genome.

[Figure: MapReduce Architecture for Phase I]

    Foreach Line T in ReferenceGenome
        flatMap(line offset, T)
            Foreach char C in T
                Yield(C, Index)
            End Foreach
        End flatMap
        Store at RDD1
    End Foreach
    RDD2 = RDD1.sortByValue
    Foreach entry in RDD1
        flatMap(C, Index)
            Find string S starting at location Index using RDD2
            Yield(S, Index)
        End flatMap
        Store in RDD3
    End Foreach
    Final_Index = RDD3.sortByKey.Values()

Figure 1. 1st Algorithm for Phase I.

    Foreach Read R in ReadData
        flatMap(line offset, R)
            Find R in RG using Final_Index with Binary Search
            Yield(R, Index)
        End flatMap
    End Foreach
    CombineAllResults

Figure 2. 1st Algorithm for Phase II.
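The two phases of Approach 1 can be illustrated with a minimal single-machine sketch in plain Python. This is not the Spark implementation from the report: the function names `build_partitioned_index` and `find_read` are hypothetical, whole suffixes are compared rather than suffixes of length k, and the naive sort is only suitable for small inputs. It shows the idea of partitioning suffix positions by first character and then binary searching one partition per read.

```python
from collections import defaultdict


def build_partitioned_index(genome):
    """Phase I sketch: group suffix start positions by their first
    character, then sort each group by the suffix it points to."""
    partitions = defaultdict(list)
    for pos, ch in enumerate(genome):
        partitions[ch].append(pos)
    for positions in partitions.values():
        # Naive suffix sort; fine for a demo, too slow for real genomes.
        positions.sort(key=lambda p: genome[p:])
    return dict(partitions)


def find_read(genome, partitions, read):
    """Phase II sketch: binary search only the partition selected by
    the first character of the read. Returns all match positions."""
    if not read:
        return []
    positions = partitions.get(read[0], [])
    m = len(read)

    def prefix(i):
        start = positions[i]
        return genome[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= read.
    lo, hi = 0, len(positions)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < read:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > read.
    hi = len(positions)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= read:
            lo = mid + 1
        else:
            hi = mid
    return sorted(positions[first:lo])


if __name__ == "__main__":
    genome = "ACGTACGTTN"  # toy reference, assumed for illustration
    index = build_partitioned_index(genome)
    for read in ["ACG", "GT", "TT", "GG"]:
        print(read, find_read(genome, index, read))
```

In a distributed setting each partition could live on a different worker, so a read only has to be routed to the worker that holds the partition for its first character.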
[Figure: MapReduce Architecture for Phase II]

SparkBurst BWT-FM (Approach 2)

After completing the first, binary search based algorithm, we moved on to a BWT-based algorithm. This BWT-FM based algorithm is built on top of the previous one. It builds the BWT using the last characters of the sorted suffixes. After this transformation, the count of every character is calculated and stored in an array; for each character, this Count array records the number of characters in the text that are smaller than it. Then, using the BWT, the number of occurrences of each character up to a particular index is counted, for example the count of G in the BWT up to the 30th index.

    Foreach Line T in ReferenceGenome
        flatMap(line offset, T)
            Foreach char C in T
                Yield(C, Index)
            End Foreach
        End flatMap
        Store at RDD1
    End Foreach
    RDD2 = RDD1.sortByValue
    Foreach entry in RDD1
        flatMap(C, Index)
            Find string S starting at location Index using RDD2
            Yield(S, Index)
        End flatMap
        Store in RDD3
    End Foreach
    Final_Index = RDD3.sortByKey.Values()
    Make Occurrence(O) array
    Now calculate Count(C) array

Figure 3. 2nd Algorithm for Phase I.

A query is searched in the BWT-FM based algorithm as follows. The algorithm starts from the last character of the query and calculates the bounds low and high. It then repeats this step for each preceding character of the query, using the previously calculated low and high, and continues while low is less than high and the first character of the query has not yet been processed.

    Foreach Read R in ReadData
        flatMap(line offset, R)
            Find R in RG using Final_Index with ExactMatch Algorithm
            Yield(R, Index)
        End flatMap
    End Foreach
    CombineAllResults

Figure 4. 2nd Algorithm for Phase II.
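The exact-match step of Approach 2 can be sketched on a single machine in plain Python, as a hedged illustration rather than the report's Spark code. The names `build_fm_index` and `backward_search` are hypothetical, a `$` sentinel is assumed to sort before every genome character, and the full Occurrence table is stored without sampling, which a real tool would avoid for memory reasons. The search is the standard FM-index backward search over a half-open interval [low, high).

```python
def build_fm_index(genome):
    """Build the suffix array, BWT, Count and Occurrence tables."""
    text = genome + "$"  # sentinel, assumed smaller than all genome characters
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # count[c] = number of characters in text strictly smaller than c.
    count = {}
    total = 0
    for c in alphabet:
        count[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, bwt, count, occ


def backward_search(read, sa, count, occ):
    """Exact matching: process the read from its last character to its
    first, narrowing [low, high) using only table lookups."""
    if not read:
        return []
    low, high = 0, len(sa)
    for ch in reversed(read):
        if ch not in count:
            return []
        low = count[ch] + occ[ch][low]
        high = count[ch] + occ[ch][high]
        if low >= high:
            return []  # interval is empty, the read does not occur
    return sorted(sa[i] for i in range(low, high))


if __name__ == "__main__":
    sa, bwt, count, occ = build_fm_index("ACGTACGTTN")  # toy reference
    print(bwt)
    print(backward_search("ACG", sa, count, occ))
```

Each step of the loop performs two lookups per character and never compares the read against the reference text directly, which is the property the report relies on when comparing Approach 2 with the binary search of Approach 1.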
Results

Four datasets were used for evaluation: 100k.fa, AML, BRL and g_cat_set. AML and BRL were given by Agilent Technologies, and g_cat_set was taken from bioinformatics.bc.edu.

    Name of Dataset    BP length    No. of Queries
    100k.fa
    AML
    BRL
    g_cat_set

The results in the table below are from Suffix Index Binary Search (Approach 1).

    Reference Genome    Query Genome    Time Taken (in secs)
    s_suis              100k.fa         100
    CHR-21              AML Text        420
    CHR-21              BRL Text        480
    CHR-22              AML Text        433
    CHR-22              BRL Text        527

The results of the SparkBurst algorithm, which uses the BWT based FM-index, are presented below.

[Graph 1. Running Time for SparkBurst Algorithm. Y-axis: Time Taken. X-axis: Query Genome, Reference Genome (100k.fa, s_suis; AML Text, CH21; BRL Text, CH21; g_cat_set, CH21).]
As we can see from Graph 1, the running time of SparkBurst increases as the number of queries increases. We saw a dip for the last dataset because its queries come from the cat genome and are compared against the human genome, so fewer of them match.

A comparison of SparkBurst with Suffix Index Binary Search is given below.

[Graph 2. Comparison of SparkBurst (SP) vs Suffix Index Binary Search (SIBS). X-axis: Query Genome, Reference Genome (100k.fa, s_suis; AML Text, CH21; BRL Text, CH21).]

A comparison of SparkBurst with CloudBurst is given below.

[Graph 3. Running Time Comparison, SparkBurst (SP) v/s CloudBurst (CB). Y-axis: Time Taken. X-axis: Query Genome, Reference Genome, Algorithm (reference genomes s_suis and CH21).]
Discussion and Conclusions

Several results can be discussed.

The BWT-FM based algorithm was previously implemented on a single machine; SparkBurst parallelizes this algorithm. As a result, the time needed to locate a query in the reference genome decreases significantly.

As the number of nodes increases, computation time decreases. As the base pair length increases, computation time increases. As shown in Graph 1, even a query length of 400 bp does not take too much time. The algorithm also works for variable lengths, including 400 bp long queries.

The second approach performs better than the first (Graph 2) because suffix array binary search compares the suffix and the query at every step, whereas SparkBurst (BWT-FM) only performs lookups. SparkBurst is built on top of the first algorithm, and replacing comparisons with lookups saves a lot of time.

From the comparison graph (Graph 3), SparkBurst appears to perform much better than CloudBurst. SparkBurst uses the BWT-FM based method, which does not compare the query again and again and only performs some lookups, which reduces the time significantly, whereas CloudBurst compares the seeds and then extends them, which takes a lot of time. Spark also has the added advantage of in-memory computation.

Apart from this, we also looked into an implementation of inexact search. There were many challenges, because insertion and deletion cases are difficult to handle: if we consider all possible cases (i.e. match, mismatch, insertion, deletion), we would have 9 cases in total, which would enlarge the search space and be difficult to handle memory-efficiently. To solve this problem we explored different techniques, which are explained below.

There can be multiple approaches to this problem. One of them is to find all the exact matches first and separate out the inexact matches.
Then work on the inexact matches: discard those that show a high degree of inexactness and process the rest. The remaining inexact sequences can be searched using the seed-and-extend algorithm and the Smith-Waterman dynamic programming algorithm.

Deliverables

An algorithm which can find the index of a query in the reference genome. SparkBurst will also have the following advantages.

Variable Length Support

SparkBurst can find the index of a query of any length rather than only a fixed length, which is an added advantage.

Improved Load Balancing

SparkBurst has an edge over recent alignment tools in load balancing. During the first phase, reference genome indexing, the data from the reference genome is split into lines and distributed to mappers line by line. In earlier genome indexing techniques on MapReduce, the reference data is split into equal partitions depending on the number of
machines. Therefore, the total time to index the reference is the time taken by the slowest machine in the cluster. In SparkBurst, we distribute the data line by line: after giving n lines to n machines, the (n+1)th line is given to whichever machine is free. Faster machines therefore get more lines and slower machines get fewer, which optimizes the execution time.

Fault-tolerant

Earlier reference genome indexing techniques split the data into as many equal partitions as there are machines available, so every machine holds 1/n of the reference genome, where n is the number of machines. Since tools like Hadoop and Spark use cheap commodity hardware, the chance of a machine crashing is high. When such a crash happens, the whole partition allocated to the crashed machine has to be re-computed. SparkBurst tackles this problem efficiently: it distributes data to machines line by line instead of in partitions, so the crash of a machine causes only a single line to be re-computed, which costs little computation time. Thus, SparkBurst will be more fault-tolerant than existing genome indexing techniques.

Timeline

    Literature reading                               24/12/15    40
    Applied the simple sorting technique             15/02/16    30
    Applying the BWT for Exact Matching              20/03/16    30
    Explore the techniques for In-Exact Matching     24/04/16    15
    Analyse the results                              15/05/16    8

[Figure: Timeline Before and During 8th Semester. Axis dates: 15/12/15, 03/02/16, 24/03/16, 13/05/16. Tasks: Literature reading; Applied the simple sorting technique; Applying the BWT for Exact Matching; Explore the techniques for In-Exact Matching; Analyse the results.]

We planned to work on the exact match technique and came up with the BWT-FM based algorithm to solve the problem. After that, we compared its results with the CloudBurst algorithm (already implemented on a MapReduce platform). We had also planned to explore inexact matching techniques and to present different ways of solving the inexact matching problem.
References

[1] Andrew D. Smith, Zhenyu Xuan, and Michael Q. Zhang. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics, 2008.
[2] Li, H. et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 2008.
[3] Lin, H. et al. ZOOM! Zillions of oligos mapped. Bioinformatics, 2008.
[4] Jiang, H., Wong, W. H. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics, 2008.
[5] Michael C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 2009.
[6] Li, R. et al. SOAP: short oligonucleotide alignment program. Bioinformatics, 2008.
[7] Campagna et al. PASS: a program to align short sequences. Bioinformatics, 2009.
[8] Eaves, H. L., Gao, Y. MOM: maximum oligonucleotide mapping. Bioinformatics, 2009.
[9] Jung Kim. ProbeMatch: rapid alignment of oligonucleotides to genome allowing both gaps and mismatches. Bioinformatics, 2009.
[10] Burrows, M., Wheeler, D. J. A block-sorting lossless data compression algorithm. Technical report 124. Palo Alto, CA: Digital Equipment Corporation, 1994.
[11] Heng Li, Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009.
[12] Langmead, B. et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 2009.
[13] Tung Nguyen. CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping. BMC research, 2011.
[14] Hadoop Map/Reduce tutorial. common/docs/r0.20.0/mapred tutorial.html.
[15] Apache Spark Map/Reduce tutorial.
[16] Apache Flink.
More informationAchieving High Throughput Sequencing with Graphics Processing Units
Achieving High Throughput Sequencing with Graphics Processing Units Su Chen 1, Chaochao Zhang 1, Feng Shen 1, Ling Bai 1, Hai Jiang 1, and Damir Herman 2 1 Department of Computer Science, Arkansas State
More information2/26/2017. Originally developed at the University of California - Berkeley's AMPLab
Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second
More informationParallel Mapping Approaches for GNUMAP
2011 IEEE International Parallel & Distributed Processing Symposium Parallel Mapping Approaches for GNUMAP Nathan L. Clement, Mark J. Clement, Quinn Snell and W. Evan Johnson Department of Computer Science
More informationResearch Article Apriori Association Rule Algorithms using VMware Environment
Research Journal of Applied Sciences, Engineering and Technology 8(2): 16-166, 214 DOI:1.1926/rjaset.8.955 ISSN: 24-7459; e-issn: 24-7467 214 Maxwell Scientific Publication Corp. Submitted: January 2,
More informationDepartment of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang
Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';
More informationCLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,
More informationAnnouncements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm
Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before
More informationEmbedded Technosolutions
Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication
More informationIntroduction to Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015
Introduction to Read Alignment UCD Genome Center Bioinformatics Core Tuesday 15 September 2015 From reads to molecules Why align? Individual A Individual B ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGGTGTTATAATCGCCGCATGACATGATCAATGG
More informationAccurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing
Proposal for diploma thesis Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing Astrid Rheinländer 01-09-2010 Supervisor: Prof. Dr. Ulf Leser Motivation
More informationOpen Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments
Send Orders for Reprints to reprints@benthamscience.ae 368 The Open Automation and Control Systems Journal, 2014, 6, 368-373 Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing
More informationWhere We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)
More informationDatabase Systems CSE 414
Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create
More informationComparative Analysis of Range Aggregate Queries In Big Data Environment
Comparative Analysis of Range Aggregate Queries In Big Data Environment Ranjanee S PG Scholar, Dept. of Computer Science and Engineering, Institute of Road and Transport Technology, Erode, TamilNadu, India.
More informationApache Spark and Hadoop Based Big Data Processing System for Clinical Research
Apache Spark and Hadoop Based Big Data Processing System for Clinical Research Sreekanth Rallapalli 1,*, Gondkar R R 2 1 Research Scholar, R&D Centre, Bharathiyar University, Coimbatore, Tamilnadu, India.
More informationTwitter data Analytics using Distributed Computing
Twitter data Analytics using Distributed Computing Uma Narayanan Athrira Unnikrishnan Dr. Varghese Paul Dr. Shelbi Joseph Research Scholar M.tech Student Professor Assistant Professor Dept. of IT, SOE
More informationSequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.
Sequencing Short Alignment Tobias Rausch 7 th June 2010 WGS RNA-Seq Exon Capture ChIP-Seq Sequencing Paired-End Sequencing Target genome Fragments Roche GS FLX Titanium Illumina Applied Biosystems SOLiD
More informationIndexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze
Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information
More informationBRAT-BW: Efficient and accurate mapping of bisulfite-treated reads [Supplemental Material]
BRAT-BW: Efficient and accurate mapping of bisulfite-treated reads [Supplemental Material] Elena Y. Harris 1, Nadia Ponts 2,3, Karine G. Le Roch 2 and Stefano Lonardi 1 1 Department of Computer Science
More informationAn Improved Performance Evaluation on Large-Scale Data using MapReduce Technique
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,
More informationHigh-performance short sequence alignment with GPU acceleration
Distrib Parallel Databases (2012) 30:385 399 DOI 10.1007/s10619-012-7099-x High-performance short sequence alignment with GPU acceleration Mian Lu Yuwei Tan Ge Bai Qiong Luo Published online: 10 August
More informationAnalysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark
Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark PL.Marichamy 1, M.Phil Research Scholar, Department of Computer Application, Alagappa University, Karaikudi,
More informationSTUDENT GRADE IMPROVEMENT INHIGHER STUDIES
STUDENT GRADE IMPROVEMENT INHIGHER STUDIES Sandhya P. Pandey Assistant Professor, The S.I.A college of Higher Education, Dombivili( E), Thane, Maharastra. Abstract: In India Higher educational institutions
More informationMultithreaded FPGA Acceleration of DNA Sequence Mapping
Multithreaded FPGA Acceleration of DNA Sequence Mapping Edward B. Fernandez, Walid A. Najjar, Stefano Lonardi University of California Riverside Riverside, USA {efernand,najjar,lonardi}@cs.ucr.edu Jason
More informationI519 Introduction to Bioinformatics. Indexing techniques. Yuzhen Ye School of Informatics & Computing, IUB
I519 Introduction to Bioinformatics Indexing techniques Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Contents We have seen indexing technique used in BLAST Applications that rely
More information24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:
24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, 2010 3 BLAST and FASTA This lecture is based on the following papers, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationRNA-seq. Manpreet S. Katari
RNA-seq Manpreet S. Katari Evolution of Sequence Technology Normalizing the Data RPKM (Reads per Kilobase of exons per million reads) Score = R NT R = # of unique reads for the gene N = Size of the gene
More informationImproved VariantSpark breaks the curse of dimensionality for machine learning on genomic data
Shiratani Unsui forest by Σ64 Improved VariantSpark breaks the curse of dimensionality for machine learning on genomic data Oscar J. Luo Health Data Analytics 12 th October 2016 HEALTH & BIOSECURITY Transformational
More informationInternational Journal of Computer Engineering and Applications, Volume XI, Issue XI, Nov. 17, ISSN
International Journal of Computer Engineering and Applications, Volume XI, Issue XI, Nov. 17, www.ijcea.com ISSN 2321-3469 DNA PATTERN MATCHING - A COMPARATIVE STUDY OF THREE PATTERN MATCHING ALGORITHMS
More informationGPU-RMAP: Accelerating Short-Read Mapping on Graphics Processors
21 13th IEEE International Conference on Computational Science and Engineering GPU-RMAP: Accelerating Short-Read Mapping on Graphics Processors Ashwin M. Aji, Liqing Zhang and Wu-chun Feng Department of
More informationAligners. J Fass 23 August 2017
Aligners J Fass 23 August 2017 Definitions Assembly: I ve found the shredded remains of an important document; put it back together! UC Davis Genome Center Bioinformatics Core J Fass Aligners 2017-08-23
More informationComputational Architecture of Cloud Environments Michael Schatz. April 1, 2010 NHGRI Cloud Computing Workshop
Computational Architecture of Cloud Environments Michael Schatz April 1, 2010 NHGRI Cloud Computing Workshop Cloud Architecture Computation Input Output Nebulous question: Cloud computing = Utility computing
More informationIllumina Next Generation Sequencing Data analysis
Illumina Next Generation Sequencing Data analysis Chiara Dal Fiume Sr Field Application Scientist Italy 2010 Illumina, Inc. All rights reserved. Illumina, illuminadx, Solexa, Making Sense Out of Life,
More informationMasher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs
Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs Anas Abu-Doleh Erik Saule Kamer Kaya Ümit V. Çatalyürek Dept. of Biomedical Informatics Dept. of Electrical and Computer Engineering
More informationCATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING
CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING Amol Jagtap ME Computer Engineering, AISSMS COE Pune, India Email: 1 amol.jagtap55@gmail.com Abstract Machine learning is a scientific discipline
More informationCMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: Suffix trees Suffix arrays
CMSC423: Bioinformatic Algorithms, Databases and Tools Exact string matching: Suffix trees Suffix arrays Searching multiple strings Can we search multiple strings at the same time? Would it help if we
More informationAlignment of Long Sequences
Alignment of Long Sequences BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2009 Mark Craven craven@biostat.wisc.edu Pairwise Whole Genome Alignment: Task Definition Given a pair of genomes (or other large-scale
More informationCOMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING
Volume 119 No. 16 2018, 937-948 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING K.Anusha
More informationAligners. J Fass 21 June 2017
Aligners J Fass 21 June 2017 Definitions Assembly: I ve found the shredded remains of an important document; put it back together! UC Davis Genome Center Bioinformatics Core J Fass Aligners 2017-06-21
More informationResolving Load Balancing Issues in BWA on NUMA Multicore Architectures
Resolving Load Balancing Issues in BWA on NUMA Multicore Architectures Charlotte Herzeel 1,4, Thomas J. Ashby 1,4 Pascal Costanza 3,4, and Wolfgang De Meuter 2 1 imec, Kapeldreef 75, B-3001 Leuven, Belgium,
More informationBIOINFORMATICS ORIGINAL PAPER
BIOINFORMATICS ORIGINAL PAPER Vol. 27 no. 10 2011, pages 1351 1358 doi:10.1093/bioinformatics/btr151 Sequence analysis Advance Access publication March 30, 2011 Exact and complete short-read alignment
More informationHISAT2. Fast and sensi0ve alignment against general human popula0on. Daehwan Kim
HISA2 Fast and sensi0ve alignment against general human popula0on Daehwan Kim infphilo@gmail.com History about BW, FM, XBW, GBW, and GFM BW (1994) BW for Linear path Burrows M, Wheeler DJ: A Block Sor0ng
More informationELTMaestro for Spark: Data integration on clusters
Introduction Spark represents an important milestone in the effort to make computing on clusters practical and generally available. Hadoop / MapReduce, introduced the early 2000s, allows clusters to be
More informationData Mining of Genomic Data using Classification Algorithms on MapReduce Framework: A Survey
Data Mining of Genomic Data using Classification Algorithms on MapReduce Framework: A Survey Rajarshi Banerjee 1, Ravi Kumar Jha 1, Aditya Neel 1, Rituparna Samaddar (Sinha) 1 and Anindya Jyoti Pal 1 1
More informationEfficient Algorithm for Frequent Itemset Generation in Big Data
Efficient Algorithm for Frequent Itemset Generation in Big Data Anbumalar Smilin V, Siddique Ibrahim S.P, Dr.M.Sivabalakrishnan P.G. Student, Department of Computer Science and Engineering, Kumaraguru
More informationHigh-throughput Sequence Alignment using Graphics Processing Units
High-throughput Sequence Alignment using Graphics Processing Units Michael Schatz & Cole Trapnell May 21, 2009 UMD NVIDIA CUDA Center Of Excellence Presentation Searching Wikipedia How do you find all
More informationAMAS: optimizing the partition and filtration of adaptive seeds to speed up read mapping
AMAS: optimizing the partition and filtration of adaptive seeds to speed up read mapping Ngoc Hieu Tran 1, * Email: nhtran@ntu.edu.sg Xin Chen 1 Email: chenxin@ntu.edu.sg 1 School of Physical and Mathematical
More informationSlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching
SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching Ilya Y. Zhbannikov 1, Samuel S. Hunter 1,2, Matthew L. Settles 1,2, and James
More informationAccelrys Pipeline Pilot and HP ProLiant servers
Accelrys Pipeline Pilot and HP ProLiant servers A performance overview Technical white paper Table of contents Introduction... 2 Accelrys Pipeline Pilot benchmarks on HP ProLiant servers... 2 NGS Collection
More informationBig Data. Big Data Analyst. Big Data Engineer. Big Data Architect
Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION
More informationCloud Computing CS
Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part
More informationWelcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.
Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. In this page you will learn to use the tools of the MAPHiTS suite. A little advice before starting : rename your
More informationPresentation of the book BOOLEAN ARITHMETIC and its Applications
Presentation of the book BOOLEAN ARITHMETIC and its Applications This book is the handout of one Post Graduate Discipline, offered since 1973, named PEA - 5737 Boolean Equations Applied to System Engineering,
More informationIntroduction to Data Management CSE 344
Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)
More informationHADOOP FRAMEWORK FOR BIG DATA
HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further
More informationFast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases
Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases Robinson Silvester.A J. Cruz Antony M. Pratheepa, PhD ABSTRACT Emergent interest in genomic
More informationAnnouncements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414
Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s
More informationCloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationA Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana
School of Information Technology A Frequent Max Substring Technique for Thai Text Indexing Todsanai Chumwatana This thesis is presented for the Degree of Doctor of Philosophy of Murdoch University May
More information