Cache-efficient string sorting for Burrows-Wheeler Transform
Advait D. Karande, Sriram Saroop

What is the Burrows-Wheeler Transform?
- A pre-processing step for data compression
- Involves sorting all rotations of the block of data to be compressed
- Rationale: the transformed string compresses better

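Burrows-Wheeler Transform
- Input s: abraca
- Output s': caraab, I = 1 (the last column of the sorted rotations, and the row that holds the original string)
Sorted rotations:
  0  aabrac
  1  abraca
  2  acaabr
  3  bracaa
  4  caabra
  5  racaab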
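The abraca example can be reproduced with a naive transform that sorts all rotations using the C library qsort. This is only an illustrative sketch, not the implementation from the slides: the names bwt_naive, cmp_rotation, g_text and g_len are hypothetical, and comparing rotations character by character is the slow approach that suffix-array construction is meant to replace.

```c
#include <stdlib.h>
#include <string.h>

/* Context for the comparator: qsort has no user-data argument, so this
 * sketch uses file-scope variables (hypothetical names, not thread-safe). */
static const char *g_text;
static size_t g_len;

/* Compare two rotations of g_text, identified by their start offsets. */
static int cmp_rotation(const void *a, const void *b)
{
    size_t i = *(const size_t *)a, j = *(const size_t *)b;
    for (size_t k = 0; k < g_len; k++) {
        unsigned char ci = (unsigned char)g_text[(i + k) % g_len];
        unsigned char cj = (unsigned char)g_text[(j + k) % g_len];
        if (ci != cj)
            return ci < cj ? -1 : 1;
    }
    return 0;
}

/* Naive BWT: writes the last column of the sorted rotations into out
 * (len bytes plus a terminating '\0') and returns the row index I of the
 * original string, or -1 if len is 0 or allocation fails. */
long bwt_naive(const char *s, size_t len, char *out)
{
    long primary = -1;
    out[0] = '\0';
    if (len == 0)
        return -1;
    size_t *rot = malloc(len * sizeof *rot);
    if (rot == NULL)
        return -1;
    for (size_t i = 0; i < len; i++)
        rot[i] = i;
    g_text = s;
    g_len = len;
    qsort(rot, len, sizeof *rot, cmp_rotation);
    for (size_t i = 0; i < len; i++) {
        /* last character of the rotation that starts at rot[i] */
        out[i] = s[(rot[i] + len - 1) % len];
        if (rot[i] == 0)
            primary = (long)i;
    }
    out[len] = '\0';
    free(rot);
    return primary;
}
```

Calling bwt_naive("abraca", 6, out) with a 7-byte buffer fills out with caraab and returns 1, matching the slide.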

Suffix sorting for BWT
- Suffix array construction
- Effect of block size, N
  - Higher N -> better compression -> slower SA construction
- Suffix array construction algorithms
  - Quicksort based, $O(N^2 \log N)$: used in bzip2
  - Larsson-Sadakane algorithm: good worst-case behavior, $O(N \log N)$, but poor cache performance

What we did
- Implemented cache-oblivious distribution sort [Frigo, Leiserson, et al.] and used it in suffix sorting. Found to be a factor of 3 slower than the qsort-based implementation.
- Developed a cache-efficient divide-and-conquer suffix sorting algorithm: $O(N^2 \lg N)$ time and 8N extra space.
- Implemented an O(n) algorithm for suffix sorting [Aluru and Ko 2003]. Found to be a factor of 2-3 slower than the most efficient suffix sorting algorithm available.

Incorporating cache-oblivious distribution sort
- Sadakane's algorithm performs sorting based on single-character comparisons
- Incorporate cache-oblivious distribution sorting of integers [1]
- Incurs $\Theta(1 + (n/L)(1 + \log_Z n))$ cache misses
[1] Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran. Cache-Oblivious Algorithms. FOCS 1999.

Algorithm
1. Partition A into $\sqrt{n}$ contiguous subarrays of size $\sqrt{n}$. Recursively sort each subarray.
2. Distribute the sorted subarrays into $q \le \sqrt{n}$ buckets $B_1, B_2, \ldots, B_q$ of sizes $n_1, n_2, \ldots, n_q$ respectively, such that
   a. $\max\{x \mid x \in B_i\} \le \min\{x \mid x \in B_{i+1}\}$ for $i \in [1, q-1]$,
   b. $n_i \le 2\sqrt{n}$.
3. Recursively sort each bucket.
4. Copy the sorted buckets back to array A.

Basic strategy
- Copy the element at position next of a subarray to bucket bnum.
- Increment bnum until we find a bucket for which the element is smaller than the pivot.
[Figure: a subarray with its next pointer, and the buckets with their pivots]

Bucket found
[Figure: the element at next has been copied into the bucket that was found]
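The copy rule from the Basic Strategy slide can be sketched for a single sorted subarray as follows. This is a simplified illustration under stated assumptions, not the authors' code: distribute_sorted, pivot and bucket_of are hypothetical names, each pivot is treated as an exclusive upper bound for its bucket, and the last bucket takes whatever remains. The recursive, cache-oblivious scheduling of the distribution step described by Frigo et al. is left out.

```c
#include <stddef.h>

/* Assign every element of a sorted subarray to a bucket.
 * pivot[b] is assumed to be an exclusive upper bound for bucket b; the last
 * bucket (b == q - 1) accepts every remaining element. bucket_of[next]
 * receives the bucket chosen for sub[next]. Because sub[] is sorted,
 * bnum only ever moves forward. Requires q >= 1. */
void distribute_sorted(const int *sub, size_t len,
                       const int *pivot, size_t q, size_t *bucket_of)
{
    size_t bnum = 0;
    for (size_t next = 0; next < len; next++) {
        /* increment bnum until the element is smaller than the pivot */
        while (bnum + 1 < q && sub[next] >= pivot[bnum])
            bnum++;
        bucket_of[next] = bnum;
    }
}
```

With the made-up inputs sub = {5, 6, 11, 15} and pivot = {6, 12, 99}, the elements land in buckets 0, 1, 1 and 2. Since bnum never moves backwards, one subarray is distributed in a single left-to-right pass over both the subarray and the bucket list.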

Performance: time in seconds by block size
Block size   128KB   256KB   512KB   1MB     2MB
Qsort        0.321   1.252   2.428   4.679   12.342
MergeSort    0.491   1.171   2.623   5.808   16.833
DistriSort   0.811   1.993   4.807   14.13   36.657

Restrictions
- mallocs caused by repeated bucket-splitting.
- Need to keep track of state information for buckets and subarrays.
- Buckets, subarrays, and copying elements back and forth incur memory management overhead.
- Splitting into $\sqrt{n} \times \sqrt{n}$ subarrays when n is not a perfect square causes rounding errors.
- Running time may not be dominated by cache misses.

Divide and conquer algorithm
- Similar to merge sort
- Suffix sorting: from right to left
- Stores match lengths to avoid repeated comparisons in the merge phase

    void sort(char *str, int *sa, int *ml, int len) {
        int mid = len / 2;
        if (len <= 2) {
            /* base case; body not shown on the slide */
        }
        sort(&str[mid], &sa[mid], &ml[mid], len - mid);  /* right half first */
        sort(str, sa, ml, mid);                          /* then the left half */
        merge(str, sa, ml, len);                         /* combine, using match lengths */
    }
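The recursion on this slide can be tried out with a stripped-down variant that keeps the merge-sort structure and the right-to-left order but leaves out the match-length array, so every merge step compares suffixes from scratch with strcmp. It is a sketch for illustration only: suffix_sort_dc, sort_range and merge_suffixes are hypothetical names, the text is assumed to be '\0'-terminated with no embedded '\0', and the suffix array holds global positions rather than the per-block local indices shown on the slides.

```c
#include <stdlib.h>
#include <string.h>

/* Merge two sorted runs of suffix positions: sa[0..mid) and sa[mid..len). */
static void merge_suffixes(const char *text, int *sa, int *tmp, int mid, int len)
{
    int i = 0, j = mid, k = 0;
    while (i < mid && j < len) {
        /* no match lengths here: each comparison restarts at the suffix start */
        if (strcmp(text + sa[i], text + sa[j]) <= 0)
            tmp[k++] = sa[i++];
        else
            tmp[k++] = sa[j++];
    }
    while (i < mid) tmp[k++] = sa[i++];
    while (j < len) tmp[k++] = sa[j++];
    memcpy(sa, tmp, (size_t)len * sizeof *sa);
}

/* Sort sa[0..len) recursively: right half first, then left, then merge. */
static void sort_range(const char *text, int *sa, int *tmp, int len)
{
    if (len < 2)
        return;
    int mid = len / 2;
    sort_range(text, sa + mid, tmp, len - mid);
    sort_range(text, sa, tmp, mid);
    merge_suffixes(text, sa, tmp, mid, len);
}

/* Fill sa[0..n) with the sorted suffix positions of text, where n counts
 * the terminating '\0' as the last suffix. Returns 0 on success and -1 on
 * allocation failure. */
int suffix_sort_dc(const char *text, int n, int *sa)
{
    if (n <= 0)
        return 0;
    int *tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        sa[i] = i;
    sort_range(text, sa, tmp, n);
    free(tmp);
    return 0;
}
```

For the text aaaaaaa with n = 8 this produces 7 6 5 4 3 2 1 0. Every strcmp in that input rescans a long run of equal characters, which is the repeated work the stored match lengths are meant to avoid.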

Divide and conquer algorithm: suffix array
Index  0  1  2  3  4  5  6  7
s      a  a  a  a  a  a  a  \0
SA     7  6  5  4  3  2  1  0

Divide and conquer algorithm: sort phase
[Animation frames for s = a a a a a a a \0 (indices 0-7). The frames show the sub-block a a a \0 with local indices 0-3, split into the pairs a a and a \0, each with local indices 0-1.]
- Pair a \0 selected: last_match_length = 0; suffix array and match lengths not yet filled
- After sorting the pair a \0: last_match_length = 0; suffix array 1 0; match lengths 0 0
- Pair a a selected: last_match_length = 0; arrays unchanged
- After sorting the pair a a: last_match_length = 2; suffix array 1 0 1 0; match lengths 0 2 0 0

Divide and conquer algorithm: merge phase
Sub-block (local indices 0 1 2 3): a a a \0
Old suffix array: 1 0 1 0
Old match lengths: 0 2 0 0
New arrays, one element added per animation frame:
  Frame   New suffix array   New match lengths
  1       3                  0
  2       3 2                0 0
  3       3 2 1              0 0 1
  4       3 2 1 0            0 0 1 2

Divide and conquer algorithm: merge phase, another example with large match lengths
s              a  a  a  a  a  a  a  \0
SA             1  0  1  0  3  2  1  0
Match lengths  0  6  0  4  0  0  1  2

Divide and conquer algorithm: analysis
- Time complexity: $O(N^2 \lg N)$
- Space complexity: 8N extra space
- $O(N \lg N)$ random I/Os

Divide and conquer algorithm: performance
- Comparison with the qsufsort algorithm for suffix sorting [1]
  - Time: $O(N \log N)$
  - Space: 2 integer arrays of size N
[1] N. Larsson and K. Sadakane. Faster Suffix Sorting. Technical Report LU_CS-TR-99-214, Dept. of Computer Science, Lund University, Sweden, 1999.

Divide and conquer vs. qsufsort: time in seconds by block size N (= data size)
Block size  256K    512K    1M      2M      4M      8M
d&c         0.25    0.541   1.222   2.854   6.439   14.39
qsufsort    0.241   0.531   1.232   2.724   6.009   13.169
- Based on the Human Genome data set
- Pentium 4, 2.4 GHz, 256MB RAM

Divide and conquer algorithm: cache performance
Block size N = 1,048,576
Input file: reut2-013.sgm (Reuters corpus) [1MB]
                       qsufsort    d&c
# data references      525,305K    1,910,425K
L1 data cache misses   14,614K     13,707K
Cache miss ratio       2.7%        0.7%
- 16KB L1 set-associative data cache

Divide and conquer algorithm: cache performance
Block size N = 1,048,576
Input file: nucall.seq (Protein sequence) [890KB]
                       qsufsort    d&c
# data references      531,201K    2,626,356K
L1 data cache misses   16,323K     12,945K
Cache miss ratio       3.0%        0.4%
- 16KB L1 set-associative data cache

Divide and conquer algorithm
- Requires no memory parameters to be set
- Good cache behavior
- Extensive testing with different types of files is needed to evaluate its utility

Linear-time construction of suffix arrays [1]
[1] Pang Ko and Srinivas Aluru. Space Efficient Linear Time Construction of Suffix Arrays. 2003.

Classify as type S or L
Pos   1  2  3  4  5  6  7  8  9  10 11 12
T     M  I  S  S  I  S  S  I  P  P  I  $
Type  L  S  L  L  S  L  L  S  L  L  L  S
- Type S: $T_i < T_{i+1}$
- Type L: $T_{i+1} < T_i$
Here $T_i$ denotes the suffix of T that starts at position i.
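The classification can be computed in one right-to-left pass, because two suffixes that start with the same character differ in the same direction as the suffixes one position later. The sketch below assumes the text ends in a unique smallest sentinel such as '$' and uses 0-based positions, while the slide counts from 1. The function name classify_suffixes is hypothetical.

```c
#include <stddef.h>

/* Label each suffix of t[0..n) as 'S' (smaller than the next suffix) or
 * 'L' (larger than the next suffix). Assumes t[n-1] is a unique smallest
 * sentinel; the sentinel suffix itself is labelled 'S', as on the slide. */
void classify_suffixes(const char *t, size_t n, char *type)
{
    if (n == 0)
        return;
    type[n - 1] = 'S';
    for (size_t i = n - 1; i-- > 0; ) {
        unsigned char a = (unsigned char)t[i];
        unsigned char b = (unsigned char)t[i + 1];
        if (a < b)
            type[i] = 'S';
        else if (a > b)
            type[i] = 'L';
        else
            type[i] = type[i + 1];  /* equal first characters: inherit the type */
    }
}
```

For MISSISSIPPI$ this yields L S L L S L L S L L L S, the Type row above.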

Sorting type S suffixes
Pos   1  2  3  4  5  6  7  8  9  10 11 12
T     M  I  S  S  I  S  S  I  P  P  I  $
Type  L  S  L  L  S  L  L  S  L  L  L  S
Order of type S suffixes: 12 8 5 2

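Sorting type S suffixes
Pos   1  2  3  4  5  6  7  8  9  10 11 12
T     M  I  S  S  I  S  S  I  P  P  I  $
Type  L  S  L  L  S  L  L  S  L  L  L  S
Dist  0  0  1  2  3  1  2  3  1  2  3  4
Lists of positions by Dist value:
  1: 9 3 6
  2: 10 4 7
  3: 5 8 11
  4: 12
Type S suffixes before sorting: 12 2 5 8
Type S suffixes after sorting: 12 8 5 2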

Bucket according to first character
Pos   1  2  3  4  5  6  7  8  9  10 11 12
T     M  I  S  S  I  S  S  I  P  P  I  $
Type  L  S  L  L  S  L  L  S  L  L  L  S
Suffix array, bucketed by first character:
  $: 12
  I: 2 5 8 11
  M: 1
  P: 9 10
  S: 3 4 6 7
Order of type S suffixes: 12 8 5 2

Obtaining the sorted order
- Bucketed suffix array: 12 2 5 8 11 1 9 10 3 4 6 7
- Sorted type S suffixes, B: 12 8 5 2
- Move the type S suffixes to the end of their buckets in the order given by B: 12 11 8 5 2 1 9 10 3 4 6 7
- Scan left to right and move type L suffixes to the front of their buckets: 12 11 8 5 2 1 10 9 7 4 6 3
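The two moves on this slide can be sketched as a single pass that fills an initially empty array instead of rearranging a bucketed one. This is an illustrative variant and not the space-efficient in-place procedure of Ko and Aluru. The function name induce_from_sorted_s and its parameters are hypothetical, positions are 0-based, and the sorted list of type S suffixes is taken as an input.

```c
/* Build the suffix array of t[0..n) from the sorted type S suffixes.
 * type[i] is 'S' or 'L'; sorted_s[0..ns) lists the type S suffix positions
 * in sorted order. Entries of sa are -1 until they are filled in. */
void induce_from_sorted_s(const char *t, int n, const char *type,
                          const int *sorted_s, int ns, int *sa)
{
    int count[256] = {0}, head[256], tail[256];
    int sum = 0;

    /* bucket boundaries by first character */
    for (int i = 0; i < n; i++)
        count[(unsigned char)t[i]]++;
    for (int c = 0; c < 256; c++) {
        head[c] = sum;        /* first slot of bucket c */
        sum += count[c];
        tail[c] = sum;        /* one past the last slot of bucket c */
    }
    for (int i = 0; i < n; i++)
        sa[i] = -1;

    /* place type S suffixes at the ends of their buckets; walking the
     * sorted list backwards keeps them in sorted order inside each bucket */
    for (int k = ns - 1; k >= 0; k--) {
        int c = (unsigned char)t[sorted_s[k]];
        sa[--tail[c]] = sorted_s[k];
    }

    /* scan left to right: if the suffix just before sa[i] is type L,
     * put it at the current front of its bucket */
    for (int i = 0; i < n; i++) {
        int j = sa[i];
        if (j > 0 && type[j - 1] == 'L') {
            int c = (unsigned char)t[j - 1];
            sa[head[c]++] = j - 1;
        }
    }
}
```

With t = MISSISSIPPI$, type = LSLLSLLSLLLS and sorted_s = {11, 7, 4, 1} (the slide's 12 8 5 2 minus one), the result is 11 10 7 4 1 0 9 8 6 3 5 2, which is the slide's final row shifted to 0-based positions.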

Implementation results
File size   Linear algo   qsufsort
8KB         0.02s         0.005s
16KB        0.047s        0.01s
100KB       0.263s        0.06s
500KB       1.352s        0.551s
1MB         2.963s        1.292s
2MB         6.963s        2.894s
- File used: genome chromosome sequence
- Block size = size of file

Performance
[Chart: time in seconds against file size, 8KB to 2MB, for Linear Algo and qsufsort]
File size     8KB     16KB    100KB   500KB   1MB     2MB
Linear Algo   0.02    0.047   0.263   1.352   2.963   6.963
qsufsort      0.005   0.01    0.06    0.551   1.292   2.894

Observations
- Uses 3 integer arrays of size n and 3 boolean arrays (2 of size n, 1 of size n/2).
- Gives rise to 12n bytes plus 2.5n bits, compared with the 8n bytes used by Manber and Myers' $O(n \log n)$ algorithm.
- Trade-off between time and space.
- Implementation still crude. Further optimizations possible.
- An extra integer array that stores the reverse positions of the suffix array in the string improves performance.

Conclusions
- The cache-oblivious distribution-sort-based suffix sorting incurs memory management overheads. Factor of 3 to 4 slower than the qsort-based approach.
- Our divide-and-conquer algorithm is cache-efficient and requires no memory parameters to be set. $O(N^2 \lg N)$ time and 8N extra space.
- The linear-time suffix sorting algorithm's performance can be improved further. Requires more space. Factor of 2 slower than qsufsort.