Parallel Multiway LCP-Mergesort


Bachelor Thesis

Parallel Multiway LCP-Mergesort

Andreas Eberle

Published: 2014/05/15

Supervisor: Prof. Dr. Peter Sanders, Dipl.-Inform. Timo Bingmann

Institute of Theoretical Informatics, Algorithmics II
Department of Informatics
Karlsruhe Institute of Technology

I hereby declare that I have written this thesis independently and have used no sources or aids other than those stated, that I have marked all passages taken verbatim or in substance from other works as such, and that I have observed the statutes of the Karlsruhe Institute of Technology for safeguarding good scientific practice in their currently valid version.

Place, date

Abstract

In this bachelor thesis, multiway LCP-Merge is introduced, parallelized and applied to create a fully parallel LCP-Mergesort, as well as a NUMA-optimized pS⁵. As an advancement of binary LCP-Mergesort, a multiway LCP-aware tournament tree is introduced and parallelized. For dynamic load balancing, one well-known and two new strategies for splitting merge work packages are utilised. Besides the introduction of fully parallel multiway LCP-Mergesort, further focus is put on NUMA architectures. Thus, parallel Super Scalar String Sample Sort (pS⁵) is adapted to the special properties of these systems by utilising the parallel LCP-Merge. Moreover, this yields an efficient and generic approach for parallelizing arbitrary sequential string sorting algorithms and for making parallel algorithms NUMA-aware. Several optimizations, important for practical implementations, as well as comprehensive experiments on two current NUMA platforms, are then reported and discussed. The experiments show the good scalability of the introduced algorithms and especially the great improvements of NUMA-aware pS⁵ with real-world input sets on modern machines.

Zusammenfassung (German abstract, in translation)

In this bachelor thesis, a multiway LCP-Merge is introduced, parallelized and applied to build a parallel LCP-Mergesort as well as a NUMA-optimized pS⁵ implementation. As an advancement of binary LCP-Mergesort, a multiway LCP-aware tournament tree is introduced and parallelized. To split the work packages, which is required for dynamic load balancing, one known and two newly introduced strategies are used. Besides the introduction of a parallelized LCP-Mergesort, further focus is put on NUMA architectures. In this context, parallel Super Scalar String Sample Sort (pS⁵) is adapted to the special properties of these systems by applying the parallel LCP-Merge. In addition, this leads to an efficient and generic approach for parallelizing sequential sorting algorithms and for extending already parallel algorithms with NUMA awareness. Furthermore, several optimizations that are important for practical implementations, as well as extensive experiments on two current NUMA platforms, are explained and discussed. Using realistic input data, the experiments demonstrate the good scalability of the presented algorithms and especially the enormous improvements of the pS⁵ algorithm on NUMA systems.

Acknowledgement

My thanks go to Prof. Dr. Peter Sanders and my advisor Timo Bingmann for giving me the opportunity to work on the interesting subject of sorting strings in parallel. Not only is it a subject of growing importance, but also an interesting algorithmic problem to solve. Many thanks go to Valentin Zickner for all the coffee drinking sessions and the many important discussions, on- and off-topic, as well as for much advice regarding LaTeX. Moreover, I want to thank Valentin, but also Katja Leppert, Joachim Lusiardi and Katharina Huber for proofreading this thesis and giving me valuable input to improve it further. I would also like to thank my parents and brothers for all their support. Especially, I want to thank my older brother Christian Eberle, who has not only been a great inspiration throughout my life, but was also the one who introduced me to the world of computer programming.

Contents

1. Introduction
   1.1. Contributions of this Bachelor Thesis
   1.2. Structure of this Bachelor Thesis
2. Preliminaries
   2.1. Notation and Pseudo-code
   2.2. Existing Sorting Algorithms
      2.2.1. LCP-Mergesort by Waihong Ng
      2.2.2. pS⁵ by Timo Bingmann
3. Parallel Multiway LCP-Mergesort
   3.1. Binary LCP-Mergesort
      3.1.1. LCP-Compare
      3.1.2. Binary LCP-Merge and Binary LCP-Mergesort
      3.1.3. Computational Complexity of Binary LCP-Mergesort
   3.2. K-Way LCP-Merge
      3.2.1. Simple Tournament Tree
      3.2.2. LCP-Aware Tournament Tree
      3.2.3. K-Way LCP Tournament Tree Example
   3.3. Parallelization of K-Way LCP-Merge
      3.3.1. Classical Splitting with Binary Search for Splitters
      3.3.2. Binary Splitting
      3.3.3. Splitting by LCP Level
4. Implementation Details
   4.1. Tournament Tree and K-Way LCP-Merge
      4.1.1. Ternary Comparison
      4.1.2. Memory Layout of LCP Tournament Tree
      4.1.3. Caching Distinguishing Characters
   4.2. Parallelization of K-Way LCP-Merge
   4.3. Parallel K-Way LCP-Mergesort
   4.4. NUMA Optimized pS⁵
   4.5. Further Improvements
      4.5.1. Improved Binary Search
      4.5.2. K-Way LCP-Merge with Multi-Character Caching
5. Experimental Results
   5.1. Experimental Setup
   5.2. Input Datasets
   5.3. Performance of Splitting Methods
      5.3.1. Splitting Analysis on Sorting 302 MiB Sinha DNA
      5.3.2. Splitting Analysis on Sorting 20 GiB URLs
   5.4. Performance of Parallel Algorithms
6. Conclusions
   6.1. Future Work
A. Absolute Runtimes of Parallel Algorithms

List of Figures

1. NUMA architecture with m = 4 NUMA nodes and p = 16 cores
2. Memory bandwidth for accessing NUMA memory on IntelE5
3. Structure of string sequence S with associated LCP array H
4. Illustration of case 2 of LCP-Compare with h_a < h_b
5. Structure of simple tournament tree with K = 4
6. Structure of LCP tournament tree with input and output sequences, K = 4
7. Binary Odd Even Tree with K = 8
8. LCP-aware tournament tree example: part 1
9. LCP-aware tournament tree example: part 2
10. LCP-aware tournament tree example: part 3
11. LCP-aware tournament tree example: part 4
12. LCP-aware tournament tree example: part 5 with winner path P (red)
13. LCP-aware tournament tree example: part 6
14. LCP-aware tournament tree example: part 7
15. Splitting of three input sequences with splitters c, bb and cdd
16. String sequence with LCP level (red line)
17. Different memory layouts of an LCP-aware tournament tree
18. LCP-aware tournament tree with K = 4 plus LCP and character caching
19. Scheme of Parallel K-way LCP-Mergesort
20. Scheme of NUMA optimized pS⁵
21. Analysis of splitting algorithms on IntelE5 sorting 302 MiB Sinha DNA
22. Analysis of splitting algorithms on AMD48 sorting 302 MiB Sinha DNA
23. Analysis of splitting algorithms on IntelE5 sorting 20 GiB URLs
24. Analysis of splitting algorithms on AMD48 sorting 20 GiB URLs
25. Speedup of parallel algorithm implementations on IntelE5
26. Speedup of parallel algorithm implementations on AMD48

List of Tables

1. Hardware characteristics of experimental platforms
2. Name and description of tested parallel string sorting algorithms
3. Characteristics of the selected input instances
4. Absolute runtime of parallel algorithms on IntelE5
5. Absolute runtime of parallel algorithms on AMD48

List of Algorithms

1. LCP-Compare
2. Binary LCP-Merge
3. Binary LCP-Mergesort
4. K-Way-LCP-Merge
5. Classical Splitting
6. LCP-Compare with Character Caching
7. Improved Binary Search
8. String-Compare
9. LCP-Compare Caching w Characters

1. Introduction

With the digital age, ever larger amounts of data arise. Structuring, evaluating and analysing this volume of data is a task of growing importance and difficulty. However, the basic algorithms needed to do this have been known and used for years. With many of them requiring sorting data and merging results, it is quite comprehensible that sorting is one of the most studied algorithmic problems in computer science, but nonetheless still of great interest.

Although the simplest sorting model assumes atomic keys, sorting strings lexicographically and merging sorted sequences of strings is required by many algorithms important for today's applications. Examples relying on string sorting range from MapReduce tools and databases over some suffix sorters to Big Data analysis tools and many more. In contrast to atomic keys, strings can be seen as arrays of atomic keys, which leads to a larger computational complexity for string sorting. This is why it is very important to exploit the structure of keys to avoid repeated costly work on entire strings.

Even though there is a large amount of work on sequential string sorting, only little work has been done to parallelize it. But as nowadays the only way to gain from Moore's law is to use parallelism, all performance-critical algorithms need to be parallelized. However, with the first parallel sorting algorithms available, new challenges arise. As the amount of available memory on modern many-core systems grows, non-uniform memory access (NUMA) architectures become more common. Curiously, although increased main memory sizes reduce the need for external sorting algorithms on the one hand, NUMA systems induce varying main memory access times, thus making it necessary to apply external sorting algorithm schemes to in-memory implementations. As a result, it is much more important to maximize the efficiency of memory accesses on NUMA systems. Exploiting known longest common prefixes (LCPs) when merging strings allows skipping over already considered parts of them, which reduces memory accesses.
Merging sequences of strings with their according LCP information is an intuitive idea, and Ng and Kakehi [NK08] already introduced a binary LCP-aware merge sort, but no multiway implementation was found. However, as our NUMA systems currently have two, four or eight NUMA nodes, a multiway merge is required to prevent unnecessary memory operations. Moreover, an efficient multiway LCP-aware merge allows improving current sequential and parallel merge sort implementations, possibly making them competitive with currently faster algorithms. Especially for input sets with long average LCPs, this implementation could outperform others.

1.1. Contributions of this Bachelor Thesis

As the first step of this work, LCP-Mergesort, initially presented by Ng [NK08], will be redefined to improve comprehensibility of the newly presented algorithms based on it. As Ng only showed an average case analysis, the worst case computational complexity of LCP-Mergesort will be analysed. With the goal of creating a fully parallel LCP-aware merge sort implementation, Ng's binary LCP-Merge algorithm is extended and a K-way LCP-aware tournament tree is introduced. This tournament tree is independently usable for merging K sorted sequences of strings with associated LCP information. Furthermore, parallel K-way LCP-Merge and the resulting fully parallel K-way LCP-Mergesort are presented. Additionally, a common algorithm for splitting the merge problem is adapted and a completely new one is presented.

Since we want to improve practical applications, it is of great importance to consider real hardware architectures and the optimizations required by them. Additionally, it is important that these algorithms do not just achieve good theoretical results, but can really improve practical runtimes. Therefore we implemented our newly presented parallel LCP-Merge and LCP-Mergesort with three different splitting procedures. Furthermore, the parallel sorting algorithm pS⁵ of Timo Bingmann [BS13] will be improved for NUMA architectures by exploiting the properties of K-way LCP-Merge. In order to evaluate the presented algorithms, they will be compared with existing parallel string sorting implementations like the original pS⁵. To allow examination of the degree of parallelism, not just runtimes but also speedups of different algorithms are reviewed.

1.2. Structure of this Bachelor Thesis

Section 2 gives an overview of the notation used and of existing algorithms. Whereas Ng's LCP-Mergesort is the basis for this work, Bingmann's Parallel Super Scalar String Sample Sort serves as a reference, being one of the fastest parallel string sorters. In Section 3, binary LCP-Mergesort is redefined and multiway LCP-Merge as well as multiway LCP-Mergesort are introduced. Moreover, a proof of the upper bound of binary LCP-Mergesort's runtime is provided. Section 4 focuses on implementation details of the newly presented algorithms in order to improve their practical performance even further. The performance of the resulting C++ implementations is evaluated in Section 5, where speedup factors and runtimes of various variants and algorithms are compared. Finally, a summary of the results and an outlook on future work are given in Section 6.

2. Preliminaries

A set $S = \{s_1, \ldots, s_n\}$ of n strings of total length $N = \sum_{i=1}^{n} |s_i|$ is our input. A string s is a one-based array of characters from the alphabet $\Sigma = \{1, \ldots, \sigma\}$. The length of a string s or of any arbitrary array a is given by $|s|$ and $|a|$, respectively, and the i-th element of an array a is accessed via $a[i]$. On the alphabet $\Sigma$ we assume the canonical ordering relation < with $1 < 2 < \ldots < \sigma$. Likewise, for strings we assume the lexicographical ordering relation <, and our goal is to sort the strings of the given input sequence S lexicographically. For indicating the end of strings, our algorithms require strings to be zero-terminated, meaning $s[|s|] = 0 \notin \Sigma$, which however can be replaced by any other end-of-string convention.

With the length of the distinguishing prefix D, denoting the minimum number of characters to be inspected to establish the lexicographic ordering of S, there is a natural lower bound for string sorting. More precisely, for sorting based on character comparisons, we get the lower bound of $\Omega(D + n \log n)$, whereas string sorting based on an integer alphabet can be achieved in $\Omega(D)$ time.

Because sets of strings are usually represented as arrays of pointers to the beginning of each string, there is an additional indirection when accessing a string character. This generally causes a cache fault on every string access, even during linear scanning of an array of strings. Therefore a major difference of string sorting in comparison to atomic sorting is the lack of efficient scanning.

Our algorithms are targeted at shared memory systems supporting p processing elements or hardware threads on $\Theta(p)$ cores. Additionally, some algorithms and optimizations are specially targeted at non-uniform memory access (NUMA) systems, also providing p hardware threads on $\Theta(p)$ cores. However, the p hardware threads are equally divided onto m NUMA nodes, each having fast direct access to local memory and slower access to remote memory via an interconnect bus system. Due to the NUMA architecture, costs of memory accesses across NUMA nodes are much higher and therefore need to be avoided.
Figure 1 illustrates a NUMA architecture with m = 4 NUMA nodes and p = 16 cores. Whereas the cores p0, p4, p8 and p12, belonging to NUMA node 0, have fast access to the local Memory 0, remote access to the memories of nodes 1, 2 and 3 is much slower.

Figure 1: NUMA architecture with m = 4 NUMA nodes and p = 16 cores. Each NUMA node k has its own memory (Memory k) and four cores: node 0 holds p0, p4, p8, p12; node 1 holds p1, p5, p9, p13; node 2 holds p2, p6, p10, p14; node 3 holds p3, p7, p11, p15.

Figure 2: Memory bandwidth for accessing NUMA memory on IntelE5. The plot shows the bandwidth in GiB/s (logarithmic axis) over the number of threads (1 to 64), with one curve each for hopcount = 0, hopcount = 1 and hopcount = 2. See Table 1 in Section 5.1 (page 45) for the exact hardware specification.

This behaviour can be examined in Figure 2, showing the memory bandwidth achieved by the given number of threads when linearly reading 64-bit values from a memory area which is equally segmented onto all NUMA nodes. The curves show the memory bandwidth over the available threads when only accessing the memory on the NUMA node that is exactly hopcount steps away. Therefore a thread running on NUMA node n will solely access the memory of node $(n + \mathit{hopcount}) \bmod m$. The figure clearly shows the tremendous gap in bandwidth between accessing the local NUMA memory (hopcount = 0) and accessing the other nodes' memories (hopcount = 1 or hopcount = 2). Since sorting mostly requires read operations, the performance of write operations is not displayed here. However, for write operations, a further slowdown is experienced when accessing the memory positioned farthest away (hopcount = 2) in comparison to accessing a direct neighbour node. More information on pmbw, the tool used for creating the measurements of Figure 2, can be found at http://panthema.net/2013/pmbw/. For these tests, the NUMA branch of pmbw has been used to test the performance of the function ScanRead64PtrSimpleLoop.

2.1. Notation and Pseudo-code

To describe the algorithms presented in this paper, we chose a tuple pseudo-code language, combining array manipulation, mathematical set notation and Pascal-like control flow. Ordered sequences are written like arrays using square brackets [x, y, ...], and + is extended to also concatenate arrays. Neither arrays nor variables are declared beforehand, so A[3] := 4 defines an array A and assigns 4 to the third position, as array indexes are counted from 1 to $|A|$, the length of the array.
An example of the powerful expressions possible with this pseudo-code language is the following definition: $A := [(k, \exp(i \cdot \frac{k\pi}{2})) \mid k \in \{0, 1, 2, 3\}]$, specifying A to be an array of the pairs $[(0, 1), (1, i), (2, -1), (3, -i)]$.

In order to avoid many special cases, we use the following sentinels: $\epsilon$ is the empty string, being lexicographically smaller than any other string, $\infty$ is the character or string which is larger than any other, and $\bot$ is the symbol for undefined variables.

Furthermore, for arrays s and t, let the symmetric function $\mathrm{lcp}(s, t)$ denote the length of the longest common prefix (LCP) of s and t. Thus, for one-based arrays, the LCP value denotes the last index where s and t equal each other, whereas at index $\mathrm{lcp}(s, t) + 1$, s and t differ, if that position exists. Based on that, $\mathrm{lcp}_X(i)$ is defined to be $\mathrm{lcp}(X[i-1], X[i])$ for an ordered sequence X. Accordingly, the associated LCP array $H = [\bot, h_2, \ldots, h_n]$ of a sorted string sequence $S = [s_1, \ldots, s_n]$ is defined as $h_i = \mathrm{lcp}_S(i) = \mathrm{lcp}(S[i-1], S[i])$. Additionally, for any string s, we define $\mathrm{lcp}(\epsilon, s) = 0$ to be the LCP to the empty string $\epsilon$.

Figure 3 shows the structure of a string sequence and how its corresponding LCP array is calculated. Furthermore, Figure 3b illustrates the LCP array for the example string sequence S = [aab, aacd, aacd, bac, bacd, bbac].

Figure 3: Structure of string sequence S with associated LCP array H.

(a) Structural view
    S:  s_1   s_2             s_3             s_4             ...   s_n
    H:  ⊥     lcp(s_1, s_2)   lcp(s_2, s_3)   lcp(s_3, s_4)   ...   lcp(s_{n-1}, s_n)

(b) Exemplary configuration
    S:  aab   aacd   aacd   bac   bacd   bbac
    H:  ⊥     2      4      0     3      1

As the sum of all elements (excluding the first) of an LCP array H will often be used, we define $L(H) = \sum_{i=2}^{n} H_i$, or just L if H is clear from the context. The sum of the distinguishing prefixes D and the sum of the LCP array H are related, but not identical. Whereas D is the sum of the distinguishing prefixes, L only counts the length of LCPs and also misses the length for the first string, leading to $D \geq L$. In the example shown in Figure 3b, we have L = 2 + 4 + 0 + 3 + 1 = 10, whereas D = 3 + 3 + 5 + 1 + 4 + 2 = 18.

2.2. Existing Sorting Algorithms

To begin with, an overview of existing sorting algorithms is presented. Although there exists a wide range of sorting algorithms, this section focuses on two of them, which are essential preliminary work for this thesis. LCP-aware merge sort has been introduced by Waihong Ng in [NK08] and is the basis of this work. Timo Bingmann's pS⁵ [BS13] is a parallel string sorting algorithm that achieved great results in previous experiments and will be further optimized by making it NUMA-aware.
More algorithms can be found in [BES14] and [BS13], including but not limited to multikey quicksort, MSD radix sort, burstsort, sample sort and insertion sort.

2.2.1. LCP-Mergesort by Waihong Ng

LCP-Mergesort is a string sorting algorithm introduced by Waihong Ng and Katsuhiko Kakehi [NK08]. It calculates and reuses the LCPs of sorted sub-problems to speed up string sorting. Ng's binary LCP-Mergesort is redefined in more detail in Section 3.1. As part of that section, the worst case computational complexity of LCP-Mergesort will be shown to be in $O(n \log n + L)$. Later, LCP-Mergesort's basic step LCP-Compare will be reused as a fundamental part of the new parallel K-Way-LCP-Merge algorithm presented in this bachelor thesis.

A parallelized version of Ng's binary LCP-Mergesort has been developed by Nagaraja Shamsundar [Sha09]. The basic idea is to run instances of binary LCP-Mergesort on every thread for subsets of the input strings. As soon as two threads have finished their work, their sorted result sequences are merged together sequentially. Whenever another thread finishes (and no other thread is currently merging with the output sequence), its sequence is sequentially merged with the output sequence. However, since the final merging is done sequentially, only the sorting of the sequences is parallelized.

2.2.2. pS⁵ by Timo Bingmann

Parallel Super Scalar String Sample Sort (pS⁵), introduced by Timo Bingmann and Peter Sanders [BS13], is a parallelized version of S⁵, designed to make use of the features of modern many-core systems, which have individual cache levels but relatively few and slow memory channels. The S⁵ algorithm is based on sample sort, and preliminary results can be found in the bachelor thesis of Sascha D. Knöpfle [Knö12]. Parallel S⁵ uses three different sub-algorithms depending on the size of subsets of the input strings. Whereas for large subsets a sequential S⁵ implementation is used, medium-sized inputs are sorted with caching multikey quicksort, which itself internally applies insertion sort as base case sorter. In Section 4.4, our new parallel K-Way-LCP-Merge algorithm is used to improve the performance of pS⁵ even further on NUMA systems.
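To make the LCP array notation of Section 2.1 concrete, the following C++ sketch computes H and L(H) for a sorted sequence. It is only an illustration under a few assumptions: strings are held as std::string instead of zero-terminated character arrays, indexes are zero-based, the undefined first entry of H is stored as 0, and the function names lcp, lcp_array and lcp_sum are chosen for this example only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Length of the longest common prefix of s and t.
std::size_t lcp(const std::string& s, const std::string& t) {
    std::size_t h = 0;
    while (h < s.size() && h < t.size() && s[h] == t[h]) ++h;
    return h;
}

// LCP array H of a sorted sequence S, with H[i] = lcp(S[i-1], S[i]).
// Assumption of this sketch: the undefined first entry is stored as 0.
std::vector<std::size_t> lcp_array(const std::vector<std::string>& S) {
    assert(std::is_sorted(S.begin(), S.end()));  // H is only defined for sorted input
    std::vector<std::size_t> H(S.size(), 0);
    for (std::size_t i = 1; i < S.size(); ++i) H[i] = lcp(S[i - 1], S[i]);
    return H;
}

// L(H): sum of all LCP entries except the first one.
std::size_t lcp_sum(const std::vector<std::size_t>& H) {
    if (H.empty()) return 0;
    return std::accumulate(H.begin() + 1, H.end(), std::size_t{0});
}
```

For the sequence [aab, aacd, aacd, bac, bacd, bbac], whose strings are picked here so that they reproduce the LCP values 2, 4, 0, 3, 1 of the example in Figure 3, lcp_array yields [0, 2, 4, 0, 3, 1] and lcp_sum yields 10, in line with L = 10 stated above.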

3. Parallel Multiway LCP-Mergesort

Starting with the basic components, this section introduces a parallel multiway LCP-Merge algorithm, usable for easier parallelization of sorting algorithms. Moreover, as a direct application, parallel multiway LCP-Mergesort will be introduced. Based on that, in Section 4 the parallel multiway LCP-Merge is used for implementing a NUMA-aware version of pS⁵ and more.

3.1. Binary LCP-Mergesort

LCP-Merge is a string merging algorithm introduced by Ng and Kakehi [NK08]. By utilizing the longest common prefixes of strings, it is possible to reduce the number of needed character comparisons. As Ng and Kakehi show in their paper, this leads to an average complexity of $O(n \log n)$ for string Mergesort, using the given LCP-Merge. Preceding the complexity proof, this section focuses on reformulating LCP-Merge and explicitly defining its comparison step LCP-Compare. Since these steps are fundamental parts of the following work, a rather verbose specification is used. This not only allows an easier reuse of the code in later parts but also helps to visualize the proof of computational complexity.

3.1.1. LCP-Compare

LCP-Compare is the basic LCP-aware comparison step used in all algorithms presented in this work. It is a replacement for the standard string comparison function, which usually iterates over the characters of a string until a mismatch is found. In order to improve runtime, LCP-Compare exploits the longest common prefixes calculated in previous steps.

As shown in Algorithm 1, LCP-Compare takes two strings $s_a$ and $s_b$ and the corresponding LCPs $h_a$ and $h_b$ to calculate the sort order of $s_a$ and $s_b$, as well as $\mathrm{lcp}(s_a, s_b)$. The given LCPs $h_i$ need to be the LCPs of their string $s_i$ with a third common, lexicographically smaller string. Therefore there must be a string p with $p \leq s_i$ and $h_i = \mathrm{lcp}(p, s_i)$ where $i \in \{a, b\}$. Figure 4 visualizes the input parameters of LCP-Compare and their relation to the common predecessor p. In Figure 4 it is assumed that $h_a = \mathrm{lcp}(p, s_a) < \mathrm{lcp}(p, s_b) = h_b$.
In this situation no characters need to be compared, since the lexicographical order can be calculated solely from the LCPs: let $y = s_a[h_a + 1]$ and $x = p[h_a + 1]$ be the distinguishing characters of $s_a$ and p. Due to the precondition $p \leq s_a$ and the definition of LCPs, we do not just know $x \neq y$ but also $x < y$. However, due to $h_b > h_a$, we further know the distinguishing characters of $s_a$ and $s_b$ to be y and $x = s_b[h_a + 1] = p[h_a + 1]$, which leads to the conclusion $s_b < s_a$.

In order to effectively calculate the sort order and LCP of $s_a$ and $s_b$, LCP-Compare differentiates three main cases:

Case 1: If both LCPs $h_a$ and $h_b$ are equal, the first $h_a = h_b$ characters of all three strings p, $s_a$ and $s_b$ are equal. In order to find the distinguishing characters of $s_a$ and $s_b$, the strings need to be compared starting at position $h_a + 1$. This is done by the loop in line 3. With the distinguishing character found by the loop, the sort order can be determined. Additionally, since the loop stops at the first differing position $h'$, $\mathrm{lcp}(s_a, s_b) = h' - 1$ is inherently calculated as a by-product.

Case 2: If h_a < h_b, as shown in Figure 4, the first h_a characters of the three strings p, s_a and s_b are equal. Because h_a and h_b are the LCPs to the common predecessor p, the characters at index h_a + 1 are the distinguishing characters between s_a and s_b. From p < s_i and h_a < h_b it follows that p[h_a + 1] = s_b[h_a + 1] and p[h_a + 1] < s_a[h_a + 1]. This results in s_b[h_a + 1] < s_a[h_a + 1] and therefore s_b < s_a.

Figure 4: Illustration of case 2 of LCP-Compare with h_a < h_b. Input: s_a, s_b, h_a = lcp(p, s_a), h_b = lcp(p, s_b) with p ≤ s_a and p ≤ s_b.

Case 3: If h_a > h_b, the same arguments as in case 2 can be applied symmetrically.

Algorithm 1 combines these observations to construct LCP-Compare, the basic step of LCP-Mergesort and of the later introduced K-Way-LCP-Merge. The three distinct cases from above, being the basic parts of LCP-Compare, can be seen in lines 1, 7 and 8, whereas the character comparison loop can be found in line 3. To be able to use LCP-Compare for Binary LCP-Merge and LCP-Mergesort but also for K-Way-LCP-Merge, the function is written in a rather generic way. That is why the caller has to specify the values a and b as keys identifying the given strings s_a and s_b. Furthermore, LCP-Compare does not return the ordered input strings, but the keys w, l ∈ {a, b} and the corresponding LCPs h_w, h_l, so that p ≤ s_w ≤ s_l, h_w = lcp(p, s_w) and h_l = lcp(s_w, s_l).

Algorithm 1: LCP-Compare
Input: (a, s_a, h_a) and (b, s_b, h_b), with s_a, s_b two strings and h_a, h_b the corresponding LCPs; assume a string p with p ≤ s_a and p ≤ s_b, so that h_a = lcp(p, s_a) and h_b = lcp(p, s_b).
1 if h_a = h_b then                                   // Case 1: LCPs are equal
2     h' := h_a + 1
3     while s_a[h'] ≠ 0 & s_a[h'] = s_b[h'] do        // Execute character comparisons
4         h'++                                        // Increase LCP
5     if s_a[h'] ≤ s_b[h'] then return (a, h_a, b, h' − 1)   // Case 1.1: s_a ≤ s_b
6     else return (b, h_b, a, h' − 1)                 // Case 1.2: s_a > s_b
7 else if h_a < h_b then return (b, h_b, a, h_a)      // Case 2: s_a > s_b
8 else return (a, h_a, b, h_b)                        // Case 3: s_a < s_b
Output: (w, h_w, l, h_l) where {w, l} = {a, b} with p ≤ s_w ≤ s_l, h_w = lcp(p, s_w) and h_l = lcp(s_w, s_l)

Since strings are indexed from one and h' marks the position of the distinguishing character when the loop in line 3 ends, the LCP returned in case 1 is h' − 1.
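The three cases of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the function name `lcp_compare`, the use of zero-based Python strings (so the first compared position is h_a rather than h_a + 1) and the length checks replacing the 0-terminator are assumptions of the sketch.

```python
def lcp_compare(a, s_a, h_a, b, s_b, h_b):
    """Return (w, h_w, l, h_l): winner key, lcp(p, s_w), loser key, lcp(s_w, s_l).

    Assumes a common predecessor p with p <= s_a and p <= s_b,
    h_a = lcp(p, s_a) and h_b = lcp(p, s_b). Strings are zero-based.
    """
    if h_a == h_b:
        # Case 1: the first h_a characters agree, compare from position h_a.
        h = h_a
        limit = min(len(s_a), len(s_b))
        while h < limit and s_a[h] == s_b[h]:
            h += 1
        # s_a <= s_b if s_a ended first or its distinguishing character is smaller.
        if h == len(s_a) or (h < len(s_b) and s_a[h] < s_b[h]):
            return a, h_a, b, h      # Case 1.1
        return b, h_b, a, h          # Case 1.2
    if h_a < h_b:
        # Case 2: s_b agrees longer with p, hence s_b < s_a; no characters touched.
        return b, h_b, a, h_a
    # Case 3: symmetric to case 2.
    return a, h_a, b, h_b
```

With p = "aaa", the call `lcp_compare("x", "acb", 1, "y", "aab", 2)` falls into case 2 and is decided without reading a single character.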

3.1.2. Binary LCP-Merge and Binary LCP-Mergesort

Based on LCP-Compare, LCP-Merge is given in Algorithm 2. The algorithm takes two sorted sequences of strings S_1 and S_2 and their LCP arrays H_1 and H_2 to calculate the combined sorted sequence S_0 with its LCP array H_0.

Algorithm 2: Binary LCP-Merge
Input: S_1 and S_2: two sorted sequences of strings, H_1 and H_2: the corresponding LCP arrays; assume S_1[|S_1|] = S_2[|S_2|] = ∞
1 i_0 := 1, i_1 := 1, i_2 := 1
2 h_1 := 0, h_2 := 0      // Invariant: h_k = lcp(S_k[i_k], S_0[i_0 − 1]), k ∈ {1, 2}
3 while i_1 + i_2 < |S_1| + |S_2| do      // Loop over all input elements
4     (w, ·, l, h') := LCP-Compare(1, S_1[i_1], h_1, 2, S_2[i_2], h_2)
5     (S_0[i_0], H_0[i_0]) := (S_w[i_w], h_w)
6     i_w++, i_0++
7     (h_w, h_l) := (H_w[i_w], h')      // Re-establish invariant
Output: S_0: sorted sequence containing S_1 ∪ S_2; H_0: the corresponding LCP array

Like a usual merging algorithm, the loop in line 3 of Algorithm 2 iterates as long as there are any elements left in S_1 or S_2. During each iteration, the two current strings of the sequences are compared (line 4), the lexicographically smaller one is written to the output sequence (line 5), and the indexes of the output sequence and of the sequence with the smaller element are increased (line 6).

In contrast to these common steps, LCP-Merge uses LCP-Compare instead of a usual string comparison and stores the LCP value of the winner in the output LCP array H_0. This is important for the later LCP-Mergesort implementation, since further LCP-Merge steps also require valid LCP arrays for their input sequences. The LCP value of the loser, which is calculated by LCP-Compare, is stored in a local variable and used for the next iteration.

The loop invariant, given in line 2, ensures that LCP-Compare can be applied. However, because the invariant can only be relied on after the first iteration, LCP-Compare's preconditions must be checked separately for the first iteration. This means that the passed LCP values h_1 and h_2 need to refer to a common lexicographically smaller string p.
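As an illustration of Algorithm 2, and of the recursive sort that is built on top of it, the following Python sketch merges two sorted lists together with their LCP arrays. It is a hedged sketch under stated assumptions: the names `lcp_merge` and `lcp_mergesort` are hypothetical, indexing is zero-based, exhausted inputs are detected by length checks instead of ∞ sentinels, and the LCP array convention H[0] = 0, H[i] = lcp(S[i − 1], S[i]) is assumed.

```python
def lcp_compare(a, s_a, h_a, b, s_b, h_b):
    """LCP-Compare on zero-based strings; returns (w, h_w, l, h_l)."""
    if h_a != h_b:
        return (b, h_b, a, h_a) if h_a < h_b else (a, h_a, b, h_b)
    h = h_a
    while h < len(s_a) and h < len(s_b) and s_a[h] == s_b[h]:
        h += 1
    if h == len(s_a) or (h < len(s_b) and s_a[h] < s_b[h]):
        return a, h_a, b, h
    return b, h_b, a, h


def lcp_merge(S1, H1, S2, H2):
    """Merge two sorted string lists and their LCP arrays."""
    S, H = (None, S1, S2), (None, H1, H2)   # keys 1 and 2 as in Algorithm 2
    i = [0, 0, 0]                           # read positions i_1, i_2
    h = [0, 0, 0]                           # h[k] = lcp(S_k[i_k], last output); p = "" at start
    out_s, out_h = [], []
    while i[1] < len(S1) and i[2] < len(S2):
        w, h_w, l, h_l = lcp_compare(1, S1[i[1]], h[1], 2, S2[i[2]], h[2])
        out_s.append(S[w][i[w]])
        out_h.append(h_w)
        i[w] += 1
        h[l] = h_l                          # loser: LCP to the new last output
        if i[w] < len(S[w]):
            h[w] = H[w][i[w]]               # winner: next LCP from its own array
    for k in (1, 2):                        # copy the remainder of the non-empty input
        if i[k] < len(S[k]):
            out_s.append(S[k][i[k]])
            out_h.append(h[k])
            out_s.extend(S[k][i[k] + 1:])
            out_h.extend(H[k][i[k] + 1:])
    return out_s, out_h


def lcp_mergesort(S):
    """Sort a list of strings, returning (sorted list, LCP array)."""
    if len(S) <= 1:
        return list(S), [0] * len(S)
    mid = len(S) // 2
    return lcp_merge(*lcp_mergesort(S[:mid]), *lcp_mergesort(S[mid:]))
```

Calling `lcp_mergesort(["aab", "aba", "aaa", "acb"])` yields the sorted strings together with an LCP array that further merge steps can consume directly.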
As h_1 and h_2 are initialized with 0 in line 2 of Algorithm 2, setting p = ε fulfills the preconditions of LCP-Compare for the first iteration. During any iteration, the winner string is written to the output sequence, with its corresponding LCP value being assigned to the equivalent position of the LCP array in line 5. In order to restore the invariant, the local LCP values are updated in line 7. Whereas the winner's new local LCP value is loaded from the winner's input LCP array, the loser's one is taken from the result of LCP-Compare. Therefore the invariant holds for the winner due to the definition of LCP arrays, and for the loser due to the postcondition of LCP-Compare.

With the given binary LCP-Merge algorithm, binary LCP-Mergesort can be implemented as shown in Algorithm 3.

Algorithm 3: Binary LCP-Mergesort
Input: S: sequence of strings; assume S[|S|] = ∞
1 if |S| ≤ 1 then      // Base case
2     return (S[1], 0)
3 else
4     l_1/2 := |S|/2
5     S_1 := {S[1], S[2], ..., S[l_1/2]}, S_2 := {S[l_1/2 + 1], S[l_1/2 + 2], ..., S[|S|]}
6     (S'_1, H'_1) := LCP-Mergesort(S_1), (S'_2, H'_2) := LCP-Mergesort(S_2)
7     return LCP-Merge(S'_1, H'_1, S'_2, H'_2)
Output: S_0: sorted sequence containing S_1 ∪ S_2; H_0: the corresponding LCP array

3.1.3. Computational Complexity of Binary LCP-Mergesort

Although LCP-Mergesort was first introduced by Ng and Kakehi [NK08], they did not provide a worst case analysis. However, their average case analysis shows the computational complexity of LCP-Mergesort to remain O(n log n) on average, whereas the complexity of standard recursive string Mergesort tends to be greater than O(n log n). In this section the worst case computational complexity of LCP-Mergesort is analysed and shown to be in O(n log n + L).

Clearly the number of string comparisons of LCP-Mergesort (i.e. calls of LCP-Compare) is equal to the number of comparisons of Mergesort with atomic keys and therefore in O(n log n). However, in contrast to Mergesort with atomic keys, LCP-Compare needs to compare strings, which in general requires more than a single comparison to determine the sort order. In the following, the number of comparisons required in each case of LCP-Compare is counted.

Whenever LCP-Compare is called, integer comparisons of the two LCPs are needed to determine which case to select. The three cases can be distinguished with a maximum of two integer comparisons, resulting in an asymptotically constant cost for this step. Following this, cases two and three do not require any more calculations and can immediately return the result. However, in case one, the character comparing loop (line 3 of Algorithm 1) is executed, starting with the character at position h_a + 1.
If both characters are found to be equal, h' is increased by one, and as it is later set to be the new LCP of the loser (line 7 of Algorithm 2), the overall LCP value is increased by one, respectively. Because LCP values never get dropped or decremented, this case may only occur L times in total, with L being the sum of all LCPs. If the characters are not equal, the loop is terminated and the result can be returned. Like before, the three comparisons in lines 3, 5 and 6 are counted as one ternary comparison. Since this case terminates the loop, it occurs exactly as often as case 1 is entered. However, this is limited by the number of times LCP-Compare is called, which is in O(n log n). As this is only an upper bound, for most string sets cases two and three (see Section 3.1.1) reduce the number of times case one is entered.

In conclusion, LCP-Mergesort's computational complexity is shown to have the following upper bound, where c_i is the cost of an integer comparison and c_c the cost of a character comparison:

O((n log n) · c_i + (L + n log n) · c_c) = O(n log n) · c_i + O(n log n + L) · c_c = O(n log n + L) comparisons.

In their average case analysis, Ng and Kakehi [NK08] show the total number of character comparisons to be about n(µ_a − 1) + P_ω n log_2 n, where µ_a is the average length of the distinguishing prefixes and P_ω the probability of entering case one in LCP-Compare (Algorithm 1). Assuming P_ω = 1 and µ_a = D/n, their result matches the worst case up to the minor difference between D and L.

3.2. K-Way LCP-Merge

In order to improve cache efficiency and as preliminary work for parallel multiway LCP-Mergesort and NUMA optimized pS5, K-way LCP-Merge was developed. A common and well-known multiway merging method is to use a binary comparison to construct a tournament tree, which can be represented as a binary tree structure [Knu98]. Although this allows efficient merging of multiple streams of sorted inputs, no implementation of an LCP-aware tournament tree was found in the literature.

3.2.1. Simple Tournament Tree

Multiway merging is commonly seen as selecting the winner of a tournament of K players. This tournament is organized in a binary tree structure, with each node representing a match between two players. Although a tournament tree can also be represented as a winner tree, for our implementations a loser tree is more intuitive. Therefore, the loser of a match is stored in the node representing the match, whereas the winner ascends to the parent node and faces the next game. With this method repeatedly applied, an overall winner is found and usually placed on top of the tree in an additional node.

We do not consider the players as parts of the actual tournament tree, since they are only used here to ease comprehensibility and are not needed in actual code. Therefore the tournament tree has exactly K nodes and the nodes reference every player exactly once.

Figure 5 shows the structure of a simple tournament tree with K = 4. As visualized, node v of the tournament tree stores the index n[v] of the input stream of the corresponding match's loser, rather than the actual string or a reference to it. In the exemplary configuration shown in Figure 5b, the strings aab, aac, bca and aaa compete to become the overall winner.
The winner's path P from its player's node to the top is shown in red, because it will be of importance for selecting the next winner. However, before the first winner can be selected, an initial round needs to be played with all players, starting from the bottom of the tree. Since the winners of the first level, in this case the lexicographically smaller strings, ascend to the level above, the next matches can be played. After the topmost level is reached, the first overall winner is found, which is therefore the smallest string. During this initial round all matches, represented by the loser nodes of the tree, need to be played exactly once. As the tree contains exactly K nodes, one of which only holds the overall winner, K − 1 comparisons need to be executed. The initialization phase is further illustrated with an example of an LCP-aware tournament tree in Figures 8 to 11.

Figure 5: Structure of a simple tournament tree with K = 4. (a) Structural view: winner n[1], losers n[2] to n[4], players s[1] to s[4]. (b) Example with the winner path P in red: winner 4; losers 1, 2, 3; players aab, aac, bca, aaa.

After the initial round is finished, only log_2 K matches need to be played to determine the next winner and therefore the next string to be written to the output. This is achieved by first replacing the current winner's player with the next string of its corresponding input sequence. In order to find the winner of the new set of players, all games along the red path P of the former winner in Figure 5b must be replayed. Thus the new player needs to play its first match at the bottom of the tree against the former loser of that match. Again, whoever loses the match stays at the node representing the match, whereas the winner ascends to the next level. Since the binary tree has log_2 K levels, the new overall winner is found with log_2 K comparisons. The steps for replaying the tournament after removing the current winner are also further illustrated in the example of an LCP-aware tournament tree in Figures 11 to 13.

Repeatedly applying this process until all input streams are emptied realises the K-way merge. Assuming sentinels for empty inputs, special cases can be avoided. Furthermore, K can be assumed to be a power of two, since missing sequences can easily be represented by empty streams. Hence, the tournament tree can be assumed to be a perfect binary tree. Due to the use of one-based arrays, traversing the tree upwards, that is, calculating the parent p of a node v, can be done efficiently by calculating p = ⌈v/2⌉. This leads to a very efficient way of finding the path from a player's leaf to the root of the tree.

3.2.2. LCP-Aware Tournament Tree

In this section, the simple tournament tree described in the previous section is extended to an LCP-aware tournament tree. First of all, to reduce the number of character comparisons done during the matches, we use LCP-Compare (see Section 3.1.1) to exploit the input sequences' LCP arrays.
Because we want to avoid comparing characters we already know to be equal, we also store an LCP value h[v] in the node alongside the index of the loser's input sequence. The value stored in h[v] is the LCP of the participants of the match of node v.

Figure 6 visualizes the structure of the new LCP-aware tournament tree. In addition to the winner, loser and player nodes already shown in Figure 5, the input and output sequences have been added as well. These will be useful in the example illustrated in Section 3.2.3.

Figure 6: Structure of an LCP-aware tournament tree with input and output sequences, K = 4. Output (H_0[1], S_0[1]); winner (h[1], n[1] = w); losers (h[2], n[2]) to (h[4], n[4]); players (h_k, s_k); inputs (H_k[1], S_k[1]) and (H_k[2], S_k[2]) for k = 1, ..., 4.

As pictured in Figure 6, the nodes of the LCP-aware tournament tree now contain the LCP value h[v] alongside n[v], the index of the input sequence of the corresponding match's loser. The players of the tournament are the first elements of the remaining input sequences. Since we now describe the process that will be summarized in Algorithm 4, and to emphasize their position as participants of the tournament, they are referred to as players and kept in an additional array. Just like with the simple tournament tree of Figure 5, only the winner and loser nodes are actually part of the tree. Therefore the LCP-aware tournament tree has exactly K nodes.

Like the standard tournament tree, the LCP-aware tournament tree also needs to be initialized first. As mentioned before, LCP-Compare is used to replace the standard compare operation. However, LCP-Compare does not just need two strings as parameters, but also two LCPs to a common lexicographically smaller string. For the process of tree initialization, these LCPs are always 0 and the common base string is ε. Therefore the preconditions of LCP-Compare are fulfilled and it can be applied to compare the given strings like a normal string comparison procedure.

In order to extract the second winner, we need to make sure that the preconditions of LCP-Compare are fulfilled after the initial round has been completed. Let w = n[1] be the index of the input sequence of the current overall winner, which is to be removed. Exactly as with the simple tournament tree, it is clear that w won all matches along the path P from its leaf to the top. Therefore all LCP values h[v] stored in the nodes along this path are given by h[v] = lcp(s_n[v], s_w), and s_w ≤ s_n[v] holds for all v ∈ P. Let s'_w be the successor of s_w in the input sequence with index w. Then the definition of LCP arrays specifies the corresponding LCP of the input sequence to be h'_w = lcp(s_w, s'_w), with s_w ≤ s'_w. Combining these observations, one can determine that all strings that might get compared by LCP-Compare, i.e. those along path P, have the common predecessor s_w, and all the LCP values used refer to s_w.
Therefore the correctness of the preconditions of LCP-Compare is ensured. Likewise it needs to be shown that after n winners have been removed, the next one can also be removed and the matches replayed as described. However, the exact same argument can be applied again, and so merging K sequences with K-Way-LCP-Merge works as desired. Pseudo code of K-Way-LCP-Merge can be seen in Algorithm 4.

Algorithm 4: K-Way-LCP-Merge
Input: S_k: sorted sequences of strings, H_k: the corresponding LCP arrays; assume sentinels S_k[|S_k| + 1] = ∞ for all k = 1, ..., K and K being a power of two.
1  i_k := 1, h_k := 0, for all k := 1, ..., K
2  for k := 1, ..., K do      // Play initial games
3      s[k] := S_k[1]
4      x := k, v := K + k
5      while v is even & v > 2 do
6          v := v/2
7          (x, ·, n[v], h[v]) := LCP-Compare(x, s[x], 0, n[v], s[n[v]], 0)
8      v := ⌈v/2⌉
9      (n[v], h[v]) := (x, 0)
10 j := 1
11 while j ≤ |S_1| + ... + |S_K| do      // Loop over all elements in inputs
12     w := n[1]      // Index of the winner of last round
13     (S_0[j], H_0[j]) := (s[w], h[1]), j++      // Write winner to output
14     i_w++, s[w] := S_w[i_w]
15     v := K + w, (x, h') := (w, H_w[i_w])      // v index of contested, x index of contender
16     while v > 2 do      // Traverse tree upwards and play the games
17         v := ⌈v/2⌉      // Calculate index of contested
18         (x, h', n[v], h[v]) := LCP-Compare(x, s[x], h', n[v], s[n[v]], h[v])
19     (n[1], h[1]) := (x, h')      // Now the tournament tree is complete again
Output: S_0: sorted sequence containing S_1 ∪ ... ∪ S_K; H_0: the corresponding LCP array

To refine the calculations done in Algorithm 4, we first focus on the implementation of the initialization phase realized by the loop in line 2. The functionality of the loop is based on viewing the tournament tree as a perfect binary odd-even tree as shown in Figure 7, where the colours visualize the parity of the indexes written in the nodes.

Figure 7: Binary odd-even tree with K = 8 (winner node 1; loser nodes 2 to 8; players 1 to 8; colours mark the winner node and odd and even node indexes).

During the initialization phase, the loop iterates over all players, starting from index k = 1, and lets them play as many matches as are currently available. Therefore in the first iteration of the loop the string of player k = 1 is to be positioned in the tree. Due to line 4, this results in v = K + k being odd. Therefore the inner loop is not entered and the index of the string is directly written to the odd node with index v = ⌈(K + k)/2⌉ = 5 in Figure 7. In the second iteration with k = 2, the inner loop in line 5 is executed once, as v = 10 is even before the first iteration and odd the next time. The comparison is done with the odd node v = 10/2 = 5. After the inner loop has finished, the index of the previous game's winner is written to the next parent node.

To sum up, comparisons need to be done at the parents of all even nodes (this time including the player nodes). The remaining winner of the last comparison then has to be written to the next parent node, which is done in line 9.

To ensure the correctness of this procedure, all nodes used for comparisons need to be already initialized, and the last parent node p_k of the iteration for player k needs to be empty before the run. From Figure 7 one can easily see that even nodes are always the right child of their parents, whereas odd nodes are always the left child, except for node 2, as node 1 is a special case. Let v_e be the even and v_o the odd child of the parent node v_p. The parent's left sub-tree, with v_o on its top, must already be fully initialized, since the initialization starts from the left side and all leaves in that sub-tree have a lower player index. Because the left sub-tree is already initialized, the match of v_o was already played and so its winner's index has been stored in v_p, which therefore is initialized and can be used for comparison with the winner of v_e.
When looking at the saving of the last winner in line 9, we need to check that this node is not initialized yet, as otherwise it would be overwritten. Here, a similar argument can be used. Since the last node being compared is an odd node v_o (except for node 2), its complete sub-tree is initialized. However, no players positioned to the right of this sub-tree have been processed yet, and so the right child of the parent of v_o cannot be set yet either.

3.2.3. K-Way LCP Tournament Tree Example

The following example further illustrates how a K-Way LCP tournament tree, implicitly used for K-Way-LCP-Merge (Algorithm 4), is constructed during the initialization phase and rebuilt after the current minimum has been removed. The example uses a tournament tree with K = 4 input sequences, and its layout follows the structural view of the tree shown in Figure 6.

The four sequences contain the following strings with corresponding LCPs:
Sequence 1: aab and aba with an LCP of 1;
Sequence 2: aac and aad with an LCP of 2;
Sequence 3: bca and ca with an LCP of 0;
Sequence 4: aaa and acb with an LCP of 1.

Figure 8 illustrates the state before the initialization of the tree has started. The sorted input sequences with the appropriate LCPs are shown at the bottom; players and the tree's nodes are not initialized yet.

Figure 8: LCP-aware tournament tree example: part 1 (all tree nodes and players empty).

Figure 9 shows the state after the first iteration of the initialization loop in line 2: the first player and its parent tree node are initialized. The LCP in the tree node has been set to 0, because it is the LCP to the string ε, which is a lexicographically smaller common string to all players.

Figure 9: LCP-aware tournament tree example: part 2 (node 3 holds (0, aab); all other nodes empty).

In Figure 10 the tree's state after the second run of the initialization loop in line 2 is visualized. The string aab won the match with aac and moved upwards to the next free position, whereas aac stays at the loser position with its current LCP h[3] being set to 2.

Figure 10: LCP-aware tournament tree example: part 3 (losers (0, aab), (2, aac); node 4 and the winner node empty).

The tournament tree's state after the third initialization step is shown in Figure 11. The first string of the third input sequence moved up to its parent node. However, since the stream's index is odd, the string can directly be placed in the match's node and does not need to be compared, as no other string can be there yet.

Figure 11: LCP-aware tournament tree example: part 4 (losers (0, aab), (2, aac), (0, bca); winner node empty).

Figure 12 shows the fully initialized tree after the fourth initialization step, which is the tree's state just before the loop in line 11 of Algorithm 4 is entered. During this last step, the string aaa is first compared with bca. Because aaa is lexicographically smaller, it ascends the tree to attend the next match, whereas bca stays at the match's node with the common LCP lcp(aaa, bca) = h[4] = 0. As aaa also wins the match with aab, it is written to the root of the tree and aab stays at the loser position with the new LCP lcp(aaa, aab) = h[2] = 2. The red line illustrates the winner's path to the top of the tree.

Figure 12: LCP-aware tournament tree example: part 5 with winner path P (red) (winner (0, aaa); losers (2, aab), (2, aac), (0, bca)).

The intermediate state after the first winner has been removed and written to the output stream is displayed in Figure 13. Since the winner's input stream has moved forward, the string acb replaces the former winner. The LCP of acb is taken from the LCP array of the input stream, as it directly refers to aaa. With these steps done up to line 15 of Algorithm 4, the new set of players is complete and ready to compete with each other.

Figure 13: LCP-aware tournament tree example: part 6 (output (0, aaa); winner node empty; losers (2, aab), (2, aac), (0, bca); new player (1, acb)).

After the inner loop in line 16 of Algorithm 4 finishes, the situation shown in Figure 14 is reached. During the iterations, the following matches were played: acb won against bca, and aab won the match with acb. Both matches were determined by the LCP values alone. Therefore not a single character comparison was needed, and the effect of exploiting the LCPs in LCP-Compare becomes visible.

Figure 14: LCP-aware tournament tree example: part 7 (output (0, aaa); winner (2, aab); losers (1, acb), (2, aac), (0, bca)).
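The example can be replayed with a compact Python sketch of Algorithm 4. The sketch is illustrative only and rests on several assumptions that are not part of this work: the names `kway_lcp_merge`, `head` and `INF` are hypothetical, input positions are zero-based, `None` stands in for the ∞ sentinel, and a sentinel is always given the LCP 0. Node and stream indexes are kept one-based so that the parent of node v remains ⌈v/2⌉.

```python
INF = None  # stand-in for the sentinel of an exhausted input (assumption)


def lcp_compare(a, s_a, h_a, b, s_b, h_b):
    """LCP-Compare on zero-based strings; a sentinel loses every match."""
    if h_a != h_b:
        return (b, h_b, a, h_a) if h_a < h_b else (a, h_a, b, h_b)
    if s_a is INF or s_b is INF:
        return (b, h_b, a, 0) if s_a is INF else (a, h_a, b, 0)
    h = h_a
    while h < len(s_a) and h < len(s_b) and s_a[h] == s_b[h]:
        h += 1
    if h == len(s_a) or (h < len(s_b) and s_a[h] < s_b[h]):
        return a, h_a, b, h
    return b, h_b, a, h


def kway_lcp_merge(seqs, lcps):
    """Merge K sorted string lists (K a power of two) with their LCP arrays."""
    K = len(seqs)
    pos = [0] * (K + 1)      # read position of stream k (streams numbered 1..K)
    s = [INF] * (K + 1)      # current player of stream k
    n = [0] * (K + 1)        # n[v]: stream index stored in node v
    h = [0] * (K + 1)        # h[v]: LCP stored in node v

    def head(k):
        return seqs[k - 1][pos[k]] if pos[k] < len(seqs[k - 1]) else INF

    for k in range(1, K + 1):                 # play initial games
        s[k] = head(k)
        x, v = k, K + k
        while v % 2 == 0 and v > 2:
            v //= 2
            x, _, n[v], h[v] = lcp_compare(x, s[x], 0, n[v], s[n[v]], 0)
        v = (v + 1) // 2                      # ceil(v / 2)
        n[v], h[v] = x, 0

    out_s, out_h = [], []
    for _ in range(sum(len(q) for q in seqs)):
        w = n[1]                              # winner of the last round
        out_s.append(s[w])
        out_h.append(h[1])
        pos[w] += 1
        s[w] = head(w)
        hx = lcps[w - 1][pos[w]] if s[w] is not INF else 0
        x, v = w, K + w
        while v > 2:                          # replay the games along the winner's path
            v = (v + 1) // 2
            x, hx, n[v], h[v] = lcp_compare(x, s[x], hx, n[v], s[n[v]], h[v])
        n[1], h[1] = x, hx
    return out_s, out_h


seqs = [["aab", "aba"], ["aac", "aad"], ["bca", "ca"], ["aaa", "acb"]]
lcps = [[0, 1], [0, 2], [0, 0], [0, 1]]
merged, merged_lcps = kway_lcp_merge(seqs, lcps)
```

For the four example sequences the first two outputs are aaa with LCP 0 and aab with LCP 2, matching the winner nodes of Figures 12 and 14.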

3.3. Parallelization of K-Way LCP-Merge

This section focuses on the parallelization of K-Way LCP-Merge, merging K sorted input sequences of strings with their corresponding LCP arrays. When trying to solve problems in parallel, a common approach is to split up the work into sub-tasks, process the sub-tasks in parallel and, in the end, put the pieces back together. Applying this to sorting, one can let any sequential sorting algorithm work on parts of the input in parallel. However, merging the resulting sorted sequences cannot be parallelized without the significant overhead needed to split up the work into work-disjoint sub-tasks [Col88]. Instead of being able to simply cut the input sequences into pieces, the merging problem needs to be divided into disjoint parts, as commonly done in practical parallel merge sort implementations [AS87], [SSP07].

One well-known way to accomplish the partitioning for an atomic merge sort is to sample the sorted input sequences to get a set of splitters. After they have been sorted, each of them can be searched (e.g. via binary search) in all the input sequences. The positions found for a splitter define splitting points, separating disjoint parts of the merging problem. This approach is directly adapted to our LCP-aware multiway string merging algorithm in Section 3.3.1. In the following we refer to this splitting method, which creates multiple work-disjoint parts in a single run, as classical splitting.

As a simplification of classical splitting, binary splitting, which creates only two jobs per run, is introduced. Here we do not sample and split for several splitters, but for just a single splitter. This approach is explained in more detail in Section 3.3.2.

In Section 3.3.3 a new splitting algorithm is defined. By exploiting the LCP arrays of the input sequences to find splitting points, it is possible to almost fully avoid random memory accesses to characters of strings, which normally cause a significant amount of cache faults.
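The idea behind classical splitting can be sketched in a few lines of Python using the standard `bisect` module. This is only an illustration under assumptions: the function name `classical_split`, the equidistant sampling scheme and the parameter `samples_per_seq` are hypothetical choices and are not taken from the implementation described in this work. The LCP arrays would be sliced with the same bounds; this is omitted here for brevity.

```python
from bisect import bisect_left


def classical_split(seqs, num_parts, samples_per_seq=2):
    """Split K sorted string lists into num_parts range-disjoint merge jobs."""
    # Sample a few equidistant strings from every sorted input.
    samples = []
    for seq in seqs:
        step = max(1, len(seq) // (samples_per_seq + 1))
        samples.extend(seq[step::step][:samples_per_seq])
    samples.sort()

    # Choose num_parts - 1 equidistant splitters from the sorted samples.
    splitters = []
    if samples:
        splitters = [samples[(j * len(samples)) // num_parts]
                     for j in range(1, num_parts)]

    # Binary search every splitter in every input to obtain the splitting points.
    bounds = [[0] * len(seqs)]
    for sp in splitters:
        bounds.append([bisect_left(seq, sp) for seq in seqs])
    bounds.append([len(seq) for seq in seqs])

    # Part j consists of seq[bounds[j][k]:bounds[j + 1][k]] for every input k.
    return [[seq[lo:hi] for seq, lo, hi in zip(seqs, b0, b1)]
            for b0, b1 in zip(bounds, bounds[1:])]


seqs = [["aab", "aba"], ["aac", "aad"], ["bca", "ca"], ["aaa", "acb"]]
parts = classical_split(seqs, num_parts=2)
```

Every string in one part is smaller than the splitter that bounds it, and every string in the following part is at least as large, so each part can be merged independently and the results concatenated.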
Another way to split the input sequences of an atomic merge into exactly p equal-sized range-disjoint parts was proposed by Varman et al. [PJV91]. Although their algorithm allows creating equally-sized parts with atomic keys, this approach is not sufficient for string merging. Static load balancing is not an efficient solution, because the cost of an equal number of string comparisons varies with the length of the distinguishing prefixes. Therefore, oversampling (creating more tasks than there are processing units available) and dynamic load balancing are required. Since the benefit of exact splitting only appears with atomic keys, the algorithm has not been considered any further in this work.

Instead, the same lightweight dynamic load balancing framework as for pS5 [BS13] is used. Every thread currently executing a merge job regularly checks whether any threads are idle because no jobs are left in the queue. In order to reduce balancing overhead, the threads execute this check only about every 4000 outputted strings. If an idle processing unit is detected by a thread, its K-way merge job is further split up into new jobs by applying the heuristic above.

3.3.1. Classical Splitting with Binary Search for Splitters

As described in the previous section, the merge problem cannot easily be divided into disjoint sub-tasks. One widely used approach to create range-disjoint parts is to separate the elements of the input sequences by sampled splitters. After sorting these splitters, a binary search can be used to find the splitting positions.

The basic principle behind this algorithm is that an arbitrary string can be used to split up a sequence of strings into two range-disjoint pieces. To do so with a given splitter