COMBINATORIAL PATTERN MATCHING

Kristin Fields
6 years ago
1 COMBINATORIAL PATTERN MATCHING

2 OUTLINE: EXACT MATCHING Tabulating patterns in long texts: short patterns (direct indexing); longer patterns (hash tables). Finding exact patterns in a text: brute force (run time); efficient algorithms with pattern preprocessing (single pattern: Knuth-Morris-Pratt; multiple patterns: Aho-Corasick algorithm); efficient algorithms with text preprocessing (suffix trees; Burrows-Wheeler Transform-based).

3 OUTLINE: APPROXIMATE MATCHING Algorithms for approximate pattern matching; heuristics behind BLAST; statistics behind BLAST; alternatives to BLAST: BLAT, PatternHunter, etc.

4 STRING ENCODING It is often necessary to index strings; a convenient way to do this is first to convert strings to integers. Given a string s of length n on alphabet A (letters numbered 0..c-1), with c = |A| characters, we can define a map code(s) ∈ [0, c^n) as
$\mathrm{code}(s) = s[1]c^{n-1} + s[2]c^{n-2} + \dots + s[n-1]c + s[n]$
There are c^L different L-mers, but at most n-L+1 different L-mers in a text of length n.
Example with A=0, C=1, G=2, T=3:
AGT: A=0*16 + G=2*4 + T=3 = 11
ATA: A=0*16 + T=3*4 + A=0 = 12
TGG: T=3*16 + G=2*4 + G=2 = 58
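The encoding above can be sketched in a few lines of Python. The function name `code` follows the slide; the default alphabet order (A, C, G, T) is taken from the example, and everything else is an illustrative assumption rather than the author's implementation.

```python
def code(s, alphabet="ACGT"):
    """Read s as a base-c number, where c is the alphabet size."""
    c = len(alphabet)
    digit = {ch: i for i, ch in enumerate(alphabet)}
    value = 0
    for ch in s:
        # Horner's rule: shift the digits seen so far and append the next one
        value = value * c + digit[ch]
    return value
```

For instance, `code("AGT")` evaluates 0*16 + 2*4 + 3.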

5 TABULATING SHORT PATTERNS If L is small (e.g. 3 or 4), so that the total number of patterns is not too large and many of them are likely to be found in the input text, then we can use direct indexing to tabulate/locate strings efficiently. The distribution of short strings in genetic sequences is biologically informative, e.g.: Synonymous codons (triplets of nucleotides, 64 patterns) are often used preferentially in organisms (transcriptional selection, secondary structure, etc.). The distribution of short nucleotide L-mers (e.g. L=4, 256 patterns) can be useful for detecting horizontal (from species to species) gene transfer and for gene finding. The location of short amino-acid strings (e.g. L=3, 8000 patterns) is useful for finding seeds for BLAST.

6 SHORT PATTERN SCAN
Data: Alphabet A, text T, pattern length p
Result: Frequency of each pattern in the text
  R ← array(|A|^p); n ← len(T);
  for i := 1 to n-p+1 do
    R[code(T[i : i+p-1])] += 1;
  end
  return R;
Cost of computing each code: O(L) if done naively (here L = p, the pattern length); O(1) if the previous code is used to compute the current one.
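A minimal Python sketch of the scan, using the O(1) update of the previous code. The name `tabulate` and the 0-based indexing are assumptions made for illustration; the update drops the leading digit of the previous window and appends the new letter.

```python
def tabulate(text, p, alphabet="ACGT"):
    """Count every length-p pattern in text by direct indexing."""
    c = len(alphabet)
    digit = {ch: i for i, ch in enumerate(alphabet)}
    counts = [0] * (c ** p)          # one slot per possible pattern
    if len(text) < p:
        return counts
    top = c ** (p - 1)               # weight of the leading digit
    current = 0
    for ch in text[:p]:              # code of the first window, O(p)
        current = current * c + digit[ch]
    counts[current] += 1
    for i in range(p, len(text)):    # every later window in O(1)
        leading = digit[text[i - p]]
        current = (current - leading * top) * c + digit[text[i]]
        counts[current] += 1
    return counts
```

The returned list is indexed by the integer code of each pattern, so `counts[6]` is the frequency of ACG when p = 3.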

7 TABULATING/LOCATING LONGER PATTERNS Finding repeats/motifs, e.g. in ATGGTCTAGGTCCTAGTGGTC: flanking sequences in genomic rearrangements; motifs such as promoter regions, functional sites, immune targets; cellular immunity targets in pathogens (e.g. protein 9-mers). There are too many patterns to store in an array, and even if we could store them, the array would be very sparse. E.g. there are 20^9 = 512,000,000,000 amino-acid 9-mers, but an average HIV-1 sequence (~3,000 amino acids long) contains at most ~3,000 unique 9-mers.

8 HASH TABLES Hash tables make it possible to efficiently (O(1) on average) store and retrieve a small subset of a large universe of records. Hash tables implement associative arrays (dictionaries) in a variety of languages (Python, Perl, etc.). The universe (records): e.g. 512,000,000,000 amino-acid 9-mers. Hash function: record → hash key. Note: because there are more records than array indices, this function is NOT one-to-one. The storage: hash table (array), whose size is much smaller than the size of the universe.

9 A SIMPLE HASH FUNCTION A reasonable hash function (on integer records i) is: i → i mod P, where P is a prime number and also the natural size of the hash table. Hash keys range from 0 to P-1. If the records are uniformly distributed, so will be their hash keys. Example with P=101:
4-mer (256 possible) | Integer code | Hash key
ACGT | 27 | 27
CCCA | 84 | 84
TGCC | 229 | 27 (COLLISION with ACGT)

10 COLLISIONS Collisions are frequent even for sparsely populated (lightly loaded) hash tables. Load level: α = (number of entries in hash table)/(table size). The birthday paradox: what is the probability that two people out of a random group of n (<365) people share a birthday (in hash table terms, what is the probability of a collision if people = records and hash keys = birthdays)?
$P(n) = 1 - \prod_{k=0}^{n-1}\left(1 - \frac{k}{365}\right)$
[Table: n, α, P(n)]
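The collision probability can be evaluated directly as one minus the probability that all n keys are distinct. The sketch below assumes uniformly distributed keys; the function name and the `table_size` parameter are illustrative.

```python
def collision_probability(n, table_size=365):
    """Probability that at least two of n uniform keys coincide."""
    p_distinct = 1.0
    for k in range(n):
        # the (k+1)-th record must avoid the k keys already taken
        p_distinct *= (table_size - k) / table_size
    return 1.0 - p_distinct
```

With 365 keys the probability first exceeds one half at n = 23, i.e. at a load level of only 23/365, roughly 0.06.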

11 DEALING WITH COLLISIONS Several strategies exist to deal with collisions; the simplest one is chaining: each hash key is associated with a linked list of all records sharing the hash key.
Hash Key 0: CGCC → AAAA
Hash Key 1: AAAC
Hash Key 2: ...
4-mer (256 possible) | Integer code | Hash key
AAAA | 0 | 0
AAAC | 1 | 1
CGCC | 101 | 0
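A small sketch of chaining, assuming integer records and the i mod P hash function with P = 101. Python lists stand in for the linked lists here, and the class and method names are hypothetical.

```python
class ChainedHashTable:
    """Hash table with chaining; each bucket holds (record, value) pairs."""

    def __init__(self, size=101):
        self.size = size                          # P, ideally a prime
        self.buckets = [[] for _ in range(size)]  # one chain per hash key

    def key(self, record):
        return record % self.size

    def insert(self, record, value):
        bucket = self.buckets[self.key(record)]
        for i, (stored, _) in enumerate(bucket):
            if stored == record:                  # overwrite an existing record
                bucket[i] = (record, value)
                return
        bucket.append((record, value))            # otherwise extend the chain

    def lookup(self, record):
        for stored, value in self.buckets[self.key(record)]:
            if stored == record:
                return value
        return None                               # record is not in the table
```

Inserting the codes 0 (AAAA), 1 (AAAC) and 101 (CGCC) puts two records in the chain for key 0 and one in the chain for key 1.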

12 HASH TABLE PERFORMANCE Retrieving/storing a record in a hash table of size m with load factor α. Worst case (all records have the same key): O(m). Expected run time is O(1), assuming uniformly distributed records and hash keys. Expected number of comparisons: record is not in the table: $E_N = e^{-\alpha} + \alpha + O(1/m)$; record is in the table: $E_S = 1 + \alpha/2 + O(1/m)$. This is because the probability of having many collisions with the same key is quite low (even though the probability of SOME collisions is high).

13 EXACT PATTERN MATCHING Motivation: searching a database for a known pattern. Goal: find all occurrences of a pattern in a text. Input: pattern P = p[1]...p[n] and text T = t[1]...t[m] (n ≤ m). Output: all positions 1 ≤ i ≤ m-n+1 such that the n-letter substring T[i : i+n-1] of the text, starting at i, matches the pattern P. Desired performance: O(n+m).

14 BRUTE FORCE PATTERN MATCHING
Data: Pattern P, text T
Result: The list of positions in T where P occurs
  n ← len(P); m ← len(T);
  for i := 1 to m-n+1 do
    if T[i : i+n-1] = P then output i; end
  end
A substring comparison can take from 1 to n (left-to-right) character comparisons.
Text: GGCATC; Pattern: GCAT
i=1 (2 comparisons): GGCATC vs GCAT at offset 1
i=2 (4 comparisons): GGCATC vs GCAT at offset 2 (match)
i=3 (1 comparison): GGCATC vs GCAT at offset 3
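The brute-force scan can be sketched as follows. The comparison counter is an addition for illustration only, and positions are returned 1-based to match the slides.

```python
def brute_force(pattern, text):
    """Return (1-based match positions, number of character comparisons)."""
    n, m = len(pattern), len(text)
    positions, comparisons = [], 0
    for i in range(m - n + 1):
        j = 0
        while j < n:                      # compare left to right
            comparisons += 1
            if text[i + j] != pattern[j]:
                break                     # stop at the first mismatch
            j += 1
        if j == n:                        # all n letters matched
            positions.append(i + 1)
    return positions, comparisons
```

On the example above (text GGCATC, pattern GCAT) the three offsets cost 2, 4 and 1 comparisons respectively.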

15 BRUTE FORCE RUN TIME Worst case: O(nm). This can be achieved, for example, by searching for P=AA...C in text T=AA...A, because each substring comparison takes exactly n steps. Expected on random text: O(1) per substring comparison, hence O(m) overall. This is because a substring comparison takes on average $(1 - q^n)/(1 - q)$ character comparisons (q = 1/alphabet size). For n = 20 and q = 1/4 (nucleotides), a substring comparison will take on average about 4/3 operations. Genetic texts are not random, so the performance may degrade.

16 IMPROVING THE RUN TIME The search pattern can be preprocessed in O(n) time to eliminate backtracking in the text and hence guarantee O(n+m) run time. A variety of procedures, starting with the Knuth-Morris-Pratt algorithm in 1977, take this approach. It makes use of the observation that if a comparison fails at pattern position i, then we can shift the pattern by i-b(i) positions, where b(i) depends only on the pattern, and continue comparing at the same or the next position in the text, thus avoiding backtracking. These types of algorithms are popular in text editors and for mutable texts, because they do not require preprocessing of the (large) text. [Figure: pattern ACGACACGACACA aligned under text ACAACGACACGACCACAACAGCAATG, before and after a SHIFT]
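A compact sketch of the Knuth-Morris-Pratt idea in its usual textbook form. Here `border_table` plays the role of b(i): entry i holds the length of the longest proper prefix of the pattern that is also a suffix of its first i+1 letters. The function names and the 1-based output positions are illustrative choices.

```python
def border_table(pattern):
    """b[i] = length of the longest proper border of pattern[:i+1]."""
    b = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = b[k - 1]                  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        b[i] = k
    return b


def kmp_search(pattern, text):
    """Return 1-based start positions of pattern in text, in O(n+m)."""
    if not pattern:
        return []
    b = border_table(pattern)
    hits, k = [], 0                       # k = pattern letters matched so far
    for i, ch in enumerate(text):         # the text index never moves back
        while k > 0 and ch != pattern[k]:
            k = b[k - 1]                  # shift the pattern, keep the text index
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 2)        # convert the 0-based start to 1-based
            k = b[k - 1]
    return hits
```

The shift after a mismatch with k letters matched is k minus the border length, which corresponds to the i-b(i) rule on the slide.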

17 EXACT MULTIPLE PATTERN MATCHING The problem: given a dictionary of D patterns P_1, P_2, ..., P_D (total length n) and a text T, report all occurrences of every pattern in the text. This arises, for instance, when one is comparing multiple patterns against a database. Assuming an efficient implementation of individual pattern comparison, this problem can be solved in O(Dm+n) time by scanning the text D times. Aho and Corasick (1975) showed how this can be done efficiently in O(m+n) time. Their algorithm uses the idea of a trie (from the word "retrieval"), or prefix trie. Intuitively, we can reduce the amount of work by exploiting repetitions in the patterns.

18 PREFIX TRIE Patterns: ape, as, ease. Constructed in O(n) time, one word at a time. Properties of a trie: Stores a set of words in a tree. Each edge is labeled with a letter. Each node is labeled with a state (order of creation). Any two edges sharing a parent node have distinct labels. Each word can be spelled by tracing a path from the root to a leaf. [Figure: the trie after inserting ape, then as, then ease; states numbered 1-8]

19 SEARCHING TEXT FOR MULTIPLE PATTERNS USING A TRIE: THREADING Suppose we want to search the text appease for the occurrences of patterns ape, as and ease, given their trie. The naive way to do it is to thread (i.e. spell the word using tree edges from the root) the text starting at position i, until either: A leaf (or specially marked terminal node) is reached (a match has been found) Spelling cannot be completed (no match)
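The naive threading procedure can be sketched with a trie stored as nested dictionaries. This is the quadratic baseline, not the Aho-Corasick automaton; the terminal marker (the key `None`) and the function names are assumptions for illustration.

```python
def build_trie(patterns):
    """Build a prefix trie; a None key marks a terminal node."""
    root = {}
    for word in patterns:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})   # reuse shared prefixes
        node[None] = word                     # terminal marker stores the word
    return root


def thread_all(text, trie):
    """Thread the text from every start position; 1-based (position, word)."""
    matches = []
    for i in range(len(text)):
        node = trie
        for ch in text[i:]:
            if ch not in node:
                break                         # spelling cannot be completed
            node = node[ch]
            if None in node:                  # reached a terminal node
                matches.append((i + 1, node[None]))
    return matches
```

Threading appease against the trie for ape, as and ease fails at position 1 and succeeds at positions 4 (ease) and 5 (as).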

20 Threading APPEASE: i=1: no match; i=4: match (ease); i=5: match (as). [Figure: the trie for ape, as, ease with the threaded path highlighted for i=1, i=4 and i=5] But we already knew about the match at i=5, because as is a part of ease! If we take advantage of this, there will be no need to backtrack in the text, and the algorithm will run in O(n+m). The Aho-Corasick algorithm implements exactly this idea using a finite state automaton, starting with the trie and adding shortcuts.

21 SUFFIX TREES A trie that is built on every suffix of a text T (length m), and collapses all interior nodes that have a single child, is called a suffix tree. A very powerful data structure: e.g. given a suffix tree and a pattern P (length n), all k occurrences of P in T can be found in time O(n+k), i.e. independently of the size of the text (although the text size does figure into the construction cost of the tree). A suffix tree can be built in linear time, O(m).

22 BUILDING A SUFFIX TREE Example: bananas#. It is convenient to terminate the text with a special character, so that no suffix is a prefix of another suffix (as happens, e.g., in banana). This guarantees that spelling any suffix from the root will end at a leaf. Construct the suffix tree by inserting suffixes from the longest to the shortest, each in two phases. Phase 1: spell as much of the suffix from the root as possible. Phase 2: if stopped in the middle of an edge, break the edge and add a new branch, spelling the rest of the suffix along that branch. Label the leaf with the starting position of the suffix.

23 [Figure: step-by-step construction of the suffix tree for bananas#, inserting the suffixes BANANAS#, ANANAS#, NANAS#, ANAS#, NAS#, AS#, S# and # in turn; internal nodes N1-N3, leaves labeled with suffix start positions]

24 SUFFIX TREE PROPERTIES Exactly m leaves for a text of size m (counting the terminator). Each interior node has at least two children (except possibly the root); edges with the same parent spell substrings starting with different letters. The size of the tree is O(m). Can be constructed in O(m) time: this uses the observation that during construction, not every suffix has to be spelled all the way from the root (which would lead to quadratic time); suffix links can short-circuit the process. Is also memory efficient (about 5m*sizeof(long) bytes for the text can be achieved without too much difficulty). [Figure: suffix tree for bananas#, with internal nodes N1-N3 and leaves 1-8]

25 MATCHING PATTERNS USING SUFFIX TREES Consider the problem of finding the pattern an in the text bananas#. Two matches: positions 2 and 4. Thread the pattern onto the tree. Completely spelled: report the index of every leaf below the point where spelling stopped. This is because the pattern is a prefix of every suffix spelled by traversing the rest of the subtree. Incompletely spelled: no match. Runs in O(n+k) time, where n is the length of the pattern, and k is the number of matches. [Figure: suffix tree for bananas# with the path spelling an highlighted, leading to leaves 2 and 4]
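The two-phase construction and the threading search can be sketched as below. This is the naive quadratic-time construction (every suffix is spelled from the root, with no suffix links), not the linear-time algorithm; the class and function names are hypothetical, and the text is assumed to end in a unique terminator.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}   # first letter -> (edge label, child node)
        self.leaf = leaf     # 1-based suffix start for leaves, else None


def build_suffix_tree(text):
    """Naive O(m^2) construction; text must end in a unique terminator."""
    if not text or text[-1] in text[:-1]:
        raise ValueError("text must end with a unique terminator")
    root = Node()
    for start in range(len(text)):            # longest suffix first
        node, rest = root, text[start:]
        while True:
            if rest[0] not in node.children:  # Phase 2: add a new branch
                node.children[rest[0]] = (rest, Node(leaf=start + 1))
                break
            label, child = node.children[rest[0]]
            k = 0                             # Phase 1: spell along the edge
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):               # edge fully spelled, descend
                node, rest = child, rest[k:]
                continue
            mid = Node()                      # stopped mid-edge: break the edge
            mid.children[label[k]] = (label[k:], child)
            node.children[rest[0]] = (label[:k], mid)
            node, rest = mid, rest[k:]


    return root


def find(root, pattern):
    """Thread pattern onto the tree; return sorted 1-based match positions."""
    node, rest = root, pattern
    while rest:
        if rest[0] not in node.children:
            return []                         # incompletely spelled
        label, child = node.children[rest[0]]
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return []
        node, rest = child, rest[k:]
    hits, stack = [], [node]                  # report every leaf below
    while stack:
        current = stack.pop()
        if current.leaf is not None:
            hits.append(current.leaf)
        stack.extend(child for _, child in current.children.values())
    return sorted(hits)
```

Sorting the reported leaves is a convenience for readable output; it adds a k log k term to the O(n+k) search.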

26 FINDING LONGEST COMMON SUBSTRINGS USING SUFFIX TREES Given two texts, T and U, find the longest contiguous substring that is common to both texts. Can be done in O(len(T) + len(U)) time. Construct a suffix tree on T%U$. Find the deepest internal node whose children refer to suffixes starting in T and in U. E.g. T = ACGT, U = TCGA. [Figure: suffix tree for ACGT%TCGA$, with internal nodes N0 and N3-N6]

27 SHORT READ MAPPING Next generation sequencing (NGS) technologies (454, Solexa, SOLiD) generate gigabases of short ( bp) reads per run A fundamental bioinformatics task in NGS analysis is to map all the reads to a reference genome: i.e. find all the coordinates in the known genome where a given read is located ATGGTCTAGGTCCTAGTGGTC Can take a LONG time to map 15,000,000 reads to a 3 gigabase genome!

28 BURROWS-WHEELER TRANSFORM BASED MAPPERS In 1994, Burrows and Wheeler described a lossless text transformation (block sorter), which makes the text easily compressible and is the algorithmic basis of BZIP2 Surprisingly, this transform is also very useful for finding all instances of a given (short) string in a large text, while using very little memory A number of NGS read mappers now use BWT transformed reference genomes to accelerate mapping by several orders of magnitude.

29 BWT Given an input text T=t[1]...t[N], we construct N left-shift rotations of the input text, sort them lexicographically, and map the input text to the last column of the sorted rotations. E.g. input ABRACA is mapped to CARAAB. Note: sorted rotations make it very easy to find all instances of a pattern in the text (this is also the idea behind suffix arrays).
ROTATIONS | SORTED
ABRACA | AABRAC
BRACAA | ABRACA
RACAAB | ACAABR
ACAABR | BRACAA
CAABRA | CAABRA
AABRAC | RACAAB
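The transform as described can be written directly, if inefficiently, by materializing and sorting all rotations. This sketch takes O(N^2 log N) time and is meant only to mirror the definition; the function name and the returned row index (0-based position of the original text among the sorted rotations) are illustrative.

```python
def bwt(text):
    """Return (last column of the sorted rotations, row of the original text)."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    last_column = "".join(row[-1] for row in rotations)
    return last_column, rotations.index(text)
```

The row index is what the inverse transform needs in order to know where to start.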

30 WHY BOTHER? The text output by BWT tends to contain runs of the same character and to be easily compressible by arithmetic, run-length or Huffman coders. [Figure: excerpt of the sorted rotations of an English-language text, listing the final character (L) next to each sorted rotation; rotations beginning with "n " are preceded almost exclusively by a, e, i or o]

31 INVERSE BWT The beauty of BWT is that knowing only the output and the position of the sorted row that contained the original string, the input can be reconstructed in no worse than O(N log(N)) time. Step 1: reconstruct the first column of rotations (F) from the last column (L). To do so, we simply sort the characters in L. Step 2: determine the mapping of predecessor characters and recover the input character by character, starting from the last one. PREDECESSOR CHARACTERS: RIGHT SHIFT MATRIX M' (each row of M rotated right by one).
ROTATIONS (M) | SORTED STARTING WITH THE 2ND CHARACTER (M')
AABRAC | CAABRA
ABRACA | AABRAC
ACAABR | RACAAB
BRACAA | ABRACA
CAABRA | ACAABR
RACAAB | BRACAA

32
M | M'
AABRAC | CAABRA
ABRACA | AABRAC
ACAABR | RACAAB
BRACAA | ABRACA
CAABRA | ACAABR
RACAAB | BRACAA
Both M and M' contain every rotation of the input text T, i.e. they are permutations of the same set of strings. For each row i in M, the last character (L[i]) is the cyclic predecessor of the first character (F[i]) in the original text. We wish to define a transformation, Z(i), that maps the i-th row of M' to the corresponding row in M (i.e. its cyclic predecessor), using the following observations. M is sorted lexicographically, which implies that all rows of M' beginning with the same character are also sorted lexicographically, for example rows 1, 3, 4 (all begin with A). The row of the i-th occurrence of character X in the last column of M corresponds to the row of the i-th occurrence of character X in the first column of M. Z: [0,1,2,3,4,5] → [4,0,5,1,2,3]

33 Z: [0,1,2,3,4,5] → [4,0,5,1,2,3] In the original string T, the character that preceded the i-th character of the last column L (BWT output) is L[Z[i]]. INPUT: T = ABRACA; BWT(T) = L = CARAAB. For example, for R (i=2), the predecessor in T is L[Z[2]] = L[5] = B. For B (i=5), it is L[Z[5]] = L[3] = A. If we know the position of the last character of T in L, we can unwind the input by repeated application of Z. An inverse of Z can be used to generate the input string forward.
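The unwinding procedure can be sketched as follows. Z is obtained from a stable sort of the positions of L, which pairs the i-th occurrence of each character in L with its i-th occurrence in F; the function names and the `row` argument (0-based row of the original text among the sorted rotations) are assumptions for illustration.

```python
def z_map(last):
    """Z[i] = row of F holding the same character occurrence as L[i]."""
    # sorted() is stable, so equal characters keep their order of appearance
    order = sorted(range(len(last)), key=lambda i: last[i])
    z = [0] * len(last)
    for f_row, l_row in enumerate(order):
        z[l_row] = f_row
    return z


def inverse_bwt(last, row):
    """Rebuild the input from L and the row of the original text."""
    z = z_map(last)
    out = []
    i = row                      # L[row] is the last character of the input
    for _ in range(len(last)):
        out.append(last[i])      # emit characters from last to first
        i = z[i]                 # step to the cyclic predecessor
    return "".join(reversed(out))
```

For L = CARAAB and row 1 the characters come out as A, C, A, R, B, A, which reversed spell the input.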

34 Software (Open Access): Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Opportunistic Data Structures with Applications (Paolo Ferragina, Giovanni Manzini). Uses BWT and opportunistic data structures (i.e. data structures working directly on compressed data) to build a compressed index of a genome. Storage requirements for T=t[1]...t[N] are O(H_k(T)) + o(1) bits per character (for any fixed k). Searching for k occurrences of a pattern (length m) can be implemented in time O(m + k log^ε N), ε > 0.

35 HASHING VS BWT AND OPPORTUNISTIC DATA STRUCTURES
Table 1. Bowtie alignment performance versus SOAP and Maq (values for the remaining columns, namely reads mapped per hour (millions), peak virtual memory footprint (megabytes), Bowtie speed-up and reads aligned (%), are not shown)
Program | Platform | CPU time | Wall clock time
Bowtie -v 2 | Server | 15 m 7 s | 15 m 41 s
SOAP | | 91 h 57 m 35 s | 91 h 47 m 46 s
Bowtie | PC | 16 m 41 s | 17 m 57 s
Maq | | 17 h 46 m 35 s | 17 h 53 m 7 s
Bowtie | Server | 17 m 58 s | 18 m 26 s
Maq | | 32 h 56 m 53 s | 32 h 58 m 39 s
The performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6 when aligning 8.84 M reads from the 1,000 Genome project (National

36 INEXACT PATTERN MATCHING Homologous biological sequences are unlikely to match exactly; evolution drives them apart, for example through mutations. Exact algorithms (e.g. local alignments) are quadratic in time and are too slow for comparing/searching large genomic sequences. Pattern matching with errors is a fundamental problem in bioinformatics, e.g. when finding homologs in a database. Well-performing heuristics are frequently used.

37 EXAMPLE: LONGEST COMMON SUBSTRING (LCS) IN INFLUENZA A VIRUS (IAV) H5N1 HEMAGGLUTININ [Figure: length of LCS versus proportion of sequences with LCS (N=957, from 2005+)] Suffix trees can also be adapted to efficiently find the LCS shared by a given proportion of a set of sequences. The longest fully conserved nucleotide substring in viruses sampled in 2005 or later is merely 8 nucleotides long. This poses significant challenges for even straightforward tasks, such as diagnostic probe design.

38 K-DIFFERENCES MATCHING The k-mismatch problem: given a text T (length m), a pattern P (length n) and the maximum tolerable number of mismatches k, output all locations i in T where there are at most k mismatches between P and T[i:i+n-1]. The k-differences problem is a generalization in which characters can also be matched to indels (cost 1). Both can be easily solved in O(nm) time, by either brute force or dynamic programming. Landau and Vishkin (1985) propose an O(m+nk) time algorithm for the k-differences problem by combining dynamic programming with text and pattern preprocessing using suffix trees of T%P$.
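The O(nm) brute-force solution of the k-mismatch problem (substitutions only, no indels) can be sketched as follows. The function name is illustrative, positions are 1-based as on the slides, and each window is abandoned as soon as it exceeds k mismatches.

```python
def k_mismatch(pattern, text, k):
    """1-based positions where pattern matches text with at most k mismatches."""
    n, m = len(pattern), len(text)
    hits = []
    for i in range(m - n + 1):
        mismatches = 0
        for j in range(n):
            if text[i + j] != pattern[j]:
                mismatches += 1
                if mismatches > k:     # this window cannot qualify any more
                    break
        if mismatches <= k:
            hits.append(i + 1)
    return hits
```

With k = 0 this reduces to exact brute-force matching.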

39 QUERY MATCHING If the pattern is long (e.g. a new gene sequence), it may be beneficial to look for substrings of the pattern that approximately match the reference (e.g. all genes in GenBank).

40 QUERY MATCHING Approximately matching strings share some perfectly matching substrings (L-mers). Instead of searching for approximately matching strings (difficult, quadratic), search for perfectly matching substrings (easy, linear). Extend the perfect matches obtained to get longer approximate matches that are locally optimal. This is the idea behind probably the most important bioinformatics tool: the Basic Local Alignment Search Tool, BLAST (Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D.J., 1990). Three primary questions: How to select L? How to extend the seed? How to confirm that the match is biologically relevant?

41 Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Keyword: GVK
Neighborhood words (with scores): GVK 18, GAK 16, GIK 16, GGK 14, GLK 13, GNK 12, GRK 11, GEK 11, GDK 11
Neighborhood score threshold: T = 13
Extension gives a High-scoring Pair (HSP):
Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK
DN +G + IR L G+K I+ L+ E+ RG++K
Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263

42 SELECTING SEED SIZE L If strings X and Y (each of length n) match with k<n mismatches, then the longest perfect match between them has at least floor(n/(k+1)) characters. This is easy to show by the following observation: if there are k+1 bins and k objects, then at least one of the bins will be empty. Partition the strings into k+1 equal-length substrings: at least one of them will have no mismatches. In fact, the longest perfect match is expected to be quite a bit longer (at least if the mismatches are randomly distributed), e.g. about 40 for n = 100, k = 5 (the guaranteed minimum is 16).
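The pigeonhole bound can be checked numerically. In this sketch the k mismatches leave n-k matching positions spread over at most k+1 runs, so the longest run has at least ceil((n-k)/(k+1)) characters, which equals floor(n/(k+1)); both function names are hypothetical.

```python
def longest_perfect_run(x, y):
    """Length of the longest stretch where equal-length x and y agree."""
    best = run = 0
    for a, b in zip(x, y):
        run = run + 1 if a == b else 0   # a mismatch resets the run
        best = max(best, run)
    return best


def guaranteed_seed_length(n, k):
    """Pigeonhole lower bound on the longest perfect match."""
    matching = n - k                     # positions that are not mismatches
    return -(-matching // (k + 1))       # ceiling division over k+1 runs
```

Spacing five mismatches evenly along two strings of length 100 attains the bound exactly, which shows it cannot be improved.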

43 SELECTING SEED SIZE L Smaller L: seeds are easier to find, but at the cost of decreased performance and, importantly, decreased specificity: two random sequences are more likely to have a short common substring. Larger L: could miss many potential matches, leading to decreased sensitivity. By default BLAST uses L (w, word size) of 3 for protein sequences and 11 for nucleotide sequences. MEGABLAST (a faster version of BLAST for similar sequences) uses longer seeds.

44 HOW TO EXTEND THE MATCH? Simple (gapless) extension (original BLAST); gapped local alignment (blastn); greedy X-drop alignment (MEGABLAST); ... A tradeoff between speed and accuracy.

45 HOW TO SCORE MATCHES? Biological sequences are not random: some letters are more frequent than others (e.g. in HIV-1 40% of the genome is A), and some mismatches are more common than others in homologous sequences (e.g. due to selection, chemical properties of the residues, etc.) and should be weighed differently. BLAST introduces a weighting function on residues, δ(i,j), which assigns a score to a pair of residues. For nucleotides it is 5 for i=j and -4 otherwise. For proteins it is based on a large training dataset of homologous sequences (Point Accepted Mutation matrices). PAM120 is roughly equivalent to substitutions accumulated over 120 million years of evolution in an average protein. [Figure: 20 x 20 amino-acid substitution matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V, labeled HIV-WITHIN]

46 HOW TO COMPUTE SIGNIFICANCE? Before a search is done we need to decide what a good cutoff value H for a match is. It is determined by computing the probability that two random sequences will have at least one match scoring H or greater. Uses Altschul-Dembo-Karlin statistics ( )

47 STATISTICS OF SCORES Given a segment pair H between two sequences, comprised of r-character substrings T_1 and T_2, we compute the score of H as:
$s(H) = \sum_{i=1}^{r} \delta(T_1[i], T_2[i])$
We are interested in finding out how likely the maximal score for any segment pair of two random sequences is to exceed some threshold X. Dembo and Karlin (1990) showed that the mean value of the maximum segment-pair score between two random sequences (lengths n and m), assuming a few things about δ(i,j), is approximately
$M = \log(nm)/\lambda$
where λ solves
$\sum_{i,j} p_i q_j \exp(\lambda\,\delta(i,j)) = 1$
(p_i and q_j are the letter frequencies in the two sequences).

48 STATISTICS OF SCORES (CONT'D) For biological sequences, high scoring real matches should greatly exceed the random expectation, and the probability that a random match does so (x is the amount by which the score exceeds the mean) is
$\mathrm{Prob}\{S(H) > x + \text{mean}\} \le K \exp(-\lambda x)$
K and λ are expressions that depend on the scoring matrix and letter frequencies, and the distribution is similar to other extreme value distributions. One can show that the expected number of HSPs (high scoring segment pairs) exceeding the threshold S is
$E = K m n e^{-\lambda S}$

49 [Figure: distribution of scores (x-axis: Score, y-axis: Count) for random and mutated sequences, with the mean, log(mn) and the HSP marked]

50 E-VALUES Because thresholds are determined by the algorithm internally, it is better to normalize the result as follows:
BIT SCORE: $S' = \dfrac{\lambda S - \log K}{\log 2}$
E-VALUE: $E = nm\,2^{-S'}$
POISSON DISTRIBUTION FOR THE NUMBER k OF HSPS WITH SCORES ≥ S: $e^{-E} E^k / k!$
PROBABILITY OF FINDING AT LEAST ONE: $1 - e^{-E}$
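These formulas translate directly into code. In the sketch below all function names are hypothetical, logarithms are natural, and the values of λ and K in the usage note are placeholders chosen for illustration rather than parameters taken from the slides.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalized (bit) score S' from the raw score S."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(bits, m, n):
    """Expected number of HSPs scoring at least this well by chance."""
    return m * n * 2.0 ** (-bits)


def p_at_least_one(E):
    """Poisson probability of observing one or more such HSPs."""
    return -math.expm1(-E)          # 1 - exp(-E), accurate for small E
```

For example, `e_value(bit_score(50, 0.3, 0.1), 1000, 10**6)` gives the same number as evaluating K m n exp(-λS) with the raw score, since the bit score only re-expresses that formula in base 2.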

51 TIMELINE 1970: Needleman-Wunsch global alignment algorithm. 1981: Smith-Waterman local alignment algorithm. 1985: FASTA. 1990: BLAST (basic local alignment search tool). 2000s: BLAST has become too slow for genome vs. genome comparisons, so new faster algorithms evolve: BLAT, PatternHunter.

52 BLAT VS. BLAST BLAT (BLAST-Like Alignment Tool): same idea as BLAST, i.e. locate short sequence hits and extend (developed by J. Kent at UCSC). BLAT builds an index of the database and scans linearly through the query sequence, whereas BLAST builds an index of the query sequence and then scans linearly through the database. The index is stored in RAM, resulting in faster searches. Longer K-mers and greedier extensions are specifically designed for highly similar sequences (e.g. >95% nucleotide, >85% protein identity).

53 BLAT INDEXING Here is an example with k = 3:
Genome: cacaattatcacgaccgc
3-mers (non-overlapping): cac aat tat cac gac cgc
Index: aat 3; gac 12; cac 0,9; tat 6; cgc 15
cDNA (query sequence): aattctcac
3-mers (overlapping): aat att ttc tct ctc tca cac
Hits (position of 3-mer in query, genome): aat 4; cac 1,10
Clump: cacaattatcacgaccgc
Multiple instances map to a single index!
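The indexing scheme in this example can be sketched with a dictionary from k-mer to genome positions. This covers only the index and the hit lookup, not BLAT's clumping or extension stages; the function names are hypothetical, and all positions are 0-based, as in the Index row above.

```python
def build_index(genome, k):
    """Index the non-overlapping k-mers of the genome by start position."""
    index = {}
    for pos in range(0, len(genome) - k + 1, k):   # step k: no overlap
        index.setdefault(genome[pos:pos + k], []).append(pos)
    return index


def find_hits(query, index, k):
    """Look up every overlapping k-mer of the query in the index."""
    hits = []
    for q in range(len(query) - k + 1):            # step 1: overlapping
        kmer = query[q:q + k]
        for g in index.get(kmer, []):
            hits.append((kmer, q, g))              # (k-mer, query pos, genome pos)
    return hits
```

Indexing only non-overlapping genome k-mers keeps the index roughly k times smaller, while scanning overlapping query k-mers ensures that a shared k-mer is still found whatever its offset in the query.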
