Parsing and Pattern Recognition

Topics in IT 1
Parsing and Pattern Recognition
Week 02: String searching and finite-choice languages
College of Information Science and Engineering, Ritsumeikan University

this week
- string comparison
    - brute force
    - using a hash function
- string searching
    - brute force
    - Boyer-Moore-Horspool algorithm
    - state machine
- recognising finite-choice languages
    - table lookup
    - binary search
    - state machine

last week's topics
- applications of pattern matching and parsing
- the parts of language:
    - words and vocabulary: lexemes, lexicons
    - sequences of words: sentences
    - systems of sentences: grammars
- the structure of grammars

string comparison: brute force
- compare strings, character by character
- if all are the same, the strings are equal

    F O O T             F O O T
    F O O T             F O L D
    "FOOT" = "FOOT"     "FOOT" != "FOLD"

    int string_compare(char *s1, char *s2)
    {
        while (*s1)                     // not at end of s1
            if (*s1 != *s2) return 0;   // characters differ
            else ++s1, ++s2;            // move to next character
        return *s2 == 0;                // at end of s1, check s2 has ended too
    }

string comparison: brute force
- almost no extra cost to compute the order of the strings
- a negative, zero, or positive result means the first string is less than, equal to, or greater than the second string

    F O O T \000        F O O T \000
    F O O T \000        F O L D \000
    \0 - \0 = 0         O - L = 3
    "FOOT" = "FOOT"     "FOOT" > "FOLD"

    int strcmp(char *s1, char *s2)
    {
        while (*s1 && *s1 == *s2)   // s1 not ended && s2 still matches
            ++s1, ++s2;             // advance to next pair of characters
        return *s1 - *s2;           // difference of first non-matching characters
    }

string comparison: hashing
- a hash is a small number calculated from some larger data
    - the hash characterises the data
    - equal data always has the same hash value
    - e.g.: parity bit, checksum (ISBN, student ID), MD5 signature, ...
- if we create two hashes from a few characters of two strings
    - if the hashes are different, the strings must be different
    - if the hashes are the same, the strings might still be different

    int my_hash(char *s)    // hash made from first, middle, and last character
    {
        int len  = (int)strlen(s);
        int last = len > 0 ? len - 1 : 0;   // index of last character (0 if s is empty)
        int hash = (s[0] << 16) + (s[last / 2] << 8) + s[last];
        return hash;
    }

    int compare(char *s1, char *s2)
    {
        if (my_hash(s1) != my_hash(s2)) return 0;   // different hash: strings must be different
        return !strcmp(s1, s2);                     // same hash: strings might be different
    }
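A minimal sketch of why equal hashes still require a full comparison. The names hash3 and hashed_equal are hypothetical; hash3 follows the same first/middle/last idea as my_hash above, with casts to unsigned char added as an assumption so that characters outside 7-bit ASCII cannot produce negative terms.

```c
#include <assert.h>
#include <string.h>

/* Hash built from the first, middle, and last characters of s. */
int hash3(const char *s)
{
    assert(s != NULL);
    size_t len  = strlen(s);
    size_t last = len > 0 ? len - 1 : 0;    /* 0 for an empty string */
    return ((unsigned char)s[0] << 16)
         + ((unsigned char)s[last / 2] << 8)
         +  (unsigned char)s[last];
}

/* Returns 1 if the strings are equal, 0 otherwise.
 * The hash is only a cheap pre-check; strcmp has the final word. */
int hashed_equal(const char *s1, const char *s2)
{
    if (hash3(s1) != hash3(s2)) return 0;   /* different hash: must differ */
    return strcmp(s1, s2) == 0;             /* same hash: must still check */
}
```

With this hash, "FOOT" and "FONT" collide (both reduce to F, O, T), so only the final strcmp tells them apart, whereas "FOOT" and "FOLD" are rejected by the hash alone.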

string comparison: perfect hashing
- if all possible strings to be compared are known in advance, then a perfect hash function can be constructed automatically
- e.g., comparing only the strings "cat", "bet", and "bob"
    - notice that the middle letter is different in all the strings
    - the middle letter itself can be used as a perfect hash value

    /* For a given set of strings we can implement a simplest-possible hash
     * function that guarantees a different result for each string in the set. */
    int my_hash(char *s)    // require: s is one of "cat", "bet", or "bob"
    {
        return s[1];        // middle letter a, e, or o uniquely characterises s
    }

    int string_compare(char *s1, char *s2)
    {
        return my_hash(s1) == my_hash(s2);  // equal perfect hash => equal strings
    }

string searching: brute force
- compare target string with contents of sliding window
- if they match, we have found the target string; otherwise slide the window one character to the right, and repeat

    text to be searched:        M E A S U R E M E N T S
    target string to be found:  M E N

    moving window over the text to be searched (one row per window position):

    M E A S U R E M E N T S
    M E N
      M E N
        M E N
          M E N
            M E N
              M E N
                M E N
                  M E N         found at index 7

- each row compares the target string with the part of the text that is currently within the bounds of the sliding window
- it took 12 comparisons to find "men" in "measurements"

string searching: brute force

    /* Search for the target string within the given text.
     * Return the index of the match, or -1 if no match is found. */
    int string_search(char *text, char *target)
    {
        int target_len   = (int)strlen(target);
        int last_win_pos = (int)strlen(text) - target_len;
        if (target_len < 1) return -1;      // empty target: nothing to search for
        for (int win_pos = 0; win_pos <= last_win_pos; ++win_pos, ++text) {
            for (int offset = 0; text[offset] == target[offset]; ++offset) {
                if (offset == target_len - 1)
                    return win_pos;         // target string found at win_pos
            }
        }
        return -1;                          // target was not found in text
    }
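The count of 12 comparisons quoted for finding "men" in "measurements" can be checked with an instrumented variant of the brute-force search. This is only a sketch: the function name counted_search and its count parameter are hypothetical additions, and the loops are restructured so that every character comparison is counted exactly once.

```c
#include <assert.h>
#include <string.h>

/* Brute-force search that also counts character comparisons.
 * Returns the index of the first match, or -1 if there is none;
 * *count receives the number of character comparisons performed. */
int counted_search(const char *text, const char *target, int *count)
{
    assert(text && target && count);
    int text_len   = (int)strlen(text);
    int target_len = (int)strlen(target);
    *count = 0;
    if (target_len == 0) return -1;             /* nothing to search for */
    for (int pos = 0; pos + target_len <= text_len; ++pos) {
        int offset = 0;
        while (offset < target_len) {
            ++*count;                           /* one character comparison */
            if (text[pos + offset] != target[offset]) break;
            ++offset;
        }
        if (offset == target_len) return pos;   /* whole window matched */
    }
    return -1;
}
```

Calling counted_search("measurements", "men", &n) returns 7 and leaves n at 12: three comparisons at window position 0, one at each of positions 1 to 6, and three at position 7.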

string searching: Boyer-Moore-Horspool
- problem with brute-force search:
    - almost always fails when matching first character in window
    - no information available about next character in the window
    - must move window one character to the right

    M E A S U R E M E N T S
          M E N                 +1
            M E N

string searching: Boyer-Moore-Horspool
- Horspool algorithm compares the target string and window contents backwards, starting with the last character
- if window does not match target: try to move the window as far to the right as possible
    - the last character in the window is used to decide how far we can move the window
- possibility 1: last character in window does not occur in target
    - move window right by the entire length of the target

    M E A S U R E M E N T S
          M E N                 +3
                M E N

string searching: Boyer-Moore-Horspool
- possibility 2: last character in window occurs once in target
    - move the window so the character appears in the correct position

    M E A S U R E M E N T S
                M E N           +1
                  M E N

    M E A S U R E M E N T S
              M E N             +2
                  M E N

string searching: Boyer-Moore-Horspool
- possibility 3: last character occurs more than once in target
    - move so the last occurrence appears in the correct position

    target:   C E M E N T
    index:    0 1 2 3 4 5

    [diagrams: the target CEMENT sliding over the text M E A S U R E M E N T S, with example moves of +2 and +3]

string searching: Boyer-Moore-Horspool
- possibility 4: strings differ, last character occurs only at end of target
    - move window right by the length of the target
    - (draw your own diagram if you cannot see why!)
- algorithm: build a table that maps any character to the amount to move right
    - characters not in the target move the window by the target length
    - if the last target character is not repeated, it moves by the target length
    - other characters in the target move themselves to the end of the window
    - for repeated characters, use the rightmost
- when searching for CEMENT or MEN, our tables look like:

    character:   C   M   E   N   T   others
    move by:     5   3   2   1   6   6

    character:   M   E   N   others
    move by:     2   1   3   3
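The table-building rules above can be sketched as a small function. The name build_move_table and the constant ALPHABET are hypothetical; the sketch assumes 8-bit characters, and indexes the table through unsigned char so that a character with a negative value cannot index outside the array.

```c
#include <assert.h>
#include <string.h>

enum { ALPHABET = 256 };    /* assumption: 8-bit characters */

/* Fill move[] with the distance to slide the window for each possible
 * character found at the end of the window. */
void build_move_table(const char *target, int move[ALPHABET])
{
    assert(target && move);
    int target_len = (int)strlen(target);

    /* characters not in the target move the window by the target length */
    for (int c = 0; c < ALPHABET; ++c)
        move[c] = target_len;

    /* every target character except the last moves itself to the end of
     * the window; later (rightmost) occurrences overwrite earlier ones */
    for (int i = 0; i + 1 < target_len; ++i)
        move[(unsigned char)target[i]] = target_len - 1 - i;
}
```

For "CEMENT" this yields C=5, M=3, E=2, N=1, and 6 for T and all other characters; for "MEN" it yields M=2, E=1, and 3 for N and all other characters, matching the two tables above.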

string searching: Boyer-Moore-Horspool
(Boyer-Moore-) Horspool algorithm

    text to be searched:        M E A S U R E M E N T S
    target string to be found:  M E N

    move[] =    M   E   N   ?
                2   1   3   3

    moving window over the text to be searched (compare target with window backwards):

    M E A S U R E M E N T S
    M E N                       move['A'] = 3   +3
          M E N                 move['R'] = 3   +3
                M E N           move['E'] = 1   +1
                  M E N         found at index 7

- 6 comparisons to find "men" in "measurements", but we have to construct the move[] array for each specific target string
- (practice on a few target strings until you find it easy)

string searching: Boyer-Moore-Horspool

    int string_search(char *text, char *target)
    {
        int text_len = strlen(text), target_len = strlen(target);
        if (text_len < 1 || target_len < 1) return -1;  // empty string
        int target_last = target_len - 1;   // index of last character in target
        int window_pos  = 0;                // current position of window in text
        int move[256];                      // amount to move window right
        // default: all characters move the window right by the target length
        for (int c = 0; c < 256; ++c) move[c] = target_len;
        // for characters appearing in target, move window right to align them with end of window
        for (int index = 0; index < target_last; ++index)
            move[(unsigned char)target[index]] = target_last - index;
        // search for the target in text
        while (text_len >= target_len) {    // not at end of text
            for (int index = target_last; text[index] == target[index]; --index)
                if (index == 0) return window_pos;  // success if target matches window
            int n = move[(unsigned char)text[target_last]]; // amount to move window right
            window_pos += n;                // remember new position of window
            text       += n;                // move text (start of window) right
            text_len   -= n;                // text has shrunk by the same amount
        }
        return -1;                          // failure: target not found in text
    }
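To check the figure of 6 comparisons from the walk-through, the search can be instrumented in the same spirit. The sketch below is an assumption-laden variant, not the slide's code: the name horspool_counted and the count parameter are hypothetical, and it indexes the text by window position instead of advancing the text pointer.

```c
#include <assert.h>
#include <string.h>

/* Horspool search that also counts character comparisons.
 * Returns the index of the first match, or -1 if there is none. */
int horspool_counted(const char *text, const char *target, int *count)
{
    assert(text && target && count);
    int text_len   = (int)strlen(text);
    int target_len = (int)strlen(target);
    *count = 0;
    if (text_len < 1 || target_len < 1) return -1;

    int last = target_len - 1;              /* index of last target character */
    int move[256];
    for (int c = 0; c < 256; ++c) move[c] = target_len;
    for (int i = 0; i < last; ++i)
        move[(unsigned char)target[i]] = last - i;

    /* after each failed window, slide by the move for its last character */
    for (int pos = 0; pos + target_len <= text_len;
         pos += move[(unsigned char)text[pos + last]]) {
        for (int i = last; ; --i) {         /* compare backwards */
            ++*count;
            if (text[pos + i] != target[i]) break;
            if (i == 0) return pos;         /* every character matched */
        }
    }
    return -1;
}
```

For "measurements" and "men" the windows at positions 0, 3, and 6 each fail after one comparison, and the window at position 7 matches after three, giving 6 comparisons in total.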

string searching: Boyer-Moore-Horspool
- Horspool works well for large alphabets and large target lengths
    - e.g., phrases in natural languages

string searching: state machine
- use successive characters from input to drive a state machine
- approximately:

    state 0 --M--> state 1 --E--> state 2 --N--> state 3 (success: target found)
    states 0, 1, and 2 each also have an "any other character" transition

- begin in state 0, then...
    - look at the next input character, follow the arrow that matches
    - if you reach state 3, stop and succeed
    - if you run out of input, stop and fail
- why "approximately"? (hint: try searching for "aba" with input "aaba")
- later we see how to construct state machines properly, matching flexible patterns
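One way to explore the hint is to run the approximate machine in code. The sketch below assumes that every "any other character" transition leads back to state 0, which is an interpretation of the diagram; the function name naive_machine_search is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Approximate state-machine search: state k means "the last k input
 * characters matched the first k characters of target".
 * Assumption: any non-matching character resets the machine to state 0.
 * Returns 1 if the final state is reached, 0 if the input runs out. */
int naive_machine_search(const char *text, const char *target)
{
    assert(text && target);
    size_t final_state = strlen(target);
    size_t state = 0;
    if (final_state == 0) return 0;
    for (; *text; ++text) {
        if (*text == target[state]) ++state;    /* follow matching arrow */
        else state = 0;                         /* any other character   */
        if (state == final_state) return 1;     /* success: target found */
    }
    return 0;                                   /* ran out of input      */
}
```

Under that assumption the machine finds "men" in "measurements" but fails to find "aba" in "aaba": the second "a" resets the machine to state 0 and is discarded, even though it is the start of the real match.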

string comparison and grammars
- let's write string comparison as a grammar

    S → hello

- this language has only one valid sentence, "hello"
- recognising whether or not a string belongs to this language is easy:
    - compare the input string to the one valid sentence
    - succeed if the string matches it

    // recognise production rule S
    int recognise_s(char *s)
    {
        if (!strcmp(s, "hello")) return 1;  // succeed
        return 0;                           // fail
    }

string comparison and grammars
- that's boring, so let's recognise a more interesting language

    S → hello
    S → goodbye

  or...

    S → hello
      | goodbye

  or...

    S → hello | goodbye

- this language has two valid sentences, "hello" and "goodbye"
- recognising whether a string belongs to this language is also easy:
    - compare the input string to all valid sentences
    - succeed if the string matches one of them

    // recognise production rule S
    int recognise_s(char *s)
    {
        if (!strcmp(s, "hello"))   return 1;    // succeed
        if (!strcmp(s, "goodbye")) return 1;    // succeed
        return 0;                               // fail
    }

string comparison and grammars
- a language that consists of fixed strings taken from a finite set of choices is called a finite-choice language
- a grammar that describes a finite-choice language is called a finite-choice grammar
- are they useful?
    - nouns in a natural language: cat, dog, totoro, pikachu, miffy
    - reserved words in a programming language: class, public

finite-choice grammars
- many complex languages have a subset that is finite choice
- e.g., in the C programming language...

    keyword → auto | break | case | char | const | continue | default | do
            | double | else | enum | extern | float | for | goto | if
            | int | long | register | return | short | signed | sizeof | static
            | struct | switch | typedef | union | unsigned | void | volatile | while

- if we can recognise this FC language very quickly, we can
    - treat all identifiers as if they were variable names
    - recognise identifiers, using this FC grammar, to detect keywords
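A minimal sketch of detecting keywords among identifiers, assuming the 32 keywords listed above are stored in sorted order and looked up with the standard library's bsearch. The names keywords, compare_words, and is_keyword are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* The keywords of the FC grammar above; bsearch requires them sorted. */
static const char *const keywords[] = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if",
    "int", "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile", "while"
};

/* bsearch callback: key is the identifier, element points into keywords[]. */
static int compare_words(const void *key, const void *element)
{
    return strcmp((const char *)key, *(const char *const *)element);
}

/* Returns 1 if identifier is one of the keywords, 0 otherwise. */
int is_keyword(const char *identifier)
{
    assert(identifier);
    return bsearch(identifier, keywords,
                   sizeof keywords / sizeof keywords[0],
                   sizeof keywords[0], compare_words) != NULL;
}
```

A scanner could first recognise any identifier-shaped word and then call is_keyword on it, so the rest of the recogniser never needs a separate rule for each keyword.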

recognising sentences of FC languages
- consider a slightly smaller FC language

    S → one | two | three | four | five | six | seven | eight | nine | ten

- this language has ten valid sentences: one, two, three, four, five, six, seven, eight, nine, ten
- to recognise a valid sentence, just detect one of these strings

FC parsing: brute force
- brute force method: ten string comparisons

    // recognise the production S; return the number of
    // the rule that matched, or -1 if no rule matches
    int recognise_s(char *sentence)
    {
        if (!strcmp(sentence, "one"  )) return 0;
        if (!strcmp(sentence, "two"  )) return 1;
        if (!strcmp(sentence, "three")) return 2;
        if (!strcmp(sentence, "four" )) return 3;
        if (!strcmp(sentence, "five" )) return 4;
        if (!strcmp(sentence, "six"  )) return 5;
        if (!strcmp(sentence, "seven")) return 6;
        if (!strcmp(sentence, "eight")) return 7;
        if (!strcmp(sentence, "nine" )) return 8;
        if (!strcmp(sentence, "ten"  )) return 9;
        return -1;  // fail
    }

- how tedious (and inefficient, for large FC grammars)!

FC parsing: linear search
- linear search of a table
    - scales better (to hundreds of choices)
    - smaller code (and probably faster)

    enum { NWORDS = 10 };

    char *words[NWORDS] = {
        "one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine", "ten"
    };

    int recognise_s(char *sentence)
    {
        for (int i = 0; i < NWORDS; ++i)
            if (!strcmp(sentence, words[i]))
                return i;
        return -1;
    }

FC parsing: binary search
- since all sentences are known in advance, we can
    - sort them alphabetically
    - perform a binary search

    index:     0        1       2       3       4      5        6      7      8        9
    words[]:   "eight"  "five"  "four"  "nine"  "one"  "seven"  "six"  "ten"  "three"  "two"

    searching for "seven":

    iteration   window (lower..upper)   mid   strcmp("seven", words[mid])   next step
    1           0..9                    4     > 0                           lower = mid + 1
    2           5..9                    7     < 0                           upper = mid - 1
    3           5..6                    5     = 0                           found at mid

FC parsing: binary search

    char *words[NWORDS] = {
        "eight", "five", "four", "nine", "one",
        "seven", "six", "ten", "three", "two"
    };

    int recognise_s(char *sentence)
    {
        int lower = 0, upper = NWORDS - 1;
        while (lower <= upper) {
            int mid = (lower + upper) / 2;
            int cmp = strcmp(sentence, words[mid]);
            if      (cmp < 0) upper = mid - 1;  // not in [ mid...upper ]
            else if (cmp > 0) lower = mid + 1;  // not in [ lower...mid ]
            else              return mid;       // sentence found
        }
        return -1;                              // sentence not recognised
    }

- rule numbering is slightly different (alphabetical order)
    - trivial to fix with a corresponding table of rule numbers
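The "corresponding table of rule numbers" might look like the following sketch, which assumes the original numbering from the brute-force version (one = 0 through ten = 9). The names sorted_words, rule_numbers, and recognise_rule are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { NWORDS = 10 };

/* Sentences in alphabetical order, as required by binary search. */
static const char *const sorted_words[NWORDS] = {
    "eight", "five", "four", "nine", "one",
    "seven", "six", "ten", "three", "two"
};

/* rule_numbers[i] is the original rule number of sorted_words[i]. */
static const int rule_numbers[NWORDS] = { 7, 4, 3, 8, 0, 6, 5, 9, 2, 1 };

/* Returns the original rule number (0 for "one" ... 9 for "ten"),
 * or -1 if the sentence is not in the language. */
int recognise_rule(const char *sentence)
{
    assert(sentence);
    int lower = 0, upper = NWORDS - 1;
    while (lower <= upper) {
        int mid = lower + (upper - lower) / 2;
        int cmp = strcmp(sentence, sorted_words[mid]);
        if      (cmp < 0) upper = mid - 1;
        else if (cmp > 0) lower = mid + 1;
        else              return rule_numbers[mid];  /* translate index */
    }
    return -1;
}
```

Keeping the two arrays side by side means the sorted order is purely an implementation detail: callers see the same rule numbers as with the brute-force and linear-search versions.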

FC parsing: state machine
- for illustration, consider a smaller grammar:

    S → bet | bob | cat

    state 0 --b--> state 1 --e--> state 2 --t--> state 3 (success: target found)
                   state 1 --o--> state 4 --b--> state 3
    state 0 --c--> state 5 --a--> state 6 --t--> state 3
    states 0, 1, 2, 4, 5, and 6 each also have an "any other character" transition

- (we see later how to implement this kind of matching very efficiently)
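The machine for bet, bob, and cat can be sketched directly as a transition function. This sketch makes two assumptions that go beyond the diagram: it recognises a whole sentence (it does not search within a longer text), and it treats every "any other character" transition as a rejection. The names next_state and recognise_fc are hypothetical; the state numbers follow the diagram.

```c
#include <assert.h>

enum { FAIL = -1, SUCCESS = 3 };

/* Transition function for S -> bet | bob | cat. */
static int next_state(int state, char c)
{
    switch (state) {
    case 0:  return c == 'b' ? 1 : c == 'c' ? 5 : FAIL;
    case 1:  return c == 'e' ? 2 : c == 'o' ? 4 : FAIL;
    case 2:  return c == 't' ? SUCCESS : FAIL;      /* be-t */
    case 4:  return c == 'b' ? SUCCESS : FAIL;      /* bo-b */
    case 5:  return c == 'a' ? 6 : FAIL;            /* c-a  */
    case 6:  return c == 't' ? SUCCESS : FAIL;      /* ca-t */
    default: return FAIL;   /* state 3 has no outgoing transitions */
    }
}

/* Returns 1 if sentence is exactly "bet", "bob", or "cat", else 0. */
int recognise_fc(const char *sentence)
{
    assert(sentence);
    int state = 0;
    for (; *sentence; ++sentence) {
        state = next_state(state, *sentence);
        if (state == FAIL) return 0;
    }
    return state == SUCCESS;    /* input must end in the final state */
}
```

Each input character is examined exactly once, and the shared prefix "b" of bet and bob is matched only once, which is the advantage over comparing against each sentence in turn.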

summary
- simple techniques and algorithms exist to improve the efficiency of
    - string comparisons
    - string searching (within larger text)
    - string searching (within table of strings)
- Horspool search algorithm is a good choice
    - simple to understand and implement
    - good for large alphabet and large target string (natural language)
    - better algorithms exist for special cases (but more complex)
- parsing is the opposite of generating sentences from a grammar
    - parsing: given a sentence and a grammar, how do we make the sentence from the start rule?
- parsing a language with a finite-choice grammar is just string search
- state machines can be used for matching and searching

homework
- review these slides
- practice making tables for Horspool string searching
- preview the slides for the next class
- become familiar with the notation used for grammars
- download and read the first two handouts
    - reading-2.1-2.2.pdf
        - Section 2.1.4 (how grammars are constructed)
        - Section 2.2.1 (why grammars describe entire languages)
        - (the rest is optional, but good background material if you are interested)
    - reading-2.3.pdf
        - Section 2.3 (the five types of grammar)

glossary

brute force
    solving a problem by using a simple, obvious, direct approach. Much more efficient solutions may exist that are not obvious or simple.
comparing
    determining if two sets of data have the same contents, e.g., two strings that contain the same characters in the same order will compare as equal.
finite
    having a limited, countable number of elements.
hash
    a value that characterises, and is computed from, data. Usually numeric, and much smaller than the original data, making the hash useful in comparison, classification, and verification of data. For a given hash function, the same data should always produce the same hash value. When two hash values are different, we can be certain that the data they represent is different; when two hash values are the same, the data they represent may or may not be the same.

hash function
    a function that computes a hash value from a set of data.
order
    the relative position of two data sets according to some classification scheme. Two numbers (including integers representing characters) can be ordered by their magnitude. Two strings can be ordered according to the order of the first character that differs between them (corresponding to dictionary order for English words).
perfect hash function
    a hash function that is designed with prior knowledge of every possible input that it might encounter. Since the possible inputs to the function are known in advance, we arrange for the function to produce a different hash value for each possible input. This makes it possible to compare two (potentially large) data sets by computing their hash values directly, which will be the same only if the two data sets have the same contents.

searching
    finding the position of a target set of data within a collection of sets of data. It can be accomplished by comparing the target with each set of data in the collection successively until the comparison succeeds.
sliding window
    (in data analysis) a window that moves for each iteration of an algorithm. For example, when searching for a string in some text, a window (of the same size as the target string) moves to a new position in the text each time a comparison is made between the target string and the portion of the text visible in the window. The term "sliding" implies that the movement is monotonic (in a single direction) and overlapping (moving a short distance relative to the size of the window).

state machine
    a way to model (or implement) a process as a software "machine" in which there are several distinct states and explicit transitions between them. The model is in only one of the states at a given time. Progress is made by following a transition out of the current state into another, when the next input data item (or some other external stimulus) is received. The current state combined with some characteristic of the data item determines which transition should be followed, and hence what state the machine will be in next. In text searching applications, the input data are successive characters of the text being searched and the machine states represent the progress that has been made towards recognising the target string that is being searched for.

window
    (in data analysis) a small (usually contiguous) subset of data taken from a larger set of data to which an algorithm or process is applied. The algorithm can only "see" the subset of data that is currently revealed by the window. Windows can be fixed, or they can move for successive iterations of the algorithm. If they move then successive windows can overlap or be non-overlapping. For example, when searching for a target string in some larger text, a window on the text reveals a sub-string having the same length as the target string. A comparison can be made directly between the target string and the text visible in the window. If the comparison fails the window is moved and the process repeats until the comparison succeeds or the entire text has been considered.