Implementing Dynamic Minimal-prefix Tries

SOFTWARE PRACTICE AND EXPERIENCE, VOL. 21(10), 1027-1040 (OCTOBER 1991)

JOHN A. DUNDAS III
Jet Propulsion Laboratory, California Institute of Technology, Mail Stop 510-202, 4800 Oak Grove Drive, Pasadena, CA 91109, U.S.A.

SUMMARY

A modified trie-searching algorithm and corresponding data structure are introduced which permit rapid search of a dictionary for a symbol or a valid abbreviation. The dictionary-insertion algorithm automatically determines disambiguation points, where possible, for each symbol. The search operation will classify a symbol as one of the following: unknown (i.e. not a valid symbol), ambiguous (i.e. a prefix of more than one valid symbol) or known. The search operation is performed in linear time proportional to the length of the input symbol, rather than the complexity of the trie. An example implementation is given in the C programming language.

KEY WORDS: Trees, Searching, Pattern matching, Dictionary

INTRODUCTION

Recognizing keywords (symbols) from a dictionary (symbol table) is frequently needed by interactive and command-driven programs. Such programs typically require that the keyword search facility recognize valid abbreviations for each of the keywords. Traditional dictionary look-up methods, such as hashing or binary search trees, are inadequate because they do not generally allow the search keys to be abbreviated. When the keyword list changes frequently or dynamically, it would be useful for the facility to conceal the abbreviation characteristics of each keyword from the user (i.e. valid abbreviation points need not be explicitly specified).

The remainder of this paper details the author's attempt at creating a trie structure with the following capabilities and performance. During construction of the dictionary, disambiguation points for each of the symbols need not be explicitly specified and the dictionary must be able to be updated dynamically.
During dictionary search, character comparisons should be kept to a minimum, preferably no more than one examination of each input character, and symbols which are valid unambiguous abbreviations must be accepted.

The technique illustrated in this paper, based on the digital search tries of Knuth (Reference 1), will be referred to as the keyword trie to distinguish it from other trie structures.

0038-0644/91/101027-14$07.00 © 1991 by John Wiley & Sons, Ltd.
Received 31 August 1990. Revised 2 April 1991.

PATTERN MATCHING DATA STRUCTURE

The most fundamental decision was to represent the dictionary space as a trie so that quick multiway branching could be easily accommodated. Secondly, each node in the trie contains one or more characters of a symbol, rather than a single node for each character of every symbol. The number of characters contained in a node depends upon the other symbols in the trie, and the worst case degrades to something resembling Knuth's implementation of one character per node.

Nodes in the keyword trie fall into two categories: exit nodes and interior nodes. Each node is required to maintain some indication of which category it represents. The distinguishing characteristic between these two categories is that, whereas exit nodes may be either terminal or non-terminal nodes, symbol search operations must terminate within an exit node to be successful. Search operations which terminate in an interior node must always fail.

Exit nodes are further divided into those which have no children and those which do (i.e. terminal and non-terminal). If a search operation ends in an exit node with children, all of the characters in the node must match exactly and entirely.* If a search operation ends in an exit node without children, the input symbol is not required to match entirely but must still match exactly. In fact, it is necessary to match only the first character of the node. This rule allows the algorithm to match abbreviations of symbols in the trie. Examples will help to clarify the situation.

To illustrate the data structure and the various types of nodes, we will construct a keyword trie given the symbols {he, she, his, hers}. In the following diagrams, exit nodes are represented by boxes and interior nodes are represented by ovals. Inserting the first symbol he into the trie, we obtain Figure 1.

Figure 1.

The node associated with he is considered an exit node.
Since it has no children, a successful match will be generated by either of the strings h or he. Inserting the second symbol she, we obtain Figure 2.

Figure 2.

* The reason for this requirement will be explained in the next section.

Up to this point, the trie looks similar to a digital trie, although one node of the keyword trie corresponds to one or more nodes of the digital trie. Inserting the third symbol his, however, gives the trie in Figure 3.

Figure 3.

At this point the digital nature of the trie is beginning to become apparent, and all terminal nodes represent complete recognition of keywords. Adding the final symbol hers changes this, though (see Figure 4). Upon inspection, the following statements can be made:

1. The symbol she could be unambiguously represented by {s, sh, she}.
2. The symbol his could be unambiguously represented by {hi, his}.
3. The symbol hers could be unambiguously represented by {her, hers}.
4. The symbol he cannot be abbreviated at all and therefore can only be recognized in its complete form {he}.

The trie shown in Figure 4 illustrates each of the different types of nodes. The node labelled h is an interior node. The node labelled e is an exit node with children. The remaining nodes are all exit nodes without children.

Figure 4.
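As a minimal sketch of the node categories described above, the following C fragment builds the trie of Figure 4 by hand and classifies its nodes. The names knode, is_exit, is_terminal and figure4_root are hypothetical and chosen only for this illustration. The convention that a non-zero value marks an exit node, and the values 1 to 4, are borrowed from the sample implementation in the Appendix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout for illustration; not the layout of Listing 1. */
struct knode {
    struct knode *next;   /* next sibling, or NULL */
    struct knode *child;  /* first child, or NULL */
    long value;           /* non-zero marks an exit node (assumed convention) */
    const char *str;      /* symbol fragment held by this node */
};

/* Exit node: a search is allowed to end here successfully. */
int is_exit(const struct knode *n)
{
    assert(n != NULL);
    return n->value != 0;
}

/* Terminal node: no children, so any prefix of the fragment is accepted. */
int is_terminal(const struct knode *n)
{
    assert(n != NULL);
    return n->child == NULL;
}

/* The trie of Figure 4 for {he, she, his, hers}, built by hand. */
static struct knode n_rs  = { NULL,   NULL,  4, "rs"  };
static struct knode n_is  = { NULL,   NULL,  3, "is"  };
static struct knode n_e   = { &n_is,  &n_rs, 1, "e"   };
static struct knode n_she = { NULL,   NULL,  2, "she" };
static struct knode n_h   = { &n_she, &n_e,  0, "h"   };

/* Returns the first node attached to the (implicit) root. */
struct knode *figure4_root(void)
{
    return &n_h;
}
```

With this layout, h is neither an exit node nor a terminal node, e is an exit node with children, and rs, is and she are exit nodes without children.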

AMBIGUOUS PATTERN RESOLUTION

Having built the trie shown in Figure 4, we need a mechanism for disambiguating patterns which are suspected members of the trie. The following two rules suffice in this process.

1. If an input pattern matches through any prefix of an exit node that does not have any children, the symbol is considered to be successfully matched.
2. If an input pattern matches completely through an exit node that does have children, the symbol is considered to be successfully matched.

Some examples may help to clarify the specified conditions. Given the trie constructed in Figure 4, if a pattern starts with any character not in {h, s}, the pattern must be rejected for not being a member of the trie.

In the simple case of (possibly abbreviating) she, any of {s, sh, she} will be accepted as valid per Rule 1. However, only this set is a valid specification of she. If there are additional or incorrect characters (e.g. {so, shy, sheep}), the pattern is rejected. Other input patterns such as {her, hers} and {hi, his} will also be recognized by using Rule 1.

Rule 2 applies for the input pattern {he}. Since this pattern is itself a prefix of another valid pattern, it must be completely specified to be recognized. There is no valid abbreviation for he. If the input pattern contains only h, it must be rejected as being an ambiguous prefix of other valid symbols.

TRIE CONSTRUCTION

Constructing the search trie is reasonably straightforward. One pass is made over each symbol to be inserted into the trie. Nodes are either added or split as necessary. At most one split and/or one node addition will be required for each symbol insertion. The construction algorithm is specified as follows:

Step 1. If there are no nodes in the trie, other than the root, create a node containing the entire symbol and attach it to the root. This node is an exit node. The insertion operation is complete.
Step 2. Search down the trie one character at a time through the input symbol, stopping on one of the following: a character mismatch, exhausting the characters in the input symbol or exhausting nodes in the trie. (The exact details of the search operation will be specified in the next section.)
Step 3. If a mismatch occurred in Step 2, create a new node containing the portion of the string of the original node that mismatched. Shorten the string in the original node to include only the matched characters. The new node should be made a child of the original node and inherit all of its children, together with its exit status. The original node is now a non-exit node.
Step 3(a). Create a new node containing the mismatched portion of the input symbol. This node should be made a child of the original node. This node is an exit node. The insertion operation is complete.
Step 4. If the characters in the input symbol are exhausted, the characters in the current node are exhausted and the current node is an exit node, this is a duplicate symbol insertion. No operation needs to be performed. The insertion operation is complete.

Step 4(a). If the characters in the input symbol are exhausted, the characters in the current node are exhausted and the current node is not an exit node, then make the current node an exit node. The insertion operation is complete.
Step 4(b). If the characters in the input symbol are exhausted and the characters in the current node are not exhausted, create a new node containing the remaining characters of the string within the original node. Shorten the string in the original node to include only the matched characters. The new node should be made a child of the original node and inherit all of its children, together with its exit status. Make the original node an exit node. The insertion operation is complete.
Step 5. If the nodes in the trie have been exhausted before the characters in the input pattern, create a new node containing the remaining characters which did not match. Make this new node a child of the node on which the search ended. This new node is an exit node. The insertion operation is complete. (Note that Step 1 is actually a special case of this step.)

The figures shown previously illustrate most of these steps. Some clarification is still needed, though. Step 4(a) can be illustrated with the trie shown in Figures 5(a) and 5(b). The illustration shows the operations required to insert the symbol he into a trie containing the symbols {hers, head}. Note that no node is split and no new nodes are created. The interior node he is simply converted to an exit node with children.

Step 4(b) can be illustrated through the example in Figure 6. Figure 6 shows the operations required to insert the symbol he into a trie containing the symbols {she, hers}. Note that only a split operation is needed to transform Figure 6(a) into Figure 6(b) and that both nodes now represent exit nodes.

At all times, the keyword algorithm maintains a valid keyword search trie. Additionally, we can state that the order of insertion for symbols is irrelevant.
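The split performed in Steps 3 and 4(b) might be sketched as follows. This is an illustration under stated assumptions, not the routine of Listing 1: the names knode, knode_new and knode_split are hypothetical, the fragment is held in a separately allocated string, and ANSI C prototypes are used.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout; the fragment is a separate heap string. */
struct knode {
    struct knode *next;   /* next sibling, or NULL */
    struct knode *child;  /* first child, or NULL */
    long value;           /* non-zero marks an exit node */
    char *str;            /* symbol fragment */
};

/* Allocate a node holding a copy of frag; returns NULL on failure. */
struct knode *knode_new(long value, const char *frag)
{
    struct knode *n = malloc(sizeof *n);

    if (n == NULL)
        return NULL;
    n->str = malloc(strlen(frag) + 1);
    if (n->str == NULL) {
        free(n);
        return NULL;
    }
    strcpy(n->str, frag);
    n->next = NULL;
    n->child = NULL;
    n->value = value;
    return n;
}

/*
 * Split node p after its first `keep` characters.  The tail moves to a
 * new child that inherits p's value and children; p keeps the head and
 * takes new_value (0 for Step 3, the inserted symbol's value for
 * Step 4(b)).  Returns the new tail node, or NULL if allocation failed,
 * in which case p is left unchanged.
 */
struct knode *knode_split(struct knode *p, size_t keep, long new_value)
{
    struct knode *tail;

    assert(p != NULL && keep > 0 && keep < strlen(p->str));
    tail = knode_new(p->value, p->str + keep);
    if (tail == NULL)
        return NULL;
    tail->child = p->child;   /* tail inherits the children */
    p->child = tail;          /* tail becomes the only child of p */
    p->value = new_value;
    p->str[keep] = '\0';      /* shorten p to the matched characters */
    return tail;
}
```

Splitting a node hers after two characters with a new value of 1 leaves an exit node he whose only child is rs, which corresponds to the transformation described for Figure 6.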
Figure 5.

Figure 6.

SEARCH ALGORITHM

The search algorithm is extremely simple. Each character within an input pattern need be examined only once. For each character within the pattern, progress is made to determine one of three outcomes: the pattern does not represent a valid symbol, the pattern is ambiguous or the pattern represents a (possibly abbreviated) valid symbol. Starting at the root of the trie, the search algorithm may be specified as follows:

Step 1. If there is no node among the siblings at this level with an initial character that matches the current input pattern character, return failure.
Step 2. While the corresponding character within the node and the current character in the pattern match, and neither string is exhausted, advance the pattern pointer and the pointer into the node string by one character. At this point, one or more of the following three conditions is true: a mismatch has been detected, the pattern has been exhausted or the string fragment within this node has been exhausted.
Step 3. If the pattern is exhausted and the current node is not an exit node, return failure due to ambiguity.
Step 3(a). If the pattern is exhausted and the current node is an exit node, return success.
Step 4. If a character mismatch occurred, return failure.
Step 5. If the pattern is not exhausted, the string within the node is exhausted and the node has no children, return failure.
Step 6. Make the first child of the node the current node. Go to Step 1.

Note that this algorithm can be implemented either recursively or non-recursively (by eliminating the tail recursion).

ALGORITHM COMPLEXITY

Searching

One can immediately observe that exactly one pass is made over the input pattern. Only in Step 1 can a character be examined more than once. This occurs while looking for an appropriate sibling node to traverse. Since there will be at most one node which matches the pattern character, no backtracking in either the trie or the pattern is necessary. Within an implementation, multiple comparisons of the initial character can be eliminated, as will be discussed. Additionally, a straight traversal to the proper node or search failure is guaranteed by this algorithm. Thus, substrings within each of the nodes are probed only once.

The time (in number of character comparisons) necessary to determine success or failure of a pattern match is therefore proportional to the length of the input pattern. Given an input pattern of length n, no more than n character comparison operations will be performed to determine the outcome. (Failure due to misspelling is often encountered earlier.) Step 2 always advances pointers and never allows backtracking.

The algorithm does not explicitly specify a method for determining the next search node in Step 1. The implementor is free to use whatever method seems most appropriate. Any of several structures (array, bit map, linked list, hash table, etc.) might be used. The method chosen determines the time spent in this step, which can be as low as constant time. Note that Steps 3 to 6 take unit time.

Trie construction

Steps 1, 2 and 5 in constructing the keyword trie are essentially the search algorithm. The time needed to find the appropriate place to modify the trie is the same as for the searching problem. Steps 3 and 4 are actually independent of the algorithm and rely more on the programming language used for implementation and its support library. If a memory allocator is used, as opposed to an array representation, time spent in the memory allocator will almost certainly dominate in these steps.
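The search steps and the two rules for ambiguous pattern resolution can be sketched in ANSI C as follows. The names knode, kw_search, KW_UNKNOWN, KW_AMBIGUOUS and figure4_root are hypothetical. The return conventions (0 for unknown, -1 for ambiguous, otherwise the exit value) follow the code of the sample implementation. The test for a partial match inside an exit node that has children is this sketch's reading of Rule 2 and should be treated as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout for illustration. */
struct knode {
    struct knode *next;   /* next sibling, or NULL */
    struct knode *child;  /* first child, or NULL */
    long value;           /* non-zero marks an exit node */
    const char *str;      /* symbol fragment */
};

enum { KW_UNKNOWN = 0, KW_AMBIGUOUS = -1 };

/* Search the sibling list starting at p for pat; see Steps 1 to 6. */
long kw_search(const struct knode *p, const char *pat)
{
    assert(pat != NULL);
    while (p != NULL) {
        const char *s;

        /* Step 1: find the sibling whose first character matches. */
        while (p != NULL && p->str[0] != *pat)
            p = p->next;
        if (p == NULL)
            return KW_UNKNOWN;

        /* Step 2: advance both pointers while the characters agree. */
        s = p->str;
        while (*s != '\0' && *s == *pat) {
            s++;
            pat++;
        }

        if (*pat == '\0') {
            if (p->value == 0)
                return KW_AMBIGUOUS;   /* Step 3: interior node */
            if (*s != '\0' && p->child != NULL)
                return KW_AMBIGUOUS;   /* Rule 2 (assumed reading) */
            return p->value;           /* Step 3(a) */
        }
        if (*s != '\0')
            return KW_UNKNOWN;         /* Step 4: character mismatch */
        p = p->child;                  /* Steps 5 and 6 */
    }
    return KW_UNKNOWN;                 /* no nodes left to search */
}

/* The trie of Figure 4 for {he, she, his, hers}, built by hand. */
static struct knode n_rs  = { NULL,   NULL,  4, "rs"  };
static struct knode n_is  = { NULL,   NULL,  3, "is"  };
static struct knode n_e   = { &n_is,  &n_rs, 1, "e"   };
static struct knode n_she = { NULL,   NULL,  2, "she" };
static struct knode n_h   = { &n_she, &n_e,  0, "h"   };

const struct knode *figure4_root(void)
{
    return &n_h;
}
```

Each pattern character is compared with a node fragment at most once in Step 2. Only the sibling scan in Step 1 can look at the same pattern character again, which is the cost that an array or bit map would remove.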
SAMPLE IMPLEMENTATION

To demonstrate a correct implementation of the keyword data structure and algorithm, Listing 1 in the Appendix gives a complete and working implementation in C. This implementation has been tested with a number of different machine and compiler combinations, both ANSI and non-ANSI conforming, and is believed to be portable (although it does not make use of ANSI-specific features). Throughout the insert and lookup routines, comments within the code indicate what steps in the respective algorithms are being performed.

Within the declared structure of a node, the next field is used to form a list of siblings. This is a forward pointer to the next sibling in the list and has the value NULL when the end of the list is reached. The child field has the value NULL if a node has no children; otherwise it contains a pointer to the first child in the list of children for a node. The value field is used to associate a user-specified constant with each symbol. This field may take on any value other than 0 or -1 and is otherwise not used by the routines. The presence of a non-zero value in this field serves to distinguish exit nodes from non-exit nodes. Furthermore, the value -1 is returned by the searching routine to indicate that an ambiguous pattern has been specified. Space at the end of the node structure is declared for a variable-length string which holds the symbol fragment associated with this node. In this implementation, the node structure is actually a variable-length structure based on the length of the string fragment and the size of the fixed fields. The variable tree serves as the root of the keyword trie and is initialized at compile time with the appropriate values.

The function insert attempts to insert a symbol into the keyword trie. Associated with the symbol is a user-specified value (which is generally opaque to all of the trie functions). The function returns the value 1 upon successfully inserting the symbol; otherwise it returns a 0. If a symbol is installed more than once, a 1 is returned. In practice, an error indication could be returned or the associated value updated (see Step 4). This implementation uses the memory allocator to allocate space for new nodes as needed. The code checks for out-of-memory conditions and returns a 0 when one occurs.

The function lookup attempts to locate a symbol within the keyword trie that matches the input pattern. This function returns 0 if the pattern cannot be located. The value -1 is returned if the pattern is ambiguous. If the pattern successfully matches a symbol, the value field of the corresponding exit node is returned.

The function dump is a useful utility routine that can be called to create a visual representation of the trie. This routine calls the recursive function listing to perform a depth-first traversal of the trie. Whenever exit nodes are encountered, the routine emits a line containing the symbol and the user-specified value that was associated with the symbol when it was installed in the trie. For each branch point in the trie, a vertical bar is placed between characters so that distinct nodes forming a complete symbol can be easily seen. Figure 7 shows the output from this routine given the symbols {he, she, his, hers} installed in a keyword trie.

The sample main routine provided in Listing 1 in the Appendix gives examples calling each of the trie manipulation routines.
The four symbols used as examples throughout this paper are inserted and subsequently searched for. The trie is then dumped to the standard output file as illustrated in Figure 7. Finally, the program reads lines of input patterns from the standard input file and emits the value returned by lookup to the standard output file. The program terminates when the end of file has been read on standard input.

"|h|e", value 1
"|h|e|rs", value 4
"|h|is", value 3
"|she", value 2

Figure 7.

EFFICIENCY CONSIDERATIONS

Searching

Efficient implementation of Step 1 has the potential to have the most substantial impact on the execution time of this algorithm. Implementations by the author have used a generalized linked list package similar to that described by Levy (Reference 2). However, a character-indexed array of pointers and bit maps would also be quite promising. Each of these would trade extra space for the time needed to locate an appropriate subordinate node. If this algorithm is loaded with a dictionary of some large number of symbols, the decreased time to find the next node to search may well be worth the increased space requirements. Additionally, the implementor may choose to take a mixed approach, using different methods for each node based on the number of offspring nodes.

OTHER CONSIDERATIONS

An extensive gallery of techniques and improvements can be brought to bear on the implementation given. In fact, the author uses a version of this algorithm allowing multiple tries to be created, destroyed and searched during the lifetime of an executing program. The remainder of this section will outline potential modifications which could be made to adapt the keyword trie to specific applications.

One might choose to perform certain optimizations only at the root level of the trie so as to minimize the time spent looking for a node to traverse. Using an array of pointers indexed by character enables one to determine, in constant time, whether the first character of an input symbol is contained in any of the nodes connected to the root. A pointer array also has the advantage of providing the pointer to the node in which the search operation should begin. Similarly, a bit map of all valid initial characters can also speed up the first probe. This has a similar speed advantage to that of the array and consumes significantly less space. However, the algorithm still needs to locate the appropriate node to start the search.

At all levels of the trie, bit maps and arrays can also be used. However, this use of space may become unacceptable, and the linked list approach shown in Listing 1 may appear more attractive.
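A bit map of valid initial characters, as suggested above, might look like the following sketch. The names first_map, first_map_clear, first_map_add and first_map_test are hypothetical and do not appear in Listing 1. The map only answers whether any root-level node can start with a given character, so a search that passes the test must still locate the node itself.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* One bit per possible unsigned char value. */
#define KW_BITS  (UCHAR_MAX + 1)
#define KW_BYTES ((KW_BITS + CHAR_BIT - 1) / CHAR_BIT)

struct first_map {
    unsigned char bits[KW_BYTES];
};

/* Mark every initial character as absent. */
void first_map_clear(struct first_map *m)
{
    assert(m != NULL);
    memset(m->bits, 0, sizeof m->bits);
}

/* Record the first character of a symbol inserted at the root level. */
void first_map_add(struct first_map *m, const char *sym)
{
    unsigned char c;

    assert(m != NULL && sym != NULL && *sym != '\0');
    c = (unsigned char)*sym;
    m->bits[c / CHAR_BIT] |= (unsigned char)(1u << (c % CHAR_BIT));
}

/* Non-zero if some symbol starts with the first character of pat. */
int first_map_test(const struct first_map *m, const char *pat)
{
    unsigned char c;

    assert(m != NULL && pat != NULL);
    c = (unsigned char)*pat;
    return (m->bits[c / CHAR_BIT] >> (c % CHAR_BIT)) & 1;
}
```

With 8-bit characters the map occupies 32 bytes, whereas a character-indexed array needs room for 256 node pointers.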
To speed the search of the sibling list, nodes could be maintained in radix-sorted or most-frequently-used order. Either of these methods will tend to improve the performance of tracing down the node list as opposed to the random ordering used in the current implementation.

Specific additional functions that could be added include optional alphabetic character case insensitivity, eliminating restrictions on the value associated with a symbol, support for multi-byte character sets, symbols denoted by buffer address and length (rather than C strings as given), implementing the trie as an array rather than as a linked list, supporting multiple simultaneous tries through the use of opaque handles to the trie roots, symbol and trie deletion, and adding support for ANSI C constructs (e.g. the associated value should be a void * rather than a long).

On systems which support the alloca () call, the listing () routine can be altered as follows: replace all calls to malloc () with calls to alloca () and eliminate all calls to free ().

Adding support for right-to-left languages would be an interesting exercise. In fact, Knuth (Reference 1, p. 483) suggests that this approach may also be appropriate for cases where a number of symbols contain long common prefixes.

Douglas Schmidt at the University of California, Irvine, has created a complementary set of routines for the Free Software Foundation. His package, called trie-gen, reads an input keyword list and emits initialized data structures and code in C++ to perform a similar search function. The search performance of the package should be nearly identical to that presented in this paper. The space required for the static tables should be somewhat less than that required for the dynamic implementation given here.

CONCLUSION

A set of C functions, very similar to those given in Listing 1, has been in use for a number of years in applications where keywords must be recognized by possible abbreviation and the keyword list is subject to change while the programs are executing. Typical uses for a facility such as this include command line interpreters (shells), debuggers, editors and other interactive or programmable tools. In fact, any application where a relatively small dictionary is in use and abbreviation is allowed will find the routines given to be useful.

ACKNOWLEDGEMENT

This paper was prepared for publication by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.

APPENDIX: LISTING 1

#include <stdio.h>
#include <stdlib.h>     /* malloc, free, exit */
#include <string.h>     /* strlen, strcpy, strcat, strchr */

struct node {
    struct node *next;      /* Pointer to next sibling or NULL */
    struct node *child;     /* Pointer to children or NULL */
    long value;             /* Associated exit value (opaque) */
    char str[1];            /* Variable length string */
};

struct node tree = { NULL, NULL, 0L, '\0' };

/* Allocate a node large enough to hold the fragment frag. */
static struct node *new (value, frag)
long value;
char *frag;
{
    struct node *n;

    n = (struct node *) malloc (sizeof (struct node) + strlen (frag) + 1);
    if (n == NULL)
        return (NULL);
    n->next = NULL;
    n->child = NULL;
    n->value = value;
    strcpy (n->str, frag);
    return (n);
}

int insert (sym, value)
char *sym;
long value;
{
    char *s;
    struct node *p, *q;

    if (value == 0L || value == -1L)    /* Reserved values */
        return (0);
    if ((p = tree.next) == NULL) {      /* Step 1 */
        q = new (value, sym);
        if (q == NULL)
            return (0);
        tree.next = q;
        return (1);
    }
    while (1 == 1) {
        /* q trails p so that it ends on the last sibling examined */
        for (; p != NULL; q = p, p = p->next) {
            if (*p->str == *sym) {
                s = p->str;
                while (*s != '\0' && *s == *sym) {
                    s++;
                    sym++;
                }
                if (*sym == '\0') {
                    if (*s == '\0') {
                        if (p->value != 0) {    /* Step 4 */
                            return (1);
                        } else {                /* Step 4a */
                            p->value = value;
                            return (1);
                        }
                    } else {                    /* Step 4b */
                        q = new (p->value, s);
                        if (q == NULL)
                            return (0);
                        q->child = p->child;
                        p->child = q;
                        p->value = value;
                        *s = '\0';
                        return (1);
                    }
                }
                if (*s != '\0') {               /* Step 3 */
                    q = new (p->value, s);
                    if (q == NULL)
                        return (0);
                    q->child = p->child;
                    p->child = q;
                    p->value = 0L;
                    *s = '\0';
                                                /* Step 3a */
                    p = new (value, sym);
                    if (p == NULL)
                        return (0);
                    q->next = p;
                    return (1);
                }
                if (p->child == NULL) {         /* Step 5 */
                    q = new (value, sym);
                    if (q == NULL)
                        return (0);
                    p->child = q;
                    return (1);
                } else {
                    p = p->child;
                    break;
                }
            }
        }
        if (p == NULL) {                        /* Step 5 */
            p = new (value, sym);
            if (p == NULL)
                return (0);
            q->next = p;
            return (1);
        }
    }
}

long lookup (pat)
char *pat;
{
    char *s;
    struct node *p;

    /* Step 1 */
    if ((p = tree.next) == NULL)
        return (0);
    while (1 == 1) {
        do {
            if (*p->str == *pat) {
                s = p->str;
                /* Step 2 */
                while (*s != '\0' && *s == *pat) {
                    s++;
                    pat++;
                }
                if (*pat == '\0') {
                    if (p->value == 0L)         /* Step 3 */
                        return (-1);
                    else                        /* Step 3a */
                        return (p->value);
                }
                /* Step 4 */
                if (*s != '\0')
                    return (0);
                /* Step 6 */
                if ((p = p->child) == NULL)     /* Step 5 */
                    return (0);
                else
                    break;
            }
        } while ((p = p->next) != NULL);
        /* Step 1 */
        if (p == NULL)
            return (0);
    }
}

/* Depth-first traversal; parent holds the fragments seen so far. */
static void listing (p, parent)
struct node *p;
char *parent;
{
    char *buffer;
    struct node *q;

    buffer = (char *) malloc (strlen (parent) + strlen (p->str) + 2);
    if (buffer == NULL)
        return;
    strcpy (buffer, parent);
    strcat (buffer, "|");
    strcat (buffer, p->str);
    if (p->value != 0)
        printf ("\"%s\", value %ld\n", buffer, p->value);
    if ((q = p->child) == NULL) {
        free (buffer);
        return;
    }
    do {
        listing (q, buffer);
    } while ((q = q->next) != NULL);
    free (buffer);
}

void dump ()
{
    struct node *p;

    if ((p = tree.next) == NULL) {
        puts ("...(empty)...");
        return;
    }
    do {
        listing (p, "");
    } while ((p = p->next) != NULL);
}

int main ()
{
    char buffer[BUFSIZ], *s;

    printf ("Adding \"he\" = %d\n", insert ("he", 1L));
    printf ("Adding \"she\" = %d\n", insert ("she", 2L));
    printf ("Adding \"his\" = %d\n", insert ("his", 3L));
    printf ("Adding \"hers\" = %d\n\n", insert ("hers", 4L));
    printf ("Lookup \"he\" = %ld\n", lookup ("he"));
    printf ("Lookup \"she\" = %ld\n", lookup ("she"));
    printf ("Lookup \"his\" = %ld\n", lookup ("his"));
    printf ("Lookup \"hers\" = %ld\n\n", lookup ("hers"));
    dump ();
    while (fgets (buffer, BUFSIZ - 1, stdin) != NULL) {
        if ((s = strchr (buffer, '\n')) != NULL)
            *s = '\0';
        printf ("Lookup \"%s\" = %ld\n", buffer, lookup (buffer));
    }
    exit (0);
}

REFERENCES

1. D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, Addison-Wesley, Reading, MA, 1973, pp. 481-499.
2. E. Levy, 'The linked list class of Modula-3', SIGPLAN Notices, 23, (8), 93-102 (1988).