Non-word Error Detection and Correction


1 Non-word Error Detection and Correction. Prof. Bidyut B. Chaudhuri, J. C. Bose Fellow & Head, CVPR Unit, Indian Statistical Institute, Kolkata


3 Mistyping or unknown spelling of a word produces either a real-word error or a non-word error. The consequences are a syntax anomaly, a semantic anomaly, or a nonsensical situation.

4 Low-level task (spell-checker): find incorrect words (non-word errors), suggest correct alternatives and rank them, and correct automatically or interactively.
High-level task (real-word error correction): find lexically correct but syntactically or semantically incorrect words (real-word errors), suggest correct alternatives and rank them, and correct automatically.
Some spell-check software for English: UNIX spell, Aspell, Grope, CLARE, SPEEDCOP, Spellex, etc.

5 1. Split word: a space is wrongly inserted within the word.
2. Run-on or merged words: the space between two or more words is not inserted.
3. Character Insertion, Deletion and Substitution (IDS): one or more characters are substituted or deleted, or a character is inserted in the word.
Split-word and run-on errors are to be checked before going for IDS error correction. Usually the character string is looked up in a word list (dictionary). If there is no match, the string is not a valid word, and the correction effort starts.

6 If, according to the dictionary check, two consecutive strings are non-words, we can merge them and look up the merged string in the dictionary. If it is a valid word, a split-word error has been detected and corrected. If a single string is not a valid word, we can check whether a leading portion of it (from the left side) is a valid word. If so, we check whether the rest is also a valid word. If both parts are valid, a merged-word (run-on) error has been detected. It is corrected by inserting a space between the two words.
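The two checks described on this slide can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `fix_split_and_run_on`, the token-list interface, and the toy word set are all assumptions made for the example.

```python
def fix_split_and_run_on(tokens, dictionary):
    """Repair split-word and run-on errors in a list of tokens (hypothetical helper)."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in dictionary:
            out.append(tok)
            i += 1
            continue
        # Split word: two consecutive non-words whose concatenation is valid.
        if i + 1 < len(tokens) and tokens[i + 1] not in dictionary:
            merged = tok + tokens[i + 1]
            if merged in dictionary:
                out.append(merged)
                i += 2
                continue
        # Run-on: a valid left portion followed by a valid remainder.
        for cut in range(1, len(tok)):
            if tok[:cut] in dictionary and tok[cut:] in dictionary:
                out.extend([tok[:cut], tok[cut:]])
                break
        else:
            out.append(tok)  # unresolved: left for IDS correction
        i += 1
    return out


WORDS = {"spell", "checker", "is", "useful"}  # toy dictionary (assumption)
print(fix_split_and_run_on(["spe", "ll", "checkeris", "useful"], WORDS))
```

The sketch takes the first valid cut point for a run-on string; a real checker would probably collect all valid splits and rank them.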

7 Input W. If W is present in the word list, declare it a valid word. If W is not in the word list, find correction candidates, rank the candidates, and present the user with the best 5 candidates.

8-16 Single-error types, illustrated on the word SPELLCHECK:
Substitution: S P E L L C H E C K -> S P E L X C H E C K
Deletion: S P E L L C H E C K -> S P E L L H E C K
Insertion: S P E L L C H E C K -> S P E L L X C H E C K
Transposition: S P E L L C H E C K -> S P E L C L H E C K
Substitution and transposition can each be composed of multiple insertions and deletions, which are the basic operations.

17 1. Language issue: word morphology (degree of inflectionality), diglossia, echo words, onomatopoeia.
2. Script issue: alphabet size, character shape, presence of vowel modifiers and compound characters.
3. Spelling issue: alternative spellings, standardization of spelling.

18 4. Error pattern issue: single vs. multiple errors; substitution, deletion, insertion, transposition; phonetic/graphemic similarity; other tendencies.
5. Application area issue:
(A) Subject based: newspaper text, official letters, notes and report preparation, technical book writing, story and novel writing.
(B) Technology output based: OCR output, speech recognition output, Braille-to-text output.

19 A string of length n can have 2n+1 error/correction positions.
Position:   0  1  2  3  4  5  6  7  8  9 10 11 12
Character:     C     A     R     B     O     N
Odd-numbered positions (the characters): single substitution or deletion.
Even-numbered positions (the gaps): one or more insertions.
Substitution at position 11: CARBOL
Double insertion at position 10: CARBOYLN
Deletion at position 7 and insertion at position 12: CARONS
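The position scheme can be checked with a few lines of code. The sketch below assumes positions are numbered 0 to 2n, with characters at odd positions and gaps at even positions, which is the numbering implied by the three CARBON examples on the slide; the function names are hypothetical.

```python
def char_index(position):
    """Map an odd position (1, 3, ..., 2n-1) to a 0-based character index."""
    if position % 2 == 0:
        raise ValueError("characters sit at odd positions")
    return position // 2


def gap_index(position):
    """Map an even position (0, 2, ..., 2n) to a 0-based insertion index."""
    if position % 2 == 1:
        raise ValueError("gaps sit at even positions")
    return position // 2


def substitute(word, position, ch):
    i = char_index(position)
    return word[:i] + ch + word[i + 1:]


def delete(word, position):
    i = char_index(position)
    return word[:i] + word[i + 1:]


def insert(word, position, chars):
    i = gap_index(position)
    return word[:i] + chars + word[i:]


word = "CARBON"
print(2 * len(word) + 1)             # number of error/correction positions
print(substitute(word, 11, "L"))     # CARBOL
print(insert(word, 10, "YL"))        # CARBOYLN
# Positions refer to the original word, so apply the rightmost edit first.
print(delete(insert(word, 12, "S"), 7))  # CARONS
```

Applying combined edits from right to left keeps the earlier positions valid, since an edit never shifts the characters to its left.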

20 Dictionary look-up: For a word in the document, check whether it is listed in the dictionary. If yes, pass it as a valid word. Else, flag it as an incorrect word and provide suggestions.
N-gram: Store all N-grams that occur in valid words in an N-dimensional array. For a word in the document, check whether all its N-grams are present in the array. If yes, pass it as a valid word. Else, flag it as an incorrect word and generate suggestions. (Useful for OCR error correction.)
Morphological analysis: It is almost impossible to build a dictionary containing all inflected words. Morphological analysis is used to strip the suffix, verify the root word, and check whether the suffix morphologically agrees with the root word. If yes, the word passes as a valid one. Else, it is stopped as an incorrect word.
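As an illustration of the N-gram check, the sketch below uses bigrams and stores them in a Python set instead of an N-dimensional array; that substitution, the helper names, and the toy word list are assumptions made for brevity.

```python
def ngrams(word, n=2):
    """All contiguous character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_ngram_table(words, n=2):
    """Collect every n-gram that occurs in some valid word."""
    table = set()
    for w in words:
        table.update(ngrams(w, n))
    return table


def passes_ngram_check(word, table, n=2):
    """True if every n-gram of the word occurs in the table."""
    return all(g in table for g in ngrams(word, n))


WORDS = ["carbon", "copy", "name", "lame"]  # toy word list (assumption)
TABLE = build_ngram_table(WORDS)
print(passes_ngram_check("cqrbon", TABLE))  # the bigram 'cq' never occurs
print(passes_ngram_check("came", TABLE))    # every bigram occurs somewhere
```

Note that "came" passes even though it is not in the toy word list: the N-gram test only rejects strings containing an impossible N-gram, so it is weaker than a full dictionary look-up.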


21 Minimum Edit Distance: The minimum number of editing operations (insertion, deletion, substitution, transposition) needed to convert a string of characters into a valid word. It was proposed by Damerau and Levenshtein and goes under their names. It needs dynamic programming to compute.
Error correction approach: Find the minimum edit distance of the misspelled string from all words in the dictionary. The words with the least edit distance are the suggested corrections.
Reverse Edit Distance: The above approach needs edit-distance computations over the whole dictionary. An easier approach is reverse edit distance, where the error string is transformed by editing operations and the resulting strings are tested for validity in the dictionary.
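A minimal sketch of the reverse edit distance idea, restricted to a single editing operation over a lowercase Latin alphabet (both restrictions are assumptions for the example, as are the function names): every string one edit away from the error string is generated and kept only if it is a dictionary word.

```python
import string


def single_edits(s, alphabet=string.ascii_lowercase):
    """All strings reachable from s by one deletion, transposition,
    substitution, or insertion."""
    splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | transposes | substitutes | inserts


def reverse_edit_candidates(s, dictionary):
    """Edited versions of s that are valid dictionary words."""
    return sorted(w for w in single_edits(s) if w in dictionary)


WORDS = {"spellcheck", "correction", "absolute"}  # toy dictionary (assumption)
for wrong in ["spelxcheck", "spellheck", "spellxcheck", "spelclheck"]:
    print(wrong, reverse_edit_candidates(wrong, WORDS))
```

The number of generated strings grows with the string length and alphabet size but not with the dictionary size, which is why this direction can be cheaper than scanning the whole dictionary.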

22 (a) Similarity key technique: Exploits phonetic similarity between the misspelled string and the intended word. (The Aspell software partially uses this approach.)
(b) Rule-based technique: Some spelling error patterns can be represented in the form of rules. This class of techniques tries to build a kind of expert system.
(c) N-gram based techniques: Try to replace impossible bigrams or trigrams by possible ones and check whether this yields a valid word. As stated before, this has more potential for OCR error correction.
(d) Probabilistic technique: Tries to exploit Bayes' rule as well as transition probability and confusion probability.
(e) Neural net and evolutionary computing: A multi-layer perceptron is trained with erroneous strings vs. valid words.
(f) Word trigram based error correction: Church and Gale (1991) and Brill and Moore (2000) used a word trigram library to choose and rank the suggested words.

23 Let G and O be a dictionary word and the typed string, respectively. If G has n characters and O has m characters, the edit distance D(i, j) is computed recursively as
$$D(i,j) = \min\bigl[\, D(i-1,j) + C_d(G_i),\;\; D(i-1,j-1) + C_s(O_j, G_i),\;\; D(i,j-1) + C_i(O_j) \,\bigr]$$
where D(0,0) is initialized to zero; $C_s$, the substitution cost, is 0 if $O_j = G_i$ and 1 otherwise; $C_d$, the deletion cost, is 1; and $C_i$, the insertion cost, is 1. D(n, m) is the minimum edit distance between O and G.
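The recurrence translates directly into a dynamic-programming table. The sketch below follows the slide's unit costs; the boundary values D(i,0) = i and D(0,j) = j are not stated on the slide and are the usual assumption (i deletions or j insertions). Transposition is not part of this recurrence, so a swap of two adjacent characters costs 2 here.

```python
def edit_distance(G, O):
    """Minimum edit distance between dictionary word G and typed string O."""
    n, m = len(G), len(O)
    # D[i][j]: distance between the first i characters of G and first j of O.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # assumed boundary: delete i characters of G
    for j in range(1, m + 1):
        D[0][j] = j  # assumed boundary: insert j characters of O
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c_s = 0 if O[j - 1] == G[i - 1] else 1  # substitution cost
            D[i][j] = min(
                D[i - 1][j] + 1,        # deletion, C_d = 1
                D[i - 1][j - 1] + c_s,  # match or substitution
                D[i][j - 1] + 1,        # insertion, C_i = 1
            )
    return D[n][m]


print(edit_distance("spellcheck", "spelxcheck"))  # one substitution
print(edit_distance("carbon", "carons"))          # one deletion, one insertion
```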

24 Legend: D = Deletion, M = Match, R = Replacement, I = Insertion


26 From the original dictionary D, a reversed dictionary D_r is formed. If COPY is a word, then YPOC is its reversed version. All words of D are reversed to get D_r, which is then alphabetically ordered. In general,
$W_i^r(j) = W_i(L_i - 1 - j)$ for $0 \le j \le L_i - 1$,
where $L_i$ is the length of word $W_i$.
Also, a 1-shifted dictionary D_1 is formed by shifting the first character of each word in D to the last position:
$W_i^1(j) = W_i(j+1)$ for $1 \le j < L_i$, and $W_i^1(L_i) = W_i(1)$.
Similarly, a 2-shifted dictionary D_2 is formed by shifting the first character of each word in D_1 to the last position:
$W_i^2(j) = W_i^1(j+1)$ for $1 \le j < L_i$, and $W_i^2(L_i) = W_i^1(1)$.

27 Dictionary                 Word 1        Word 2        Word 3
D_o  Original word          SPELLCHECK    CORRECTION    ABSOLUTE
D_r  Reversed word          KCEHCLLEPS    NOITCERROC    ETULOSBA
D_1  1-character shift      PELLCHECKS    ORRECTIONC    BSOLUTEA
D_2  2-character shift      ELLCHECKSP    RRECTIONCO    SOLUTEAB
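The auxiliary dictionaries are simple string transformations of the original word list. A minimal sketch, with hypothetical function names and the dictionary keys "D_o", "D_r", "D_1", "D_2" chosen for the example:

```python
def reverse_word(w):
    """Reversed version of a word, e.g. COPY -> YPOC."""
    return w[::-1]


def shift_word(w, k):
    """Move the first k characters of w to the end (k-character shift)."""
    if not w:
        return w
    k %= len(w)
    return w[k:] + w[:k]


def build_dictionaries(words):
    """Build alphabetically ordered D_o, D_r, D_1 and D_2 from a word list."""
    return {
        "D_o": sorted(words),
        "D_r": sorted(reverse_word(w) for w in words),
        "D_1": sorted(shift_word(w, 1) for w in words),
        "D_2": sorted(shift_word(w, 2) for w in words),
    }


for name, entries in build_dictionaries(["SPELLCHECK", "CORRECTION", "ABSOLUTE"]).items():
    print(name, entries)
```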

28 For quick search and access, the dictionaries are arranged in a trie structure. The name "trie" comes from the word "retrieval"; the term was proposed by Edward Fredkin. A trie is a tree-like structure where each node corresponds to a character of a dictionary word and each branch shows the sequence of characters in a word. To economize on space, we can combine the tries for D_o, D_r, D_1 and D_2 into a single multi-trie structure.

29 E.g., suppose the word list has the entries Astral, Aztec, Cerulean, Cereal, Lame, Name. [Figure: combined trie for this word list, with D_o edges in blue and D_r edges in red.] INDIAN STATISTICAL INSTITUTE. B. B. CHAUDHURI, CVPR UNIT
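A plain single-dictionary trie for the example word list (astral, aztec, cerulean, cereal, lame, name) can be sketched as below. The class and method names are hypothetical, and the combined multi-trie with coloured D_o and D_r edges is not reproduced; one simple way to approximate it would be to build a second `Trie` over the reversed words.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a dictionary word ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, s):
        """Follow s from the root; return the final node or None."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


WORDS = ["astral", "aztec", "cerulean", "cereal", "lame", "name"]
d_o = Trie(WORDS)
d_r = Trie(w[::-1] for w in WORDS)  # reversed dictionary as a second trie
print(d_o.contains("cereal"), d_o.contains("cere"), d_o.has_prefix("cer"))
print(d_r.contains("emal"))  # "lame" reversed
```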

30 [Figure: (a) the wrong word string S = D E M O N S P R A T I O N; (b) forward dictionary (D) search, with the point after which the search in D stops; (c) reversed dictionary (D_r) search, with the point after which the search in D_r stops. The characters between the two stopping points form the error zone.]

31 [Table: error zone length (in number of characters) vs. % of strings; the values are not preserved.] The error is located at either end of the error zone.

32 The erroneous string S is partitioned into two equal regions, (1) and (2). Let their lengths be n. If (1) is error-free, then find valid words W_1 in the dictionary D_o of length n, n+1, n-1. If (2) is error-free, then find valid words W_2 in the reversed dictionary D_r of length n, n+1, n-1. The union of W_1 and W_2 is the list of candidate words. This approach reduces the amount of search. The method can be extended to errors at two or more positions as well. (How?)
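The single-error search can be sketched with two alphabetically sorted lists and a binary search for the prefix range. Several choices here are assumptions rather than the slide's method: sorted lists stand in for the tries, the length filter keeps words whose length is within one character of the whole string S, and the function names are hypothetical.

```python
from bisect import bisect_left


def prefix_matches(sorted_words, prefix):
    """All entries of an alphabetically sorted list that start with prefix."""
    out = []
    for w in sorted_words[bisect_left(sorted_words, prefix):]:
        if not w.startswith(prefix):
            break
        out.append(w)
    return out


def one_error_candidates(s, words):
    """Candidate corrections for s, assuming a single error in one half."""
    d_o = sorted(words)
    d_r = sorted(w[::-1] for w in words)
    half = len(s) // 2
    left, right = s[:half], s[half:]

    def length_ok(w):
        return abs(len(w) - len(s)) <= 1

    # Region (1) assumed error-free: forward search in D_o.
    w1 = {w for w in prefix_matches(d_o, left) if length_ok(w)}
    # Region (2) assumed error-free: search the reversed string in D_r.
    w2 = {w[::-1] for w in prefix_matches(d_r, right[::-1]) if length_ok(w)}
    return sorted(w1 | w2)


WORDS = {"demonstration", "demonstrative", "remonstration", "correction"}
print(one_error_candidates("demonspration", WORDS))
print(one_error_candidates("xemonstration", WORDS))
```

The union is only a candidate list: some entries may be more than one edit away from S, so a final edit-distance check or the ranking step would still be needed.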

33 The string is partitioned into three regions, (1), (2) and (3).
Case 1: Both errors are in regions (2) and (3). Hence region (1) is error-free. Use the original dictionary D_o for finding correction candidates.
Case 2: Both errors are in regions (1) and (2). Hence region (3) is error-free. Use the reversed word dictionary D_r for finding correction candidates.
Case 3: One error is in (1) and the other in (3). So (2) is error-free. For this case we need the dictionaries D_1 and D_2 on a modified input string.

34 For the word list Astral, Aztec, Cerulean, Cereal, Lame, Name: [Figure: combined trie with D_o edges in blue, D_r edges in red and D_1 edges in violet.]

35 [Figure: actions to be taken for the example word A B S O L U T E with erroneous characters X and Z. Depending on where the errors lie, the string is used with zero shift, with one character moved to the right end, or with two characters moved to the right end, and then searched in D_o, D_1 or D_2.]

36 Step 1: If the current test word W exists in dictionary D, go to step 6. Else, paint the error color on the word and continue.
Step 2: Test W in the D and D_r tries for 1-error suggestion generation. If 5 suggestions are found, go to step 5.
Step 3: Test W in the D, D_r, D_1 and D_2 tries for 2-error suggestion generation and collect the suggestions. If the suggestion list is not empty, go to step 5.
Step 4: If the suggestion list is empty, display NO SUGGESTION. Go to step 6.
Step 5: Use phonetic similarity, keyboard neighborhood and word popularity to rank the suggestions. Display at most 5 ranked words.
Step 6: If W is the last string, EXIT. Else, take the next word from the input file and go to step 1.

37 [Figure: keyboard layout.] Candidate key: D. First-order neighboring keys: S, F. Second-order neighboring keys: X, C and E.

38 Neighboring-key character weight:
  Key relation                                Weight   Normalized weight
  1st-order neighbor                          5        5/9
  2nd-order neighbor                          3        3/9
  Other characters                            1        1/9
Phonetic-similarity-based weight:
  Phonetically similar characters             3        3/4
  Other characters                            1        1/4
Editing-operation-based weight:
  Candidate generated by substitution         3        3/4
  Candidate generated by insertion/deletion   1        1/4
Word-statistics-based weights:
a) First order: If k suggestions are generated, rank them according to their prior probability in the corpus. The top rank gets weight k (normalized as k/Σk, where Σk = 1 + 2 + ... + k), the next one gets weight k-1 (normalized as (k-1)/Σk), and so on.
b) Second order: Conditional probability (bigram) is employed to generate the weight.
These weights are linearly combined to get the score for each suggested word. The words are then ranked according to this combined weight and displayed.
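The weighting scheme can be sketched as a small scoring function. The normalized weights come from the slide; the equal combination coefficients, the reading of the rank normalizer as 1 + 2 + ... + k, the feature dictionary layout, and all names are assumptions made for illustration. The second-order (bigram) weight is left out.

```python
from fractions import Fraction

KEY_WEIGHT = {"first": Fraction(5, 9), "second": Fraction(3, 9), "other": Fraction(1, 9)}
PHONETIC_WEIGHT = {True: Fraction(3, 4), False: Fraction(1, 4)}
EDIT_WEIGHT = {"substitution": Fraction(3, 4), "indel": Fraction(1, 4)}


def frequency_weights(candidates, frequency):
    """Rank k candidates by corpus frequency; the r-th best (r = 0, 1, ...)
    gets weight (k - r) / (1 + 2 + ... + k)."""
    k = len(candidates)
    total = k * (k + 1) // 2
    ranked = sorted(candidates, key=lambda w: frequency.get(w, 0), reverse=True)
    return {w: Fraction(k - r, total) for r, w in enumerate(ranked)}


def score(features, freq_weight, coeffs=(1, 1, 1, 1)):
    """Linear combination of the four normalized weights (equal coefficients assumed)."""
    parts = (
        KEY_WEIGHT[features["key"]],
        PHONETIC_WEIGHT[features["phonetic"]],
        EDIT_WEIGHT[features["edit"]],
        freq_weight,
    )
    return sum(c * p for c, p in zip(coeffs, parts))


freq = {"said": 900, "sad": 300, "sand": 100}  # made-up corpus counts
fw = frequency_weights(["said", "sad", "sand"], freq)
features = {"key": "first", "phonetic": False, "edit": "substitution"}
print(fw["said"], score(features, fw["said"]))
```

Exact fractions are used so that the normalized weights stay readable; floats would work equally well for ranking.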

39 1. Declares correct words as erroneous ones (including some language-specific cases, like the echo-form "chai wei" in Hindi).
2. Detects the error, but fails to suggest alternative words.
3. Detects the error and suggests alternatives, but the top suggestions do not contain the intended word.
4. Detects the error and suggests alternatives, but they do not include the intended word.
5. Fails to detect run-on and split-word errors.
6. Fails to detect real-word errors.

40 Indian language writing systems have more characters and modifiers (more than double, compared with English). Also, a joiner is needed to form a compound character. So the number of nodes in the dictionary tries is increased, and so is the search time. To reduce the search space, we can club similar-sounding characters into a single symbol. Also, a vowel and its modifier can be given a single tag. For each character there is a set of distinct characters that can follow it. This is true for compound characters as well. Such information can be stored in the form of a table, and hence the trie traversal can be made more efficient.

41 The semi-phonetic dictionary is useful for detecting and correcting substitution errors between similar-sounding characters. Club long and short vowels (u, U; i, I etc.) and consonants with phonetically similar utterance (r, R; n, N; s, S; J) into single entities. Re-organize this semi-phonetic dictionary into a semi-alphabetic ordering. A pointer is kept from each phonetic word to all its valid graphemic words. If the error is a purely phonetic substitution, it can be easily detected and corrected using this dictionary.
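A minimal sketch of such a semi-phonetic index, assuming a romanized transliteration in which upper and lower case distinguish the clubbed characters (u/U, i/I, r/R, n/N, s/S as listed on the slide). The group containing J is left out because its members are not clear from the slide, and the sample words and function names are made up for the example.

```python
from collections import defaultdict

# Clubbing table: each character maps to the representative of its group.
CLUB = str.maketrans({"U": "u", "I": "i", "R": "r", "N": "n", "S": "s"})


def phonetic_key(word):
    """Collapse phonetically similar characters into a single symbol."""
    return word.translate(CLUB)


def build_phonetic_index(words):
    """Map each phonetic word to all valid graphemic words that share it."""
    index = defaultdict(list)
    for w in sorted(words):
        index[phonetic_key(w)].append(w)
    return dict(index)


def phonetic_corrections(s, index):
    """Valid words that differ from s only by phonetic substitutions."""
    return index.get(phonetic_key(s), [])


WORDS = {"nadI", "dUr", "Sakti"}  # hypothetical romanized entries
INDEX = build_phonetic_index(WORDS)
print(phonetic_corrections("nadi", INDEX))
print(phonetic_corrections("sakti", INDEX))
```

A purely phonetic substitution leaves the phonetic key unchanged, so one dictionary look-up on the key is enough to find the intended word.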



44 I invite all students, researchers and faculty members present here to work for the development of Indian language technology. More specifically, I would request you to develop basic tools like a spell-checker, a real-word error corrector, an electronic thesaurus and a word net in Indian languages. Thank you.
