Non-word Error Detection and Correction

1 Non-word Error Detection and Correction. Prof. Bidyut B. Chaudhuri, J. C. Bose Fellow & Head, CVPR Unit, Indian Statistical Institute, Kolkata

3 Word error (mis-typing or unknown spelling): real-word error or non-word error. Real-word error: syntax anomaly, semantic anomaly, nonsensical situation.

4 Low-level task (spell-checker): Find incorrect words (non-word errors). Suggest correct alternatives and rank them. Correct automatically or interactively.
High-level task (real-word error correction): Find lexically correct but syntactically and semantically incorrect words (real-word errors). Suggest correct alternatives and rank them. Correct automatically.
Some spell-check software for English: UNIX spell, Aspell, Grope, CLARE, SPEEDCOP, Spellex, etc.

5 1. Split word: a space is wrongly inserted within a word.
2. Run-on (merged) words: the space between two or more words is omitted.
3. Character Insertion, Deletion and Substitution (IDS): one or more characters are inserted, deleted or substituted.
Split-word and run-on errors are checked before attempting IDS error correction. Usually each character string is looked up in a word list (dictionary). If there is no match, the string is not a valid word, and the correction effort starts.

6 If (according to the dictionary check) two consecutive strings are non-words, merge them and check the merged string in the dictionary. If it is a valid word, a split-word error has been detected and corrected. If a single string is not a valid word, check whether a portion of it (starting from the left) is a valid word. If so, check whether the rest is also a valid word. If both are valid, a merged-word (run-on) error has been detected; it is corrected by inserting a space between the two words.
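The two checks above can be sketched in a few lines of Python. This is only an illustration: the function name `fix_split_and_run_on`, the token-list input and the toy word set are assumptions, not part of the slides.

```python
def fix_split_and_run_on(tokens, dictionary):
    """Repair split-word and run-on errors using dictionary look-ups only."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in dictionary:
            out.append(tok)
            i += 1
            continue
        # Split word: two consecutive non-words whose concatenation is valid.
        if (i + 1 < len(tokens) and tokens[i + 1] not in dictionary
                and tok + tokens[i + 1] in dictionary):
            out.append(tok + tokens[i + 1])
            i += 2
            continue
        # Run-on: a valid left portion followed by a valid remainder.
        for cut in range(1, len(tok)):
            if tok[:cut] in dictionary and tok[cut:] in dictionary:
                out.extend([tok[:cut], tok[cut:]])
                break
        else:
            out.append(tok)  # neither case applies; leave it for IDS correction
        i += 1
    return out


words = {"spell", "checker", "is", "useful"}
print(fix_split_and_run_on(["spe", "ll", "checkeris", "useful"], words))
```

The sketch takes the first valid left/right split it finds; a real checker would probably collect all valid splits and rank them.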

7 Input W. If W is present in the word list, declare it a valid word. If W is not in the word list, find correction candidates, rank the candidates, and present the user with the best 5 candidates.

8 Substitution: S P E L L C H E C K

9 Substitution: S P E L X C H E C K (the second L is replaced by X)

10 Deletion: S P E L L C H E C K

11 Deletion: S P E L L H E C K (the first C is deleted)

13 Insertion: S P E L L C H E C K

14 Insertion: S P E L L X C H E C K (X is inserted after the second L)

15 Transposition: S P E L L C H E C K

16 Transposition: S P E L C L H E C K (the adjacent L and C are swapped). Substitution and transposition can each be composed of insertions and deletions, which are the basic operations.

17 1. Language issue: word morphology and degree of inflection; diglossia, echo words, onomatopoeia.
2. Script issue: alphabet size, character shape, presence of vowel modifiers and compound characters.
3. Spelling issue: alternative spellings, standardization of spelling.

18 4. Error pattern issue: single vs. multiple errors; substitution, deletion, insertion, transposition; phonetic/graphemic similarity; other tendencies.
5. Application area issue:
(A) Subject based: newspaper text, official letters, notes and report preparation, technical book writing, story and novel writing.
(B) Technology output based: OCR output, speech recognition output, Braille-to-text output.

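19 A string of length n has 2n+1 error/correction positions.
Original word: CARBON (n = 6), with error positions 0 to 12. The characters occupy the odd positions (C = 1, A = 3, R = 5, B = 7, O = 9, N = 11); the gaps before, between and after the characters are the even positions (0, 2, ..., 12).
Odd-numbered position: substitution or deletion (of a single character).
Even-numbered position: one or more insertions.
Substitution at position 11: CARBOL
Double insertion at position 10: CARBOYLN
Deletion at position 7 and insertion at position 12: CARONS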
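The 2n+1 addressing scheme can be checked with a short Python sketch. It assumes positions are numbered from 0 (which is what the three examples imply) and reads the example word as CARBON; the function name `apply_edits` and the tuple encoding of the operations are hypothetical.

```python
def apply_edits(word, edits):
    """Apply edits addressed by the 2n+1 position scheme (positions 0 .. 2n).

    Odd position p holds character (p - 1) // 2: ("sub", p, ch) or ("del", p).
    Even position p is the gap before character p // 2: ("ins", p, text).
    """
    n = len(word)
    slots = [""] * (2 * n + 1)
    slots[1::2] = list(word)          # characters sit at the odd positions
    for op in edits:
        kind, pos = op[0], op[1]
        if not 0 <= pos <= 2 * n:
            raise ValueError(f"position {pos} out of range")
        if kind in ("sub", "del"):
            if pos % 2 == 0:
                raise ValueError("substitution/deletion needs an odd position")
            slots[pos] = op[2] if kind == "sub" else ""
        elif kind == "ins":
            if pos % 2 == 1:
                raise ValueError("insertion needs an even position")
            slots[pos] += op[2]
        else:
            raise ValueError(f"unknown operation {kind!r}")
    return "".join(slots)


print(apply_edits("CARBON", [("sub", 11, "L")]))
print(apply_edits("CARBON", [("ins", 10, "YL")]))
print(apply_edits("CARBON", [("del", 7), ("ins", 12, "S")]))
```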

20 Dictionary look-up: For a word in the document, check whether it is listed in the dictionary. If yes, pass it as a valid word. Else, flag it as an incorrect word and provide suggestions.
N-gram: Store all valid N-grams in an N-dimensional array. For a word in the document, check whether all its N-grams are in the array. If yes, pass it as a valid word. Else, flag it as an incorrect word and generate suggestions. (Useful for OCR error correction.)
Morphological analysis: It is almost impossible to build a dictionary containing all inflected words. Morphological analysis strips the suffix, verifies the root word and checks whether the suffix agrees morphologically with the root word. If yes, the word passes as a valid one. Else, it is flagged as an incorrect word.
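A minimal Python sketch of the N-gram check follows. It swaps the N-dimensional array for a set of strings and pads words with a `#` boundary marker; both choices, along with the function names and the toy word list, are assumptions made for brevity.

```python
def ngrams(word, n=2):
    """Return the set of character n-grams of a word, with '#' as boundary marker."""
    padded = "#" + word.lower() + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_ngram_table(words, n=2):
    """Collect every n-gram that occurs in the word list."""
    table = set()
    for w in words:
        table |= ngrams(w, n)
    return table


def passes_ngram_check(word, table, n=2):
    """True if every n-gram of the word is known; a filter, not a proof of validity."""
    return ngrams(word, n) <= table


table = build_ngram_table(["cereal", "lame", "name"])
print(passes_ngram_check("lame", table), passes_ngram_check("lqme", table))
```

With this toy table the string "nal" also passes although it is not in the word list, so the check only rules out impossible character sequences.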

21 Minimum Edit Distance: The minimum number of editing operations (insertions, deletions, substitutions, transpositions) needed to convert one string of characters into a valid word. Proposed by Damerau and Levenshtein and named after them. It is computed by dynamic programming.
Error correction approach: Find the minimum edit distance of the misspelled string from all words in the dictionary. The words with the least edit distance are the suggested corrections.
Reverse Edit Distance: The above approach requires edit-distance computations over the whole dictionary. An easier approach is reverse edit distance, where the error string is transformed by editing operations and the resulting strings are tested for validity in the dictionary.
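The reverse edit distance idea can be sketched as follows for a single edit. The lowercase Latin alphabet, the function names and the toy dictionary are assumptions; the slides do not give an implementation.

```python
import string


def one_edit_variants(s, alphabet=string.ascii_lowercase):
    """All strings reachable from s by one insertion, deletion, substitution or transposition."""
    splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | transposes | substitutes | inserts) - {s}


def reverse_edit_candidates(s, dictionary):
    """Edit the error string and keep only the results that are valid words."""
    return sorted(one_edit_variants(s) & set(dictionary))


words = {"spell", "spill", "shell", "spelt", "smell", "check"}
print(reverse_edit_candidates("spel", words))
print(reverse_edit_candidates("sepll", words))
```

The number of variants grows with the string length and alphabet size rather than with the dictionary size, which is where the saving comes from.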

22 (a) Similarity-key technique: Exploits phonetic similarity between the misspelled string and the intended word. (Aspell partially uses this approach.)
(b) Rule-based technique: Some spelling error patterns can be represented as rules. This class of techniques tries to build a kind of expert system.
(c) N-gram based techniques: Try to replace impossible bigrams or trigrams with possible ones and check whether the result is a valid word. As stated before, this has more potential for OCR error correction.
(d) Probabilistic technique: Tries to exploit Bayes' rule as well as transition probabilities and confusion probabilities.
(e) Neural nets and evolutionary computing: A multi-layer perceptron is trained on pairs of erroneous string and valid word.
(f) Word-trigram based error correction: Church and Gale (1991) and Brill and Moore (2000) used a word trigram library to choose and rank the suggested words.

23 Let G and O be a dictionary word and the typed string, respectively. If G has n characters and O has m characters, the edit distance D(i, j) is computed recursively as
$$D(i,j) = \min\left[\, D(i-1,j) + C_d(G_i),\; D(i-1,j-1) + C_s(O_j, G_i),\; D(i,j-1) + C_i(O_j) \,\right]$$
where D(0,0) is initialized to zero; $C_s$ is the substitution cost, which is 0 if $O_j = G_i$ and 1 otherwise; $C_d$ is the deletion cost (= 1); and $C_i$ is the insertion cost (= 1). D(n, m) is the minimum edit distance between O and G.
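The recurrence translates directly into a dynamic-programming table. In the sketch below the first row and column are filled with cumulative insertion and deletion costs; that boundary handling is the usual convention and an assumption here, since the slide only states D(0,0) = 0. The function name is illustrative.

```python
def edit_distance(G, O):
    """Minimum edit distance between dictionary word G and typed string O (unit costs)."""
    n, m = len(G), len(O)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + 1          # delete G_i
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + 1          # insert O_j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c_s = 0 if O[j - 1] == G[i - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i - 1][j - 1] + c_s,  # match or substitution
                          D[i][j - 1] + 1)        # insertion
    return D[n][m]


print(edit_distance("spellcheck", "spelxcheck"))
print(edit_distance("spellcheck", "spelclheck"))
```

Because this recurrence has no transposition term, swapping two adjacent characters costs 2 (two substitutions, or one deletion plus one insertion).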

24 Legend: D = Deletion, M = Match, R = Replacement, I = Insertion.

26 From the original dictionary D, a reversed dictionary D_r is formed. If COPY is a word, then YPOC is its reversed version. All words of D are reversed to get D_r, which is then alphabetically ordered. In general, with characters indexed from 0, $W_i^r(j) = W_i(L_i - 1 - j)$ for $0 \le j \le L_i - 1$, where $L_i$ is the length of word $W_i$.
Also, a 1-shifted dictionary D_1 is formed by moving the first character of each word in D to the last position. With characters indexed from 1, $W_i^1(j) = W_i(j + 1)$ for $1 \le j < L_i$, and $W_i^1(L_i) = W_i(1)$.
Similarly, a 2-shifted dictionary D_2 is formed by moving the first character of each word in D_1 to the last position: $W_i^2(j) = W_i^1(j + 1)$ for $1 \le j < L_i$, and $W_i^2(L_i) = W_i^1(1)$.

27 Dictionary                  Word 1       Word 2       Word 3
D_o  (original word)        SPELLCHECK   CORRECTION   ABSOLUTE
D_r  (reversed word)        KCEHCLLEPS   NOITCERROC   ETULOSBA
D_1  (1-character shift)    PELLCHECKS   ORRECTIONC   BSOLUTEA
D_2  (2-character shift)    LLCHECKSPE   RRECTIONCO   SOLUTEAB
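The reversed and shifted dictionaries are simple string transforms. A minimal Python sketch, with hypothetical function names and the dictionary keys `D_o`, `D_r`, `D_1`, `D_2` chosen only to mirror the notation above:

```python
def reverse_word(w):
    """Reversed form, as stored in D_r."""
    return w[::-1]


def shift_word(w, k):
    """Move the first k characters to the end, as stored in D_1 (k=1) and D_2 (k=2)."""
    if not w:
        return w
    k %= len(w)
    return w[k:] + w[:k]


def build_dictionaries(words):
    """Build the four alphabetically ordered word lists."""
    return {
        "D_o": sorted(words),
        "D_r": sorted(reverse_word(w) for w in words),
        "D_1": sorted(shift_word(w, 1) for w in words),
        "D_2": sorted(shift_word(w, 2) for w in words),
    }


dicts = build_dictionaries(["SPELLCHECK", "CORRECTION", "ABSOLUTE"])
for name, entries in dicts.items():
    print(name, entries)
```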

28 For quick search and access, the dictionaries are arranged in a trie structure. The name "trie" comes from the word "retrieval"; the structure was proposed by Edward Fredkin. A trie is a tree-like structure in which each node corresponds to a character and each branch gives the sequence of characters in a word. To economize on space, the tries for D_o, D_r, D_1 and D_2 can be combined into a single multi-trie structure.
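A minimal single-dictionary trie is sketched below; the combined multi-trie is not attempted. The class names and the `match_length` helper (how many leading characters can be followed before the search stops) are assumptions for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a dictionary word ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def match_length(self, s):
        """Number of leading characters of s that follow a path in the trie."""
        node, depth = self.root, 0
        for ch in s:
            if ch not in node.children:
                break
            node = node.children[ch]
            depth += 1
        return depth

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word


trie = Trie(["ASTRAL", "AZTEC", "CERULEAN", "CEREAL", "LAME", "NAME"])
print(trie.contains("CEREAL"), trie.contains("CERE"), trie.match_length("CERXAL"))
```

Building a second trie from the reversed words gives the D_r search in the same way.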

29 E.g., for a word list with the entries Astral, Aztec, Cerulean, Cereal, Lame, Name. [Figure: multi-trie built from a single root node; D_o edges are in blue, D_r edges in red.]

30 [Figure: (a) wrong word string S = D E M O N S P R A T I O N; (b) forward dictionary (D) search, with the point where the search stops in D marked; (c) reversed dictionary (D_r) search, with the point where the search stops in D_r marked. The span between the two stopping points is labelled the error zone.]

31 [Table: error zone length (in number of characters) against % of strings; the values did not survive in this transcript.] Error located at either end of error zone.

32 The erroneous string S is partitioned into two equal regions (1) and (2). Let their lengths be n. If (1) is error-free, find valid words W_1 in the dictionary D_o of length n, n+1, n-1. If (2) is error-free, find valid words W_2 in the reversed dictionary D_r of length n, n+1, n-1. The union of W_1 and W_2 is the list of candidate words. This approach reduces the amount of search. The method can be extended to two-position and more errors as well. (How?)
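A rough Python sketch of this single-error search uses sorted word lists and binary search in place of tries. It interprets the length condition as "candidate length within one character of the length of S", which is an assumption, as are the function names and the toy word list.

```python
from bisect import bisect_left


def prefix_matches(sorted_words, prefix):
    """All entries of an alphabetically sorted list that start with prefix."""
    out = []
    for w in sorted_words[bisect_left(sorted_words, prefix):]:
        if not w.startswith(prefix):
            break
        out.append(w)
    return out


def one_error_candidates(s, words):
    """Union of W_1 (left half error-free) and W_2 (right half error-free)."""
    d_o = sorted(words)
    d_r = sorted(w[::-1] for w in words)
    half = len(s) // 2
    left, right = s[:half], s[half:]

    def ok_len(w):
        return abs(len(w) - len(s)) <= 1

    w1 = {w for w in prefix_matches(d_o, left) if ok_len(w)}
    w2 = {w[::-1] for w in prefix_matches(d_r, right[::-1]) if ok_len(w)}
    return sorted((w1 | w2) - {s})


words = ["demonstration", "demonstrative", "remonstration", "demolition", "station"]
print(one_error_candidates("demonspration", words))
print(one_error_candidates("femonstration", words))
```

The result is a reduced candidate list rather than a final answer; each candidate would still be verified, for example with an edit-distance computation.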

33 For two errors, the string is partitioned into three regions (1), (2) and (3).
Case 1: Both errors are in regions (2) and (3). Hence region (1) is error-free. Use the original dictionary D_o to find correction candidates.
Case 2: Both errors are in regions (1) and (2). Hence region (3) is error-free. Use the reversed dictionary D_r to find correction candidates.
Case 3: One error is in (1) and the other in (3). So (2) is error-free. For this case the dictionaries D_1 and D_2 are searched with a modified input string.

34 For the word list Astral, Aztec, Cerulean, Cereal, Lame, Name. [Figure: multi-trie built from a single root node; D_o edges are in blue, D_r edges in red and D_1 edges in violet.]

35 [Figure: erroneous versions of the word ABSOLUTE, with the error characters shown as X and Z, and the actions to be taken for each: zero shift, one character moved to the right end, or two characters moved to the right end, followed by a search in D_o, D_1 or D_2.]

36 Step 1: If the current test word W exists in dictionary D, go to Step 6. Else, mark the word with the error color and continue.
Step 2: Test W in the D and D_r tries for 1-error suggestion generation. If 5 suggestions are found, go to Step 5.
Step 3: Test W in the D, D_r, D_1 and D_2 tries for 2-error suggestion generation and collect the suggestions. If any suggestions have been found, go to Step 5.
Step 4: If the suggestion list is empty, display NO SUGGESTION. Go to Step 6.
Step 5: Use phonetic similarity, keyboard neighborhood and word popularity to rank the suggestions. Display at most 5 ranked words.
Step 6: If W is the last string, EXIT. Else, take the next word from the input file and go to Step 1.

37 [Figure: keyboard layout.] Candidate key: D. First-order neighboring keys: S, F. Second-order neighboring keys: X, C and E.

38 Neighboring-key character weight (weight, normalized weight):
  1st-order neighbor: 5, 5/9
  2nd-order neighbor: 3, 3/9
  Other characters: 1, 1/9
Phonetic-similarity based weight:
  Phonetically similar characters: 3, 3/4
  Other characters: 1, 1/4
Editing-operation based weight:
  Candidate generated by substitution: 3, 3/4
  Candidate generated by insertion/deletion: 1, 1/4
Word-statistics based weights:
  a) First order: If k suggestions are generated, rank them according to their prior probability in the corpus. The top rank gets weight k (normalized as k/Σk), the next one gets weight k-1 (normalized as (k-1)/Σk), and so on, where Σk = 1 + 2 + ... + k.
  b) Second order: Conditional probability (bigram) is used to generate the weight.
These weights are linearly combined to get the score for each suggested word. The words are then ranked by this combined score and displayed.
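The combination step can be sketched as below. The slides do not give the coefficients of the linear combination, so equal coefficients are assumed; the second-order (bigram) weight is left out; and the candidate records and corpus frequencies are invented purely for illustration.

```python
from fractions import Fraction as F

# Normalized weights taken from the table above.
KEY_WEIGHT = {"first": F(5, 9), "second": F(3, 9), "other": F(1, 9)}
PHONETIC_WEIGHT = {True: F(3, 4), False: F(1, 4)}
EDIT_WEIGHT = {"substitution": F(3, 4), "insertion": F(1, 4), "deletion": F(1, 4)}


def rank_suggestions(candidates, top=5):
    """Rank candidates by the equally weighted sum of the four normalized weights."""
    by_freq = sorted(candidates, key=lambda c: c["freq"], reverse=True)
    k = len(by_freq)
    total = k * (k + 1) // 2                      # 1 + 2 + ... + k
    freq_weight = {c["word"]: F(k - r, total) for r, c in enumerate(by_freq)}
    scored = []
    for c in candidates:
        score = (KEY_WEIGHT[c["key"]] + PHONETIC_WEIGHT[c["phonetic"]]
                 + EDIT_WEIGHT[c["edit"]] + freq_weight[c["word"]])
        scored.append((score, c["word"]))
    scored.sort(key=lambda t: (-t[0], t[1]))      # highest score first
    return [word for _, word in scored[:top]]


# Hypothetical candidates for one misspelled string.
candidates = [
    {"word": "sat", "key": "first", "phonetic": False, "edit": "substitution", "freq": 50},
    {"word": "cat", "key": "second", "phonetic": False, "edit": "substitution", "freq": 90},
    {"word": "data", "key": "other", "phonetic": False, "edit": "insertion", "freq": 70},
]
print(rank_suggestions(candidates))
```

Exact fractions are used so that ties are decided by the scores themselves and not by floating-point rounding.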

39 1. Declares correct words to be erroneous (including some language-specific cases, such as the echo form "chai wei" in Hindi).
2. Detects the error, but fails to suggest alternative words.
3. Detects the error and suggests alternatives, but the top suggestions do not contain the intended word.
4. Detects the error and suggests alternatives, but they do not include the intended word.
5. Fails to detect run-on and split-word errors.
6. Fails to detect real-word errors.

40 Indian language writing systems have more characters and modifiers than English (more than double). Also, a joiner is needed to form a compound character. So the number of nodes in the dictionary tries increases, and so does the search time. To reduce the search space, similar-sounding characters can be clubbed into a single symbol. Also, a vowel and its modifier can be given a single tag. For each character there is a set of distinct characters that can follow it; this is true for compound characters as well. Such information can be stored in a table, which makes the trie traversal more efficient.

41 Useful for detecting and correcting substitution errors between similar-sounding characters. Club long and short vowels (u, U; i, I etc.) and consonants with phonetically similar utterance (r, R; n, N; s, S; J) into single entities. Re-organize this semi-phonetic dictionary into a semi-alphabetic ordering. For each phonetic word, a pointer is kept to all valid graphemic words. If the error is a purely phonetic substitution, it can be easily detected and corrected using this dictionary.
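A semi-phonetic dictionary of this kind can be sketched as a mapping from a phonetic key to the valid graphemic words. The sketch assumes a romanized transliteration in which case distinguishes the clubbed characters, as in the pairs listed above; the class table, the function names and the sample words are hypothetical.

```python
# Each character maps to the representative of its phonetic class.
PHONETIC_CLASS = {"U": "u", "I": "i", "R": "r", "N": "n", "S": "s"}


def phonetic_key(word):
    """Collapse phonetically similar characters into a single entity."""
    return "".join(PHONETIC_CLASS.get(ch, ch) for ch in word)


def build_phonetic_index(words):
    """Map each phonetic word to the list of valid graphemic words it stands for."""
    index = {}
    for w in words:
        index.setdefault(phonetic_key(w), []).append(w)
    return index


def phonetic_suggestions(s, index):
    """Valid words that differ from s only by phonetically similar substitutions."""
    return index.get(phonetic_key(s), [])


index = build_phonetic_index(["nadI", "dUr", "din", "dIn"])
print(phonetic_suggestions("nadi", index))
print(phonetic_suggestions("dIn", index))
```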

42 [Examples in Devanagari script; the text did not survive in this transcript.]

44 I invite all students, researchers and faculty members present here to work for the development of Indian language technology. More specifically, I would request you to develop basic tools such as a spell-checker, a real-word error corrector, an electronic thesaurus and a word net in Indian languages. Thank you.


More information

INF5820/INF9820 LANGUAGE TECHNOLOGICAL APPLICATIONS. Jan Tore Lønning, Lecture 8, 12 Oct

INF5820/INF9820 LANGUAGE TECHNOLOGICAL APPLICATIONS. Jan Tore Lønning, Lecture 8, 12 Oct 1 INF5820/INF9820 LANGUAGE TECHNOLOGICAL APPLICATIONS Jan Tore Lønning, Lecture 8, 12 Oct. 2016 jtl@ifi.uio.no Today 2 Preparing bitext Parameter tuning Reranking Some linguistic issues STMT so far 3 We

More information

speller.c dictionary contains valid words, one per line 1. calls load on the dictionary file

speller.c dictionary contains valid words, one per line 1. calls load on the dictionary file mispellings speller.c 1. calls load on the dictionary file dictionary contains valid words, one per line 2. calls check on each word in the text file and prints all misspelled words 3. calls size to determine

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

A Document Image Analysis System on Parallel Processors

A Document Image Analysis System on Parallel Processors A Document Image Analysis System on Parallel Processors Shamik Sural, CMC Ltd. 28 Camac Street, Calcutta 700 016, India. P.K.Das, Dept. of CSE. Jadavpur University, Calcutta 700 032, India. Abstract This

More information

Search Engines. Gertjan van Noord. September 17, 2018

Search Engines. Gertjan van Noord. September 17, 2018 Search Engines Gertjan van Noord September 17, 2018 About the course Information about the course is available from: http://www.let.rug.nl/vannoord/college/zoekmachines/ Last week Normalization (case,

More information

Computer Algorithms-2 Prof. Dr. Shashank K. Mehta Department of Computer Science and Engineering Indian Institute of Technology, Kanpur

Computer Algorithms-2 Prof. Dr. Shashank K. Mehta Department of Computer Science and Engineering Indian Institute of Technology, Kanpur Computer Algorithms-2 Prof. Dr. Shashank K. Mehta Department of Computer Science and Engineering Indian Institute of Technology, Kanpur Lecture - 6 Minimum Spanning Tree Hello. Today, we will discuss an

More information

Title bar: The top most bar in Word window that usually displays the document and software names.

Title bar: The top most bar in Word window that usually displays the document and software names. 1 MICROSOFT WORD Table of Contents LINC ONE Hiding Standard toolbar, Formatting toolbar, and Status bar: To hide the Standard toolbar, click View Toolbars on the Menu bar. Check off Standard. To hide the

More information

Khmer OCR for Limon R1 Size 22 Report

Khmer OCR for Limon R1 Size 22 Report PAN Localization Project Project No: Ref. No: PANL10n/KH/Report/phase2/002 Khmer OCR for Limon R1 Size 22 Report 09 July, 2009 Prepared by: Mr. ING LENG IENG Cambodia Country Component PAN Localization

More information

Omni Dictionary USER MANUAL ENGLISH

Omni Dictionary USER MANUAL ENGLISH Omni Dictionary USER MANUAL ENGLISH Table of contents Power and battery 3 1.1. Power source 3 1.2 Resetting the Translator 3 2. The function of keys 4 3. Start Menu 7 3.1 Menu language 8 4. Common phrases

More information

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1 Asymptotics, Recurrence and Basic Algorithms 1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1 2. O(n) 2. [1 pt] What is the solution to the recurrence T(n) = T(n/2) + n, T(1)

More information

Finite-State and the Noisy Channel Intro to NLP - J. Eisner 1

Finite-State and the Noisy Channel Intro to NLP - J. Eisner 1 Finite-State and the Noisy Channel 600.465 - Intro to NLP - J. Eisner 1 Word Segmentation x = theprophetsaidtothecity What does this say? And what other words are substrings? Could segment with parsing

More information

A Framework for Efficient Fingerprint Identification using a Minutiae Tree

A Framework for Efficient Fingerprint Identification using a Minutiae Tree A Framework for Efficient Fingerprint Identification using a Minutiae Tree Praveer Mansukhani February 22, 2008 Problem Statement Developing a real-time scalable minutiae-based indexing system using a

More information

CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS

CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS 8.1 Introduction The recognition systems developed so far were for simple characters comprising of consonants and vowels. But there is one

More information

Section A. A grammar that produces more than one parse tree for some sentences is said to be ambiguous.

Section A. A grammar that produces more than one parse tree for some sentences is said to be ambiguous. Section A 1. What do you meant by parser and its types? A parser for grammar G is a program that takes as input a string w and produces as output either a parse tree for w, if w is a sentence of G, or

More information

City, University of London Institutional Repository

City, University of London Institutional Repository City Research Online City, University of London Institutional Repository Citation: Andrienko, N., Andrienko, G., Fuchs, G., Rinzivillo, S. & Betz, H-D. (2015). Real Time Detection and Tracking of Spatial

More information

Query Difficulty Prediction for Contextual Image Retrieval

Query Difficulty Prediction for Contextual Image Retrieval Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.

More information

Opportunities and Challenges of Handwritten Sanskrit Character Recognition System

Opportunities and Challenges of Handwritten Sanskrit Character Recognition System Opportunities and Challenges of Handwritten System Shailendra Kumar Singh Research Scholar, CSE Department SLIET Longowal, Sangrur, Punjab, India Sks.it2012@gmail.com Manoj Kumar Sachan Assosiate Professor,

More information

Adding Source Code Searching Capability to Yioop

Adding Source Code Searching Capability to Yioop Adding Source Code Searching Capability to Yioop Advisor - Dr Chris Pollett Committee Members Dr Sami Khuri and Dr Teng Moh Presented by Snigdha Rao Parvatneni AGENDA Introduction Preliminary work Git

More information

In fact, in many cases, one can adequately describe [information] retrieval by simply substituting document for information.

In fact, in many cases, one can adequately describe [information] retrieval by simply substituting document for information. LµŒ.y A.( y ý ó1~.- =~ _ _}=ù _ 4.-! - @ \{=~ = / I{$ 4 ~² =}$ _ = _./ C =}d.y _ _ _ y. ~ ; ƒa y - 4 (~šƒ=.~². ~ l$ y C C. _ _ 1. INTRODUCTION IR System is viewed as a machine that indexes and selects

More information

3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 3-1. Dictionaries and Tolerant Retrieval Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Dictionary data structures for inverted indexes Sec. 3.1 The dictionary

More information

Predecessor. Predecessor Problem van Emde Boas Tries. Philip Bille

Predecessor. Predecessor Problem van Emde Boas Tries. Philip Bille Predecessor Predecessor Problem van Emde Boas Tries Philip Bille Predecessor Predecessor Problem van Emde Boas Tries Predecessors Predecessor problem. Maintain a set S U = {,..., u-} supporting predecessor(x):

More information

Vision Impairment and Computing

Vision Impairment and Computing These notes are intended to introduce the major approaches to computing for people with impaired vision. These approaches can be used singly or in combination to enable a visually impaired person to use

More information

Static Semantics. Lecture 15. (Notes by P. N. Hilfinger and R. Bodik) 2/29/08 Prof. Hilfinger, CS164 Lecture 15 1

Static Semantics. Lecture 15. (Notes by P. N. Hilfinger and R. Bodik) 2/29/08 Prof. Hilfinger, CS164 Lecture 15 1 Static Semantics Lecture 15 (Notes by P. N. Hilfinger and R. Bodik) 2/29/08 Prof. Hilfinger, CS164 Lecture 15 1 Current Status Lexical analysis Produces tokens Detects & eliminates illegal tokens Parsing

More information

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward Lexical Analysis COMP 524, Spring 2014 Bryan Ward Based in part on slides and notes by J. Erickson, S. Krishnan, B. Brandenburg, S. Olivier, A. Block and others The Big Picture Character Stream Scanner

More information

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4 Query Languages Berlin Chen 2005 Reference: 1. Modern Information Retrieval, chapter 4 Data retrieval Pattern-based querying The Kinds of Queries Retrieve docs that contains (or exactly match) the objects

More information

Spelling Corrector for Android Project

Spelling Corrector for Android Project Spelling Corrector for Android Project Introduction You are familiar with spell checkers. For most spell checkers, a candidate word is considered to be spelled correctly if it is found in a long list of

More information

Discriminative Training with Perceptron Algorithm for POS Tagging Task

Discriminative Training with Perceptron Algorithm for POS Tagging Task Discriminative Training with Perceptron Algorithm for POS Tagging Task Mahsa Yarmohammadi Center for Spoken Language Understanding Oregon Health & Science University Portland, Oregon yarmoham@ohsu.edu

More information

Parallel Sorting Algorithms

Parallel Sorting Algorithms Parallel Sorting Algorithms Ricardo Rocha and Fernando Silva Computer Science Department Faculty of Sciences University of Porto Parallel Computing 2016/2017 (Slides based on the book Parallel Programming:

More information

Using Microsoft Word. Text Tools. Spell Check

Using Microsoft Word. Text Tools. Spell Check Using Microsoft Word In addition to the editing tools covered in the previous section, Word has a number of other tools to assist in working with test documents. There are tools to help you find and correct

More information

CS 61B Summer 2005 (Porter) Midterm 2 July 21, SOLUTIONS. Do not open until told to begin

CS 61B Summer 2005 (Porter) Midterm 2 July 21, SOLUTIONS. Do not open until told to begin CS 61B Summer 2005 (Porter) Midterm 2 July 21, 2005 - SOLUTIONS Do not open until told to begin This exam is CLOSED BOOK, but you may use 1 letter-sized page of notes that you have created. Problem 0:

More information

Just Sort. Sathish Kumar Vijayakumar Chennai, India (1)

Just Sort. Sathish Kumar Vijayakumar Chennai, India (1) Just Sort Sathish Kumar Vijayakumar Chennai, India satthhishkumar@gmail.com Abstract Sorting is one of the most researched topics of Computer Science and it is one of the essential operations across computing

More information

Error annotation in adjective noun (AN) combinations

Error annotation in adjective noun (AN) combinations Error annotation in adjective noun (AN) combinations This document describes the annotation scheme devised for annotating errors in AN combinations and explains how the inter-annotator agreement has been

More information

DDS Dynamic Search Trees

DDS Dynamic Search Trees DDS Dynamic Search Trees 1 Data structures l A data structure models some abstract object. It implements a number of operations on this object, which usually can be classified into l creation and deletion

More information

Introduction to Hidden Markov models

Introduction to Hidden Markov models 1/38 Introduction to Hidden Markov models Mark Johnson Macquarie University September 17, 2014 2/38 Outline Sequence labelling Hidden Markov Models Finding the most probable label sequence Higher-order

More information

Bash command shell language interpreter

Bash command shell language interpreter Principles of Programming Languages Bash command shell language interpreter Advanced seminar topic Louis Sugy & Baptiste Thémine Presentation on December 8th, 2017 Table of contents I. General information

More information

WordPsychic. User s Manual. InvoTek, Inc Riverview Drive Alma, AR (479)

WordPsychic. User s Manual. InvoTek, Inc Riverview Drive Alma, AR (479) WordPsychic User s Manual InvoTek, Inc. 1026 Riverview Drive Alma, AR 72921 (479) 632-4166 support@invotek.org version 1.0.1 June 7, 2012 Copyright InvoTek Inc 2012 System Requirements 3 Installation 3

More information

Comparative Performance Analysis of Feature(S)- Classifier Combination for Devanagari Optical Character Recognition System

Comparative Performance Analysis of Feature(S)- Classifier Combination for Devanagari Optical Character Recognition System Comparative Performance Analysis of Feature(S)- Classifier Combination for Devanagari Optical Character Recognition System Jasbir Singh Department of Computer Science Punjabi University Patiala, India

More information

CS1100 Introduction to Programming

CS1100 Introduction to Programming Decisions with Variables CS1100 Introduction to Programming Selection Statements Madhu Mutyam Department of Computer Science and Engineering Indian Institute of Technology Madras Course Material SD, SB,

More information

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42 String Matching Pedro Ribeiro DCC/FCUP 2016/2017 Pedro Ribeiro (DCC/FCUP) String Matching 2016/2017 1 / 42 On this lecture The String Matching Problem Naive Algorithm Deterministic Finite Automata Knuth-Morris-Pratt

More information

Tree Data Structures CSC 221

Tree Data Structures CSC 221 Tree Data Structures CSC 221 BSTree Deletion - Merging template // LOOK AT THIS PARAMETER!!! void BST::deleteByMerging(BSTNode* & nodepointer) { BSTNode* temp= nodepointer;

More information

CS4442/9542b Artificial Intelligence II prof. Olga Veksler

CS4442/9542b Artificial Intelligence II prof. Olga Veksler CS4442/9542b Artificial Intelligence II prof. Olga Veksler Lecture 15 Natural Language Processing Spelling Correction Many slides from: D. Jurafsky, C. Manning Types of spelling errors Outline 1. non word

More information