Non-word Error Detection and Correction
- Maximillian Hancock
- 6 years ago
Transcription
1 Non-word Error Detection and Correction. Prof. Bidyut B. Chaudhuri, J. C. Bose Fellow & Head, CVPR Unit, Indian Statistical Institute, Kolkata
3 [Diagram] Word; mis-typing or unknown spelling; real-word error; non-word error; syntax anomaly, semantic anomaly; nonsensical situation.
4 Low level task (spell-checker): Find incorrect words (non-word errors). Suggest correct alternatives and rank them. Correct automatically or interactively.
High level task (real-word error correction): Find lexically correct but syntactically and semantically incorrect words (real-word errors). Suggest correct alternatives and rank them. Correct automatically.
Some spell-check software for English: UNIX spell, Aspell, Grope, CLR, SPEEDCOP, Spellex, etc.
5 1. Split word: when a space is wrongly inserted within a word.
2. Run-on or merged words: when the space between two or more words is not inserted.
3. Character Insertion, Deletion and Substitution (IDS): when one or more characters are substituted or deleted, or when a character is inserted in the word.
The split-word and run-on errors are to be checked before going for IDS error correction. Usually the character string is looked up in a word list (dictionary). If there is no match, the string is not a valid word, and the correction effort starts.
6 If (according to the dictionary check) two consecutive strings are non-words, we can merge them and check the merged string in the dictionary. If it is a valid word, a split-word error has been detected and corrected. If one string is not a valid word, we can check whether a portion of it (from the left side) is a valid word. If so, we check whether the rest is also a valid word. If it is, a merged-word error has been detected; it is corrected by inserting a space between the two words.
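The two checks described on this slide can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `repair_split_and_run_on`, the use of a Python set as the word list, and the choice to take the first valid split point are all assumptions.

```python
def repair_split_and_run_on(tokens, dictionary):
    """Repair split-word and run-on errors in a token list where possible.

    tokens: list of strings; dictionary: set of valid words (assumed).
    """
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Split word: two consecutive non-words whose merge is a valid word.
        if (nxt is not None and tok not in dictionary
                and nxt not in dictionary and tok + nxt in dictionary):
            out.append(tok + nxt)
            i += 2
            continue
        # Run-on: a non-word whose left part and remainder are both valid.
        if tok not in dictionary:
            for cut in range(1, len(tok)):
                left, right = tok[:cut], tok[cut:]
                if left in dictionary and right in dictionary:
                    out.extend([left, right])
                    break
            else:
                out.append(tok)  # neither repair applies; leave for IDS correction
        else:
            out.append(tok)
        i += 1
    return out
```

Tokens that survive both checks unchanged but are still non-words would then go on to IDS correction.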
7 [Flowchart] Input W. If W is present in the word list, declare it a valid word. If W is not in the word list, find correction candidates, rank the candidates, and present the user with the best 5 candidates.
8-9 Substitution: SPELLCHECK becomes SPELXCHECK.
10-12 Deletion: SPELLCHECK becomes SPELLHECK.
13-14 Insertion: SPELLCHECK becomes SPELLXCHECK.
15-16 Transposition: SPELLCHECK becomes SPELCLHECK. Substitution and transposition can each be composed of multiple insertions and deletions, which are the basic operations.
17 1. Language issue: Word morphology - degree of inflectionality. Diglossia, echo words, onomatopoeia.
2. Script issue: Alphabet size - character shape, presence of vowel modifiers and compound characters.
3. Spelling issue: Alternative spellings, standardization of spelling.
18 4. Error pattern issue: Single vs. multiple; substitution, deletion, insertion, transposition; phonetic/graphemic similarity; other tendencies.
5. Application area issue: (A) Subject based: newspaper text, official letters, notes and report preparation, technical book writing, story and novel writing. (B) Technology output based: OCR output, speech recognition output, Braille-to-text output.
19 A string of length n can have 2n+1 error/correction positions. Original word: CARBON. Odd-numbered position: substitution or deletion (single). Even-numbered position: one or more insertions. Substitution at position 11: CARBOL. Double insertion at position 10: CARBOYLN. Deletion at position 7 and insertion at position 12: CARONS. (These examples are consistent with positions numbered 0 to 12, the letters occupying the odd positions 1, 3, ..., 11 and the gaps the even positions.)
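The position scheme can be checked with a short sketch. It assumes positions run from 0 to 2n, with characters at odd positions and gaps at even positions; that numbering is inferred from the three examples on the slide rather than stated there. The function name `apply_edits` and the edit-tuple format are hypothetical.

```python
def apply_edits(word, edits):
    """Apply edits given as (position, op, text) to word.

    Positions run 0..2n (assumed numbering): odd positions hold the
    characters and accept 'sub' or 'del'; even positions are the gaps
    and accept 'ins' of one or more characters.
    """
    n = len(word)
    slots = [""] * (2 * n + 1)          # even slots are gaps, initially empty
    for k, ch in enumerate(word):
        slots[2 * k + 1] = ch           # odd slots hold the characters
    for pos, op, text in edits:
        if not 0 <= pos <= 2 * n:
            raise ValueError("position out of range")
        if pos % 2 == 1 and op == "sub":
            slots[pos] = text
        elif pos % 2 == 1 and op == "del":
            slots[pos] = ""
        elif pos % 2 == 0 and op == "ins":
            slots[pos] += text
        else:
            raise ValueError("operation not allowed at this position")
    return "".join(slots)


print(apply_edits("CARBON", [(11, "sub", "L")]))
print(apply_edits("CARBON", [(10, "ins", "YL")]))
print(apply_edits("CARBON", [(7, "del", ""), (12, "ins", "S")]))
```

The three calls reproduce the slide's examples CARBOL, CARBOYLN and CARONS.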
20 Dictionary look-up: For a word in the document, check whether it is listed in the dictionary. If yes, pass it as a valid word. Else, indicate that it is an incorrect word and provide suggestions.
N-gram: Store all possible N-grams in an N-dimensional array. For a word in the document, check whether all its N-grams are present in the array. If yes, pass it as a valid word. Else, indicate that it is an incorrect word and generate suggestions. (Useful for OCR error correction.)
Morphological analysis: It is almost impossible to build a dictionary containing all inflected words. Morphological analysis is used to strip the suffix, verify the root word and check whether the suffix agrees morphologically with the root word. If yes, the word passes as a valid one. Else, it is flagged as an incorrect word.
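The N-gram check can be illustrated with a small sketch. It swaps the N-dimensional array of the slide for a Python set of character N-grams, which is an assumption made for brevity; the names `build_ngram_table` and `passes_ngram_check` are hypothetical.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def build_ngram_table(words, n=2):
    """Collect every n-gram that occurs in the word list."""
    table = set()
    for w in words:
        table |= char_ngrams(w, n)
    return table


def passes_ngram_check(word, table, n=2):
    """True if every n-gram of word occurs somewhere in the word list."""
    return char_ngrams(word, n) <= table


table = build_ngram_table(["spell", "check"])
print(passes_ngram_check("spell", table))   # all bigrams known
print(passes_ngram_check("spxll", table))   # 'px' and 'xl' are unknown
print(passes_ngram_check("speck", table))   # passes, although not in the list
```

With a table this small, a string such as "speck" passes even though it is not in the word list, so in this sketch the check behaves as a cheap filter for impossible character sequences rather than a full dictionary test.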
21 Minimum Edit Distance: The minimum number of editing operations (insertion, deletion, substitution, transposition) needed to convert one string of characters into a valid word. It was proposed by Damerau and Levenshtein and goes under their names. It needs dynamic programming to compute.
Error correction approach: Find the minimum edit distance of the misspelled string from all words in the dictionary. Those with the least edit distance are the suggested words for correction.
Reversed Edit Distance: The above approach requires computing edit distances over the whole dictionary. An easier approach is reverse edit distance, where the error string is transformed by editing operations and the resulting strings are tested for valid words in the dictionary.
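A minimal sketch of the reverse edit distance idea for a single edit follows. It assumes a lowercase Latin alphabet and a set-based dictionary; the names `one_edit_variants` and `suggest` are hypothetical, and the slides do not specify how the variants are enumerated.

```python
import string


def one_edit_variants(s, alphabet=string.ascii_lowercase):
    """All strings reachable from s by one deletion, transposition,
    substitution or insertion (the alphabet is an assumption)."""
    splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)


def suggest(s, dictionary):
    """Dictionary words that are one edit away from the error string s."""
    return sorted(w for w in one_edit_variants(s) if w in dictionary)
```

For a string of length n over a 26-letter alphabet this generates on the order of 54n + 25 variants, each needing one dictionary look-up, which is usually far fewer operations than computing an edit distance against every dictionary word.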
22 (a) Similarity key technique: Exploits phonetic similarity between the misspelled string and the intended word. (Aspell software partially uses this approach.)
(b) Rule-based technique: Some spelling error patterns can be represented in the form of rules. This class of techniques tries to build a kind of expert system.
(c) N-gram based techniques: Try to replace impossible bigrams or trigrams by possible ones and check whether the result is a valid word. As stated before, this has more potential for OCR error correction.
(d) Probabilistic technique: Tries to exploit Bayes' rule as well as transition probability and confusion probability.
(e) Neural net and evolutionary computing: A multi-layer perceptron is trained with erroneous strings vs. valid words.
(f) Word trigram based error correction: Church and Gale (1991) and Brill and Moore (2000) used a word trigram library to choose and rank the suggested words.
23 Let G and O be a dictionary word and the typed string, respectively. If G has n characters and O has m characters, the edit distance $D(i, j)$ is computed recursively as
$$D(i,j) = \min\bigl[\, D(i-1,j) + C_d(G_i),\; D(i-1,j-1) + C_s(O_j, G_i),\; D(i,j-1) + C_i(O_j) \,\bigr]$$
where $D(0,0)$ is initialized to zero, $C_s$ is the substitution cost (0 if $O_j = G_i$, and 1 otherwise), $C_d$ is the deletion cost (equal to 1) and $C_i$ is the insertion cost (equal to 1). $D(n, m)$ is the minimum edit distance between O and G.
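The recurrence translates directly into a dynamic-programming table. The sketch below assumes the boundary values D(i,0) = i and D(0,j) = j, which the slide does not state but which follow from unit deletion and insertion costs; the function name `edit_distance` is hypothetical.

```python
def edit_distance(g, o):
    """Minimum edit distance between dictionary word g and typed string o,
    using unit costs for deletion, insertion and substitution."""
    n, m = len(g), len(o)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # delete all of g[:i] (assumed boundary)
    for j in range(1, m + 1):
        d[0][j] = j                      # insert all of o[:j] (assumed boundary)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c_s = 0 if g[i - 1] == o[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion of G_i
                          d[i - 1][j - 1] + c_s,  # match or substitution
                          d[i][j - 1] + 1)        # insertion of O_j
    return d[n][m]
```

Note that this recurrence has no transposition term, so swapping two adjacent characters costs 2 here rather than 1.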
24 [Figure legend] D: Deletion, M: Match, R: Replacement, I: Insertion
26 From the original dictionary D, a reversed dictionary $D_r$ is formed. If COPY is a word, then YPOC is its reversed version. All words of D are reversed to get $D_r$, which is alphabetically ordered. In general, $W_i^r(j) = W_i(L_i - 1 - j)$ for $0 \le j \le L_i - 1$.
Also, a 1-shifted dictionary $D_1$ is formed by shifting the first character of each word in D to the last position: $W_i^1(j) = W_i(j+1)$ for $1 \le j < L_i$, and $W_i^1(L_i) = W_i(1)$.
Similarly, a 2-shifted dictionary $D_2$ is formed by shifting the first character of each word in $D_1$ to the last position: $W_i^2(j) = W_i^1(j+1)$ for $1 \le j < L_i$, and $W_i^2(L_i) = W_i^1(1)$.
27
D_o  Original word      SPELLCHECK  CORRECTION  ABSOLUTE
D_r  Reversed word      KCEHCLLEPS  NOITCERROC  ETULOSBA
D_1  1-character shift  PELLCHECKS  ORRECTIONC  BSOLUTEA
D_2  2-character shift  ELLCHECKSP  RRECTIONCO  SOLUTEAB
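The reversed and shifted dictionaries can be generated with a few lines of code. This sketch assumes plain Python lists kept in sorted order; the names `reverse_word`, `shift_word` and `build_dictionaries` are hypothetical.

```python
def reverse_word(w):
    """Reversed version of a word, e.g. COPY -> YPOC."""
    return w[::-1]


def shift_word(w, k=1):
    """Move the first k characters of w to the end (k-shifted version)."""
    if not w:
        return w
    k %= len(w)
    return w[k:] + w[:k]


def build_dictionaries(words):
    """Return the original, reversed, 1-shifted and 2-shifted word lists,
    each sorted alphabetically."""
    return {
        "Do": sorted(words),
        "Dr": sorted(reverse_word(w) for w in words),
        "D1": sorted(shift_word(w, 1) for w in words),
        "D2": sorted(shift_word(w, 2) for w in words),
    }


for w in ["SPELLCHECK", "CORRECTION", "ABSOLUTE"]:
    print(w, reverse_word(w), shift_word(w, 1), shift_word(w, 2))
```

The loop prints the same four forms for each of the three example words on this slide.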
28 For quick search and access, the dictionaries are arranged in a trie structure. The name trie comes from the word retrieval and was coined by Edward Fredkin. A trie is a tree-like structure where each node corresponds to a character of the dictionary and each branch shows the sequence of characters in a word. To economize space, we can combine the tries for $D_o$, $D_r$, $D_1$ and $D_2$ into a single multi-trie structure.
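A single trie can be sketched with nested dictionaries. This is only an illustration of the data structure: it does not implement the combined multi-trie, and the class name `Trie`, the end-of-word marker `"$"` and the method names are assumptions.

```python
class Trie:
    """Character trie built from nested dicts; '$' marks end of word
    (assumes '$' never occurs inside a word)."""

    END = "$"

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self.END] = True

    def _walk(self, prefix):
        """Follow prefix from the root; return the node reached or None."""
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and self.END in node

    def words_with_prefix(self, prefix):
        """All stored words that start with prefix, sorted."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            for key, child in node.items():
                if key == self.END:
                    found.append(text)
                else:
                    stack.append((child, text + key))
        return sorted(found)
```

A reversed dictionary trie is obtained by inserting each word reversed, so that a prefix search in it corresponds to a suffix search over the original words.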
29 E.g. the word list has the entries Astral, Aztec, Cerulean, Cereal, Lame, Name. Here $D_o$ edges are in blue and $D_r$ edges in red. [Figure: multi-trie for this word list; node labels not recoverable.] Indian Statistical Institute. B. B. Chaudhuri, CVPR Unit
30 [Figure] (a) Wrong word string S. (b) Forward dictionary (D) search, with the point after which the search stopped in D. (c) Reversed dictionary ($D_r$) search, with the point after which the search stopped in $D_r$. The error zone is marked.
31 [Table] Error zone length (in number of characters) vs. % of strings; error located at either end of the error zone. (Table values not recoverable.)
32 [Diagram: string divided into regions (1) and (2)] The erroneous string S is partitioned into two equal regions (1) and (2). Let their lengths be n. If (1) is error-free, find valid words $W_1$ in the dictionary $D_o$ of length n, n+1, n-1. If (2) is error-free, find valid words $W_2$ in the reversed dictionary $D_r$ of length n, n+1, n-1. The union of $W_1$ and $W_2$ is the list of candidate words. This approach reduces the amount of search. The method can be extended to two-position and more errors as well. (How?)
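The single-error candidate search can be sketched without a trie, using plain prefix and suffix tests on a word list. Several choices here are assumptions: candidate lengths are taken relative to the full length of S, the middle character of an odd-length string is left out of both regions, and a suffix test on the original words stands in for a prefix search in the reversed dictionary. The function name is hypothetical.

```python
def single_error_candidates(s, words):
    """Candidate corrections for s, assuming one error lies in one half.

    W1: words sharing the first half of s (forward dictionary search).
    W2: words sharing the last half of s (equivalent to a prefix search
        in the reversed dictionary).
    """
    half = len(s) // 2
    head = s[:half]
    tail = s[len(s) - half:]
    allowed = {len(s) - 1, len(s), len(s) + 1}   # assumed length window
    w1 = {w for w in words if len(w) in allowed and w.startswith(head)}
    w2 = {w for w in words if len(w) in allowed and w.endswith(tail)}
    return w1 | w2


words = ["demonstration", "remonstration", "demolition", "station"]
print(sorted(single_error_candidates("demonspration", words)))
```

The union over-generates (here it also returns a word two edits away), so in practice the candidates would still be filtered or ranked, for example by edit distance.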
33 [Diagram: string divided into regions (1), (2) and (3)]
Case 1: Both errors are in regions (2) and (3), so region (1) is error-free. Use the original dictionary $D_o$ to find correction candidates.
Case 2: Both errors are in regions (1) and (2), so region (3) is error-free. Use the reversed word dictionary $D_r$ to find correction candidates.
Case 3: One error is in (1) and the other in (3), so (2) is error-free. For this case we need the dictionaries $D_1$ and $D_2$, applied to a modified input string.
34 For the word list Astral, Aztec, Cerulean, Cereal, Lame, Name: $D_o$ edges are in blue, $D_r$ in red and $D_1$ in violet. [Figure: multi-trie for this word list; node labels not recoverable.]
35 [Table: example with the word ABSOLUTE and error characters X and Z at its ends. Columns: input string; actions to be taken. Actions listed: zero shift; one character to right end; two characters to right end; search in $D_o$, $D_1$ or $D_2$. The row-to-action mapping is not recoverable.]
36 Step 1: If the current test word W exists in dictionary D, go to step 6. Else, paint error color on the word and continue.
Step 2: Test W in the D and $D_r$ tries for 1-error suggestion generation. If 5 suggestions are found, go to step 5.
Step 3: Test W in the D, $D_r$, $D_1$ and $D_2$ tries for 2-error suggestion generation and collect the suggestions. If any are found, go to step 5.
Step 4: If the suggestion list is empty, display NO SUGGESTION. Go to step 6.
Step 5: Use phonetic similarity, keyboard neighborhood and word popularity to rank the suggestions. Display at most 5 ranked words.
Step 6: If W is the last string, EXIT. Else, take the next word from the input file and go to step 1.
37 [Keyboard figure, with space bar] Candidate key: D. First-order neighboring keys: S, F. Second-order neighboring keys: X, C and E.
38 Neighboring key character weight:
                                            Weight  Normalized weight
1st-order neighbor                          5       5/9
2nd-order neighbor                          3       3/9
Other characters                            1       1/9
Phonetic similarity based weight:
Phonetically similar characters             3       3/4
Other characters                            1       1/4
Editing operation based weight:
Candidate generated by substitution         3       3/4
Candidate generated by insertion/deletion   1       1/4
Word statistics based weights: a) First order: If k suggestions are generated, rank them according to their prior probability in the corpus. The top rank gets weight k (normalized as $k/\sum k$), the next one gets weight k-1 (normalized as $(k-1)/\sum k$), and so on. b) Second order: Conditional probability (bigram) is employed to generate the weight.
These weights are linearly combined to get the score for each suggested word. The words are then ranked according to this combined weight and displayed.
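The weighting scheme can be sketched as a small scoring function. The normalized weights are taken from the slide; the equal coefficients in the linear combination, the omission of the second-order (bigram) weight, the candidate record format and all names are assumptions, since the slide does not give them.

```python
# Normalized weights from the slide.
NEIGHBOR_W = {"first": 5 / 9, "second": 3 / 9, "other": 1 / 9}
PHONETIC_W = {True: 3 / 4, False: 1 / 4}
EDIT_W = {"substitution": 3 / 4, "insertion": 1 / 4, "deletion": 1 / 4}


def frequency_weights(candidates):
    """First-order word statistics: rank by corpus count; the top of k
    candidates gets k / (1 + 2 + ... + k), the next (k - 1) / ..., etc."""
    k = len(candidates)
    total = k * (k + 1) // 2
    ranked = sorted(candidates, key=lambda c: c["count"], reverse=True)
    return {c["word"]: (k - r) / total for r, c in enumerate(ranked)}


def rank_suggestions(candidates, limit=5):
    """Combine the four weights with equal coefficients (an assumption)
    and return at most `limit` words, best first."""
    freq = frequency_weights(candidates)
    scored = []
    for c in candidates:
        score = (NEIGHBOR_W[c["neighbor"]] + PHONETIC_W[c["phonetic"]]
                 + EDIT_W[c["op"]] + freq[c["word"]])
        scored.append((score, c["word"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:limit]]
```

Each candidate is a dict with the hypothetical keys `word`, `neighbor` ("first", "second" or "other"), `phonetic` (bool), `op` and `count` (corpus frequency).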
39 1. Declares correct words as erroneous ones (including some language-specific cases, like the echo-form chai wei in Hindi).
2. Detects the error, but fails to suggest alternative words.
3. Detects the error and suggests alternatives, but the top suggestions do not contain the intended word.
4. Detects the error and suggests alternatives, but they do not include the intended word.
5. Fails to detect run-on and split-word errors.
6. Fails to detect real-word errors.
40 Indian language writing systems have more characters and modifiers (more than double compared with English). Also, a joiner is needed to form compound characters. So the number of nodes in the tries for the dictionary increases, and so does the search time. To reduce the search space, we can club similar-sounding characters into a single symbol. Also, a vowel and its modifier can be given a single tag. For each character there is a set of distinct characters that can follow it. This is true for compound characters as well. Such information can be stored in a table, making trie traversal more efficient.
41 Useful for detecting and correcting substitution errors between similar-sounding characters. Club long and short vowels (u, U; i, I, etc.) and consonants (r, R; n, N; s, S; J) with phonetically similar utterance into single entities. Re-organize this semi-phonetic dictionary into a semi-alphabetic ordering. A pointer is kept to all valid graphemic words for each phonetic word. If the error is a purely phonetic substitution, it can be easily detected and corrected using this dictionary.
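The semi-phonetic dictionary can be sketched as a mapping from a phonetic key to the graphemic words that share it. The sketch assumes a romanized transliteration in which upper and lower case distinguish the clubbed characters (u/U, i/I, r/R, n/N, s/S), mirroring the pairs on the slide; the class table, the function names and the sample entries are hypothetical.

```python
from collections import defaultdict

# Each clubbed character maps to one representative symbol (assumed table).
PHONETIC_CLASS = {"U": "u", "I": "i", "R": "r", "N": "n", "S": "s"}


def phonetic_key(word):
    """Replace every character by the representative of its phonetic class."""
    return "".join(PHONETIC_CLASS.get(ch, ch) for ch in word)


def build_phonetic_index(words):
    """Map each phonetic word to all valid graphemic words that share it."""
    index = defaultdict(list)
    for w in words:
        index[phonetic_key(w)].append(w)
    return index


def phonetic_suggestions(s, index):
    """Graphemic words whose phonetic form equals that of the string s."""
    return list(index.get(phonetic_key(s), []))
```

If a typed string is not a valid graphemic word but its phonetic key is present in the index, the error is a purely phonetic substitution and the stored graphemic words are the corrections.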
42 [Example in Devanagari script; text not recoverable.]
44 I invite all students, researchers and faculty members present here to work for the development of Indian language technology. More specifically, I would request you to develop basic tools like a spell-checker, real-word error corrector, electronic thesaurus and wordnet in Indian languages. Thank you.
More informationCity, University of London Institutional Repository
City Research Online City, University of London Institutional Repository Citation: Andrienko, N., Andrienko, G., Fuchs, G., Rinzivillo, S. & Betz, H-D. (2015). Real Time Detection and Tracking of Spatial
More informationQuery Difficulty Prediction for Contextual Image Retrieval
Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.
More informationOpportunities and Challenges of Handwritten Sanskrit Character Recognition System
Opportunities and Challenges of Handwritten System Shailendra Kumar Singh Research Scholar, CSE Department SLIET Longowal, Sangrur, Punjab, India Sks.it2012@gmail.com Manoj Kumar Sachan Assosiate Professor,
More informationAdding Source Code Searching Capability to Yioop
Adding Source Code Searching Capability to Yioop Advisor - Dr Chris Pollett Committee Members Dr Sami Khuri and Dr Teng Moh Presented by Snigdha Rao Parvatneni AGENDA Introduction Preliminary work Git