Corpus Collection and Topic Identification for Punjabi


Summary

The aim of this project is to create a corpus of Punjabi news topics and then create a topic identification algorithm that can be applied to the corpus. This has involved the creation of:

- a web spider to automatically collect large numbers of articles,
- a font-encoding to Unicode converter for Gurmukhi text,
- a script to convert XHTML files to TEI Lite files,
- a loss-less Unicode Gurmukhi to ASCII converter,
- a simple Punjabi word stemmer,
- a script to convert TEI Lite files to WEKA-compatible ARFF files.

This report details how an extensive set of tools was designed to create a normalised, three-million-word corpus. It also details how a topic identification algorithm was used to select features from the corpus that could be used with machine-learning tools to correctly categorise articles. Tools were developed using Java (for the web spider) and .NET C# (all other tools and scripts).

Acknowledgments

I would like to thank:

- Katja Markert, for the many meetings she has had with me, and the delays she has had to put up with!
- Eric Atwell, my assessor.
- My eight housemates, who kept me sane despite their constant distractions.

1. Contents
2. Introduction
   2.1. Objectives
   2.2. Minimum Requirements
   2.3. Deliverables
   2.4. Schedule
3. Background
   3.1. Punjabi Language (Scripts; Morphology)
   3.2. Gurmukhi Script (Alphabet; Miscellaneous Signs; Conjuncts; Unicode; Other Encodings)
   3.3. Corpora (Representativeness; Utilising Corpora; Collecting Data; Representing Data; Computing Difficulties in Creating a Corpus; Existing Tools for Punjabi)
   3.4. Text Categorisation and Topic Identification
4. Methodology
5. Creating a Corpus
   5.1. Pipeline
   5.2. Data Sources (EMILLE Corpus; TDIL ISCII Corpus; Websites; Conclusion)
   5.3. Data Collection (Designing a Web Spider; Implementation; Testing; Further Analysis and Improvements; Using the Spider)
   5.4. Unicode Conversion (The Problem; Existing Tools; Design; Implementation; Testing; Second Iteration Design; Implementation of Additional Features; Converting HTML Files to Unicode)
   5.5. Storage Format (Design; Implementation; Corpus Statistics)
Topic Identification Algorithm (Pipeline; Design; Conversion to ARFF; Stemming; Spelling Variations; Word Exclusions; Implementation)
Evaluation (Corpus; Appropriateness of Pages Collected by the Spider; Representativeness of the Corpus; Validity of TEI Files; Accuracy of Unicode Conversion; Topic Identification Algorithm)
Conclusion
References
Appendix
   A. Reflection
   B. Syllable Separation Algorithm
   C. Example Mapping File
   D. IEF Code Chart
   E. Ajit Weekly Categories
   F. TEI Sample File
   G. Punjabi Word Stop List
   H. Example ARFF File
   I. WX Notation
   J. J48 Decision Tree Results
   K. Naïve Bayes Results
   L. Ajit Weekly

A CD-ROM with tools and files created during this project is included. Read the accompanying readme.txt file for further details.

2. Introduction

The aim of this project is to collect a corpus of various Punjabi news topics, and to develop and evaluate an algorithm for automatic topic identification for Punjabi.

2.1. Objectives

The objectives of this project are:

- Studying ways of collecting and encoding Gurmukhi text.
- Investigating natural language processing techniques when applied to Punjabi.
- Learning about topic identification and adapting its use for Gurmukhi text.

2.2. Minimum Requirements

The minimum requirements for this project are:

- Background NLP work on corpora, text categorisation, Punjabi and Gurmukhi.
- A spider to collect a corpus from internet resources.
- Corpus conversion and annotation.
- A basic topic identification algorithm (based on unigrams only).
- A report evaluating the accuracy of the topic identification algorithm.

Possible enhancements:

- Expansion to cover related languages such as Hindi, Bengali and Gujarati.
- Expansion to cover other Punjabi scripts such as Shahmukhi.
- Extensive morphological pre-processing for the topic identification algorithm.

2.3. Deliverables

The deliverables for this project are:

- A corpus of Punjabi news topics. This corpus is not necessarily going to be available for redistribution.
- A topic identification algorithm.
- A report documenting the project.

2.4. Schedule

October-November 2005:  Background research
December 2005:          Initial prototyping of web spider
January 2006:           Finalise web spider
January-February 2006:  Corpus collection and annotation
March 2006:             Develop topic identification algorithm
April 2006:             Evaluate algorithm
April-May 2006:         Finalise report

A cross indicates that the deadline for the task was missed. A tick indicates that the deadline was achieved.

3. Background

3.1. Punjabi Language

The Punjabi (or Panjabi) language originates from the Punjab areas of both India and Pakistan. It is an Indo-European language spoken in all its dialects¹ by over one hundred million people (Gordon, 2005). It is the official language of the Indian state of Punjab and is the language most widely spoken in the Punjab province of Pakistan, even though it has no official state patronage there. Uniquely for an Indo-Aryan language, Punjabi is tonal in nature, with a high, level and low tone (Grierson, 1927). It is similar to, and shares much grammar with, the neighbouring Hindustani language (the vernacular form of Hindi and Urdu).

Scripts

Punjabi can be written in a number of different scripts. The eastern variant, or Indian Punjabi, is written in the Gurmukhi script. The western variant, or Pakistani Punjabi, is written in the Nasta'liq style of the Perso-Arabic alphabet, known as Shahmukhi. Punjabi may also be written in Devanagari or even Latin:

Gurmukhi:   ਪੰਜਾਬੀ
Devanagari: पंजाबी
Shahmukhi:  پنجابی
Latin:      Panjābī

This project will concentrate on Punjabi text in the Gurmukhi script.

Morphology

Morphology is the study of the structure of words in a language and how they are altered and created. Punjabi is an agglutinative language and words are derived by adding affixes to words (Bhatia, 1993). By studying the morphology of words and by using stemming and lemmatisation, it is easier to analyse and annotate text.

Lemmatisation is used to reduce a set of words down to their lexeme or root. It enables more efficient processing of text because words such as kicks, kicked and kicking are all reduced to a single word: kick. Algorithms are then applied directly to the lexeme, which may give more accurate results than applying them to each individual word.

Like other Indian languages, Punjabi assigns gender to inanimate objects in addition to treating males as masculine and females as feminine.

¹ This includes both Eastern and Western Punjabi (Lahnda), and Siraiki.

There are no rules to determine whether inanimate objects are feminine or masculine. However, most nouns ending in ੀ [ī] are feminine and most nouns ending in ਾ [ā] are masculine. There are exceptions to this rule which can only be determined through experience or by using a dictionary.

Plurals are formed by changing the ending of words. Masculine nouns ending in ਾ [ā] change into ੇ [ē] to form the plural. Feminine nouns ending in [a] or ੀ [ī] change to ਾਂ [āṁ], and those ending in ਾ [ā] change to ਵਾਂ [vāṁ] (Kalra, 2003).

There are two forms of adjectives: variable and invariable. Variable adjectives end in ਾ [ā] for masculine nouns or ੀ [ī] for feminine nouns. They inflect based on the noun type, although this is not always the case (Kalra, 2003). For example, the root ਵੱਡ [vaḍḍ], meaning big, has the following forms:

Masculine singular: ਵੱਡਾ [vaḍḍā]
Masculine plural:   ਵੱਡੇ [vaḍḍē]
Feminine singular:  ਵੱਡੀ [vaḍḍī]
Feminine plural:    ਵੱਡੀਆਂ [vaḍḍīāṁ]

Prefixes are used predominantly on nouns, and suffixes are used on both nouns and verbs (Bhatia, 1993). For example, the word ਬੰਸ [baṃs], meaning family, can be converted to mean the whole family by adding the prefix ਸਰ- [sar-], meaning whole:

ਸਰ- [sar-] (whole) + ਬੰਸ [baṃs] (family) → ਸਰਬੰਸ [sarbaṃs] (whole family)

Words can be formed by using both prefixes and suffixes:

ਖ਼ਬਰ [ḵẖabar] (news) + -ੀ [-ī] (-ness) → ਖ਼ਬਰੀ [ḵẖabarī] (awareness)
ਬੇ- [bē-] (un-) + ਖ਼ਬਰੀ [ḵẖabarī] (awareness) → ਬੇਖ਼ਬਰੀ [bēḵẖabarī] (unawareness, ignorance)

Negation is indicated by using prefixes. The prefixes ਬੇ- [bē-], ਨਾ- [nā-], ਅ- [a-] and ਅੰ- [aṃ-] are all used, depending on the form and origin of the word.

Verbs can be formed from nouns by adding suffixes such as -ਨਾ [-nā] and -ਣਾ [-ṇā]. For example:

ਬੋਲ [bōl] (word) + -ਣਾ [-ṇā] → ਬੋਲਣਾ [bōlṇā] (to speak)
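The report later describes building a simple Punjabi word stemmer; the inflections above suggest how one could work. The following is a minimal sketch, not the project's actual implementation: a longest-match suffix stripper whose suffix list is illustrative only (drawn from the endings shown above) and whose class name is hypothetical.

```java
import java.util.List;

// Sketch of a longest-match suffix-stripping stemmer for Gurmukhi.
// The suffix list is illustrative, taken from the inflections in the
// text; a usable stemmer would need a much fuller list.
public class GurmukhiStemmer {
    // Longest suffixes first, so -īāṁ is stripped before -āṁ or -ī.
    private static final List<String> SUFFIXES = List.of(
            "\u0A40\u0A06\u0A02", // ੀਆਂ [-īāṁ] feminine plural
            "\u0A3E\u0A02",       // ਾਂ  [-āṁ]
            "\u0A47",             // ੇ   [-ē]  masculine plural
            "\u0A40",             // ੀ   [-ī]  feminine singular
            "\u0A3E");            // ਾ   [-ā]  masculine singular

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Only strip when something is left after removal.
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }
}
```

Under this scheme all four adjective forms of ਵੱਡ [vaḍḍ] above reduce to the same root, which is exactly the collapsing effect stemming is meant to achieve.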

3.2. Gurmukhi Script

The Gurmukhi script is the main script used to write the Punjabi language and is the only script extensively taught for the sole purpose of writing Punjabi. It is the official script of the Indian state of Punjab and is the basis for work on this project.

The Gurmukhi script is an abugida that descended from the Brahmi script of ancient India (Kalra, 2003). The script was standardised by the second Sikh guru, Angad Dev, in the sixteenth century. The name Gurmukhi literally means "from the mouth of the Guru" (Gill, 1996).

An abugida is a script composed of a series of consonants which include an inherent vowel. In the case of Gurmukhi, the inherent vowel is altered using diacritics known as vowel signs. As such, the basic characters represent syllables and not consonants. For example, in the Latin script, the letter K simply represents the consonant K. In Gurmukhi there is no basic sign for the letter K. Instead there is the letter ਕ [ka], which can be modified into other syllables by attaching a vowel sign such as ੂ [ū] to become ਕੂ [kū], or ੀ [ī] to become ਕੀ [kī]. The inherent vowel, as indicated above, is [a] and is not pronounced at the end of a word.

Alphabet

Gurmukhi consists of thirty-five distinct characters, of which three are vowel sign bearers (Gill, 1996). Gurmukhi characters do not have a one-to-one mapping with the standard Latin alphabet, and therefore a special transliteration scheme is used. The selected transliteration scheme complies with the conventions recommended in ISO 15919:2001 (Stone, 2004). However, for reasons of clarity, the final unpronounced [a] is not transliterated.

ੳ (-)    ਅ [a]    ੲ (-)    ਸ [sa]   ਹ [ha]
ਕ [ka]   ਖ [kha]  ਗ [ga]   ਘ [gha]  ਙ [ṅa]
ਚ [ca]   ਛ [cha]  ਜ [ja]   ਝ [jha]  ਞ [ña]
ਟ [ṭa]   ਠ [ṭha]  ਡ [ḍa]   ਢ [ḍha]  ਣ [ṇa]
ਤ [ta]   ਥ [tha]  ਦ [da]   ਧ [dha]  ਨ [na]
ਪ [pa]   ਫ [pha]  ਬ [ba]   ਭ [bha]  ਮ [ma]
ਯ [ya]   ਰ [ra]   ਲ [la]   ਵ [va]   ੜ [ṛa]

The following letters are used to represent sounds not present in the Punjabi language and are used in loan words (Kalra, 2003). These are created by placing a Pairin Bindi (literally, "with a dot in the foot") onto existing, similar-sounding consonants (Unicode Consortium, 2003).

ਸ਼ [śa]  ਕ਼ [qa]  ਖ਼ [ḵẖa]  ਗ਼ [ġa]  ਜ਼ [za]  ਫ਼ [fa]  ਲ਼ [ḻa]

There are two forms of vowels used in Gurmukhi: independent vowels and dependent vowel signs. Independent vowels are constructed using the first three characters of the alphabet, known as vowel bearers. With the exception of ਅ [a], they do not represent anything on their own and cannot be used without additional vowel signs. They are used to represent vowel sounds where using a vowel sign is not suitable, such as at the beginning of a word. In cases where the inherent vowel needs to be altered (for example, from [a] to [ī]), a vowel sign is attached to the consonant. The following table lists the Punjabi name, independent form, dependent form and transliteration for each vowel:

Muktā      ਅ   (none)  [a]
Kannā      ਆ   ਾ       [ā]
Sihārī     ਇ   ਿ       [i]
Bihārī     ਈ   ੀ       [ī]
Auṅkaṛ     ਉ   ੁ       [u]
Dulaiṅkaṛ  ਊ   ੂ       [ū]
Lānvāṁ     ਏ   ੇ       [ē]
Dulānvāṁ   ਐ   ੈ       [ai]
Hōṛā       ਓ   ੋ       [ō]
Kanauṛā    ਔ   ੌ       [au]

In addition to the consonants and vowel signs, there are several special characters used in Gurmukhi.

Miscellaneous Signs

Gurmukhi uses two signs to indicate nasalisation: Tippi ੰ [ṃ] and Bindi ਂ [ṁ]. Essentially these two signs have the same function but are used in different settings. Tippi is used with the inherent vowel [a], the independent and dependent forms of [i], and the dependent forms of [u] and [ū]. Bindi is used in the other cases (Gill, 1996). The sound often used in pronouncing Tippi and Bindi is similar to the n in the English suffix "ing".

The Adhak sign ੱ is placed before a consonant to indicate that it is a geminate (reinforced or doubled). The effect is thus:

ਗਡੀ [gaḍī]  →  ਗੱਡੀ [gaḍḍī]
ਸਿਖੀ [sikhī]  →  ਸਿੱਖੀ [sikkhī]

Conjuncts

Conjuncts are used to represent consonant clusters, i.e. groupings of more than one consonant. There are four main conjuncts used in modern Gurmukhi, of which the first two are by far the most common:

ਰ [ra]  →  ੍ਰ
ਹ [ha]  →  ੍ਹ
ਵ [va]  →  ੍ਵ
ਯ [ya]  →  ੍ਯ

They attach to existing consonants and replace the inherent vowel. For example:

ਸਰੀ [sarī]  →  ਸ੍ਰੀ [srī]

For other consonant clusters, it is left to the reader to determine when the inherent vowel is dropped. In some specialised situations, such as a dictionary, a sign known as Halant or Virama is used to explicitly kill the vowel. Whether ਸ੍ਰੀ [srī] is written with a visible Halant or with a subjoined consonant, the two renderings are identical in terms of pronunciation and meaning. The use of the Halant or Virama is a concept borrowed from other Indic scripts and is only rarely seen in Gurmukhi.

Unicode

The Unicode Standard 4.0 includes seventy-seven code points for Gurmukhi in the Gurmukhi block, from U+0A00 to U+0A7F. This includes all the characters mentioned above, in addition to some archaic and specialised characters. Gurmukhi also uses the characters Danda (U+0964) and Double Danda (U+0965) from the Devanagari block to delimit sentences.

As encoded in Unicode, Gurmukhi is classified as a complex script and requires special rendering technology to be usable. Gurmukhi is supported by the main modern text rendering engines, including Uniscribe (Microsoft Windows), ICU (Linux and other software) and ATSUI (Mac OS X). Unicode requires Gurmukhi text to be encoded in logical as opposed to visual order. It is then rearranged by rendering software to appear in the correct order.
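The block layout described above is easy to test programmatically. The following sketch (class name hypothetical) checks whether a code point falls in the Gurmukhi block; Java's standard library exposes the same information through Character.UnicodeBlock.

```java
// Sketch: range test for the Gurmukhi block (U+0A00..U+0A7F).
// The Danda (U+0964) deliberately falls outside it, since it is
// borrowed from the Devanagari block, as noted in the text.
public class GurmukhiBlock {
    public static boolean isGurmukhi(int codePoint) {
        return codePoint >= 0x0A00 && codePoint <= 0x0A7F;
    }
}
```

A check like this is useful when validating converted text: every character of a cleaned Gurmukhi article should either pass this test or be whitespace, punctuation, or one of the Devanagari dandas.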

Logical order versus visual order:

ਕ [ka] (U+0A15) + ਿ [i] (U+0A3F)  →  ਕਿ [ki]

Unicode does not assign separate code points for conjoined consonants. Instead it uses a special character known as Virama to induce conjoined-consonant behaviour:

ਸ [sa] (U+0A38) + ਰ [ra] (U+0A30) + ੀ [ī] (U+0A40)  →  ਸਰੀ [sarī]
ਸ [sa] (U+0A38) + ੍ Virama (U+0A4D) + ਰ [ra] (U+0A30) + ੀ [ī] (U+0A40)  →  ਸ੍ਰੀ [srī]

Finally, although Ura, Aira and Iri are encoded in the Gurmukhi sub-range, using these characters with vowel signs to make independent vowels is not recommended. The pre-composed independent vowels that are encoded should be used instead.

Other Encodings

Although Unicode is now the de facto standard for storing text on modern computers, its use for Gurmukhi is limited (although constantly increasing). In the past, the ISCII encoding scheme was used for Gurmukhi, and it formed the basis for Gurmukhi support in Unicode. Conversion from ISCII to Unicode is relatively straightforward due to the similarities in encoding.

The predominant encoding for Gurmukhi is the use of fonts such as AnmolLipi and DrChatrikWeb. These function by masking ASCII characters so they appear as Gurmukhi characters, and each font has a different mapping. For example, under the AnmolLipi font, the Latin character A would appear as ਅ. Conversely, under the DrChatrikWeb font it would appear as ੳ. These fonts are used because they require no complex rendering (rendering support for Unicode Gurmukhi was scant until recently). This project aims to standardise on Unicode because it is an internationally recognised standard that is supported by a large number of operating systems.
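The font-masking scheme implies that conversion to Unicode is, at its core, a per-font lookup table. The sketch below uses only the two mappings cited above (the Latin A under each font); it is not the project's Metamorph tool. A real converter needs a complete table per font, handling of multi-character sequences, and reordering logic for pre-base vowel signs such as Sihari.

```java
import java.util.Map;

// Sketch of table-driven font-encoding conversion. Only the two
// mappings cited in the text are included; real fonts remap nearly
// every ASCII character.
public class FontConverter {
    static final Map<Character, Character> ANMOLLIPI = Map.of('A', '\u0A05');    // A -> ਅ
    static final Map<Character, Character> DRCHATRIKWEB = Map.of('A', '\u0A73'); // A -> ੳ

    public static String toUnicode(String text, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            out.append(table.getOrDefault(c, c)); // unmapped characters pass through
        }
        return out.toString();
    }
}
```

The same source string therefore converts differently depending on which font's table is applied, which is precisely why per-font mapping files are needed.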

3.3. Corpora

In its simplest form, a corpus (from Latin, meaning "body") is a collection of more than one text (McEnery, Wilson, 2001). It can comprise both monolingual and multilingual content. Corpora may be either annotated or unannotated. An annotated corpus includes metadata (information about the data), whereas an unannotated corpus contains merely the raw text. Some multilingual corpora are formatted so that related words are aligned throughout different sets of languages. These are known as parallel aligned corpora and fall under the umbrella of annotated corpora.

Representativeness

In creating a corpus, it is necessary for it to be representative of the language or the task at hand. It is preferable that a broad sample of text is used so that the final corpus accurately depicts the variety in the language. Obviously, not all utterances or acceptable sequences of text will be present, but a corpus based on representative samples is likely to have a wide and balanced mix of language. However, the use of corpora is countered by linguists such as Noam Chomsky, who comment that they can never be fully representative of real language (McEnery, Wilson, 2001). Chomsky believed that, by their very nature, corpora can never explain the inventiveness of language; many perfectly valid sentences will never be included while some invalid sentences will.

To fulfil the requirements of this project, a monolingual sample of news articles collected using a web spider will be required. In addition, if suitable text is found in existing corpora, it should be used too. To be representative, the articles must be modern and from differing authors.

Utilising Corpora

Corpora provide an ideal body of text that linguists can use to analyse the nuances of human language. Electronic corpora (now almost exclusively the only corpora in widespread use) can be analysed extensively by computer programs.
Corpora are relevant because they allow linguists (and others interested in language) to analyse a vast body of text using complicated algorithms, without having to conduct real-life surveys. A corpus can be used to provide large amounts of information on a language's grammar, spelling, conventions and morphology. Complex algorithms can be applied to extract higher-level information about the content or meaning of the text, rather than statistics on the text itself. This project will be utilising data from a corpus to categorise text.

Collecting Data

Traditionally, corpus data was either collated by hand or typed up for the specific purpose of including it in a corpus. Nowadays, corpus creation is considerably more automated. The use of the internet has allowed extensive corpora to be created with considerably less effort.

In the context of major world languages such as English, Russian, Arabic and Chinese, simple unannotated corpora can be collected with comparative ease because there are standardised encodings for the scripts used. This is not the case for Punjabi. In addition to the many scripts used, there are also multiple encodings for the Gurmukhi script. Until recently, the only way of representing Gurmukhi text on web pages was by using fonts that mask ASCII characters and show them as Gurmukhi characters. Different fonts have different mappings to ASCII characters, and no search engine indexes this as Gurmukhi text. Since the introduction of Windows XP with its Unicode support for Gurmukhi, increasing numbers of websites using Unicode have appeared. However, their presence is minuscule compared to the vast array of websites encoded using proprietary fonts. The creators of the EMILLE corpus encountered problems with the unavailability of standardised text and with considerable amounts of text in images (Singh, 2000). Fortunately, the use of images for Gurmukhi text has largely subsided in recent years.

Representing Data

In the past, different corpora would use different storage formats. In recent years there has been a move to standardise the formats used for corpora, spearheaded by the Text Encoding Initiative (TEI). The TEI uses SGML (Standard Generalised Markup Language) and, increasingly, XML (Extensible Markup Language). It provides DTDs which can be used to validate the conformity of the markup used. The TEI markup is quite complex when completely implemented. However, a stripped-down version known as TEI Lite P4 is available for use.

A subset of TEI Lite XML is likely to be used to store any corpus data collected for this project because it is simple to use, and the more advanced features of the standard TEI format are not needed.

Computing Difficulties in Creating a Corpus

Creating a large corpus can be very technically challenging. There is the initial need to find sources of data (websites, newspapers, books or speech), and then to automate a way to collect it all. In the case of extracting data from websites, a web robot or spider is required. Once the data has been collected, it must be cleaned (removing non-body text such as navigation elements or advertisements), converted to Unicode (or another appropriate encoding), and then the required text needs to be extracted and placed in a TEI file. It is relatively simple to do such tasks by hand, but when collecting a corpus of thousands or even hundreds of pages of text, it is simply too time-consuming to be economical.

Existing Tools for Punjabi

Existing corpora for Punjabi are limited. There exists an ISCII-encoded, unannotated corpus (created by the Indian Ministry of Information Technology) which has approximately three million words.² It includes a varied array of topics, including some news articles. The EMILLE corpus, created by Lancaster University, includes over fifteen million Punjabi words from a variety of sources, all encoded in Unicode. Of these, three million are in the Shahmukhi script, which makes them unsuitable for this project. Both of these corpora may be suitable for use in experimenting with topic identification. Their suitability will be discussed later.

The International Institute of Information Technology, India,³ has created a morphological analyser for Punjabi that runs on Unix-based systems. It takes a list of Punjabi words in Roman WX notation and returns the root word with additional morphological information. Roman WX has a one-to-one relationship with ISCII and should not present a problem in terms of converting encodings.

There are a number of high-quality multi- and monolingual Punjabi dictionaries available in print form. However, the presence of dictionaries of any quality in electronic form is limited. There are a number of basic dictionaries (or, more appropriately, word lists) available online, but there is nothing that compares to the breadth and quality of print dictionaries.

² This figure was calculated by estimating the number of words in the 14 MB corpus from a sample word count in a 200 KB file.
³ Available from

3.4. Text Categorisation and Topic Identification

Automatic text categorisation is the process of assigning a document to one or more categories based on its contents. It is a topic in natural language processing that has seen a significant increase in interest.

There are several approaches that are commonly employed when categorising text. One such approach uses decision trees. This method gives weightings to words and the number of times they occur, and then uses the weightings and a decision tree to categorise documents. This is a simple approach that can be highly effective (Manning, Schütze, 2000).

Maximum entropy is a technique used to calculate probability distributions from data. This can be applied to text categorisation and is an area of active research. Much like the use of decision trees, a maximum entropy model is applied to a document whose words have been weighted. A set of pre-tagged training data is used to estimate suitable constraints. These constraints are then used to estimate the probability of untagged data belonging to a particular category (Nigam et al., 1999). Maximum entropy modelling works by observing the features of the training data and formulating constraints based on the original data. It assumes nothing about the unknown data and aims to formulate a model which is uniform and factually consistent (Berger, 1996).

The k-nearest-neighbour classification is a relatively simple method used in natural language processing. It classifies a document by finding the most similar document (the nearest neighbour) in a training set and assigning its category to the new document (Manning, Schütze, 2000).

All the machine learning algorithms described above require some sort of pre-processing before text can be categorised. It is this pre-processing, or feature selection, that is crucial in correctly categorising text. Simply using all the words in the differing documents would not only be less accurate, but painfully slow.
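The nearest-neighbour idea can be sketched in a few lines. This toy version (all names hypothetical) uses raw word overlap as the similarity measure; practical systems instead weight terms and use a measure such as cosine similarity over weighted vectors.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy 1-nearest-neighbour classifier: assign the category of the
// training document that shares the most words with the input document.
public class NearestNeighbour {
    public static String classify(Set<String> doc, Map<Set<String>, String> training) {
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<Set<String>, String> entry : training.entrySet()) {
            Set<String> overlap = new HashSet<>(doc);
            overlap.retainAll(entry.getKey()); // words shared with this training document
            if (overlap.size() > bestOverlap) {
                bestOverlap = overlap.size();
                best = entry.getValue();
            }
        }
        return best;
    }
}
```

Even this crude similarity measure illustrates why feature selection matters: without stop-word removal, function words shared by every document would dominate the overlap counts.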
Techniques such as stemming (reducing all words to their root form), stop-lists (removing common words) and spelling unification (treating similarly spelled words as one) can all increase the accuracy of a machine learning algorithm that is attempting to categorise a document. There are also different ways that terms can be weighted. Some text categorisers may use complex algorithms to weight words, whilst others may just indicate whether a word is present or not in a particular document.

WEKA, created by the University of Waikato, implements many different machine learning algorithms (Witten, Frank, 2005). It is often used in text categorisation work to test the effectiveness of different algorithms.

4. Methodology

A software development methodology is required to standardise the development of any tools for this project. Three methodologies were evaluated for use (Bennett et al., 2002):

- Waterfall: the traditional life cycle with requirements analysis, design, construction, testing and installation stages.
- Waterfall with iteration: as above, but with the ability to return to any previous stage and alter subsequent stages as a result of changes in the development process.
- Unified software development process (USDP): comprises four distinct phases: inception, elaboration, construction and transition. Different workflows are concentrated on in different phases, but the workflows are constantly updated as time passes (Kruchten, 2001).

None of the methodologies listed is entirely satisfactory for this particular project. They are all geared towards the creation of an information system solution as their main product. The aim of this project is primarily to create a corpus and secondarily to implement a topic identification algorithm. As such, it was decided that the waterfall model with iteration (when required) would be the most appropriate method for the overall project. It allows a structured development process to occur, with the advantage of being able to go back if required and enhance or change the solution.

5. Creating a Corpus

5.1. Pipeline

Converting the bare HTML files to TEI Lite requires several individual, but automated, steps. Together they form a pipeline of processes, each stage producing the input for the next:

1. Web Spider: font-based HTML file (plus an information file)
2. HTML Tidy: font-based XHTML file
3. Metamorph: Unicode XHTML file
4. TEI Script: XML TEI Lite corpus file

The steps involved in each stage will be explained later.

5.2. Data Sources

As discussed previously, a corpus of categorised Punjabi texts will be created so that a topic identification algorithm can be applied to it later. In addition, existing data from two corpora will be evaluated:

- the EMILLE corpus,
- the TDIL ISCII corpus.

These will be evaluated to ensure they are encoded accurately, contain news articles and have the appropriate categories required. Websites will also be reviewed to see which are the most suitable to extract data from for our corpus.

EMILLE Corpus

The EMILLE corpus was initially thought to be a good base to extract data from because it contained several news articles in Punjabi. However, a further examination revealed the following problems:

- No articles were appropriately tagged with a genre or topic, which is necessary for the topic identification component of this project.
- Articles were not properly separated. Instead, large groups of articles were all stored as one big block of text, so it would have been difficult to split them up.
- There were huge inconsistencies and errors in the Unicode encoding.

By far the predominant reason that EMILLE was not used was the inconsistencies in the text encoding. In one sample, dotted circles indicate major errors in the Unicode text caused by an incorrect conversion from the source text: a particular consonant has consistently been mis-converted, and there has been no rearrangement of the components of a syllable (see section 3.2). In another sample, the character Tippi is always encoded as a Bindi, and geminate consonants are not encoded using Adhak but instead as consonant clusters using Virama; these errors are likely to stem from an incorrect conversion from ISCII to Unicode. A third sample shows considerable corruption that occurred when converting the text to Unicode.

TDIL ISCII Corpus

The ISCII-encoded corpus from TDIL was inappropriate because it was by no means a predominantly news-based corpus, nor did it have any markup on the text to indicate genre or topic. There were only a few news pages, which rendered it unsuitable for use in the topic identification portion of this project.

Websites

The internet has emerged as a phenomenal international corpus with a wide variety of data. There were several possible choices available for use. The main features looked for were:

- a large news archive,
- news categorisation,
- separated news articles,
- a print preview ability (which simplifies the removal of superfluous elements such as navigation bars).

A few well-known news sites were evaluated:

Ajit Weekly (DrChatrikWeb)
  Advantages: pages categorised into forty categories; very large news archives; separated news articles; print preview version.

Ajit Jalandhar (Satluj)
  Advantages: large news archives; pages categorised into geographic locations.
  Disadvantages: encoded in Satluj font; no print preview version.

Quami Ekta (DrChatrikWeb and Unicode)
  Advantages: some pages encoded in Unicode; most pages are categorised; separated news articles; print preview version.
  Disadvantages: small collection of news.

Sanjh Savera (DrChatrikWeb)
  Advantages: large news collection.
  Disadvantages: no print preview version; no categorisation of text.

Amritsar Times (DrChatrikWeb)
  Advantages: large news collection.
  Disadvantages: no print preview version; no categorisation of text; no uniform design aspect to facilitate data extraction.

5abi.com (Unicode)
  Advantages: most pages encoded in Unicode.
  Disadvantages: no print preview version; no categorisation of text.

The most suitable choices were Quami Ekta and Ajit Weekly. Quami Ekta was considered inappropriate because it did not contain anywhere near as many articles as Ajit Weekly, nor were its articles categorised in a way that would enable easy unification with the Ajit Weekly categories.

Conclusion

It was a rather clear-cut decision that the only way to create a suitable corpus for this project was to collect text from a large news website. The EMILLE corpus had deficiencies in encoding, article separation and categorisation. The TDIL corpus did not comprise the appropriate content matter required for this project. The Ajit Weekly website was selected as a suitable site from which data could be extracted, for the reasons stated above.

5.3. Data Collection

To facilitate data collection, it would be necessary to automate the collection of web pages. Creating a large corpus manually is not feasible because of the considerable amount of time it would take. There are several web spiders or web crawlers available online. However, most are geared towards generating search indexes. The ones geared towards corpora work, such as WebCorp (RDUES, 2005) and BootCaT (Baroni et al., 2005), were not suitable because they concentrated on random web page collection based on Google keywords. They also did not have features to restrict saved pages based on URL parameters. It was decided that, for maximum customisability, the best option would be to create a web spider suitable for this task.

Designing a Web Spider

Two separate development languages were considered: Java and C#. Both are very similar syntactically, and many of the class sets also have similarities. Both have high-level classes that simplify downloading using the HTTP protocol. Although there was no clear-cut superior language for the task, it was decided that Java would be the most appropriate language because of its cross-platform support and large user base.

There are two types of web spider that could be used: a depth-first spider or a breadth-first spider (Eddy and Haasch, 1996). The diagram below illustrates how a depth-first recursive spidering algorithm operates for a depth of 3 (0 being the root).

[Diagram: depth-first traversal of a site rooted at index.html, with pages step1.html, step2.html, step3.html, about.html and extra.html visited in the numbered order 1 to 8.]

1. Download and parse the first link in the root.
2. Download and parse the first link in child 1.
3. Download the first link in child 2.
4. Download the next link in child 2. In this case, index.html has already been downloaded, so download the next link after that.
5. Download the next link in child 2. In this case, all links have been downloaded.
6. Download and parse the next link in child 1. In this case, all links have been downloaded.
7. Download and parse the next link in the root. In this case, about.html has already been downloaded, so download the next link after that.
8. Continue.

The diagram below illustrates a breadth-first implementation, where the spider downloads all linked pages on a web page at a particular depth before moving on to the next depth level:

[Diagram: a breadth-first traversal over index.html, step1.html, step2.html, step3.html, about.html, extra.html and control.html, retrieving all links at each depth level in turn.]

In this diagram one can see how all links are retrieved at the next depth (unless the file has already been downloaded). Both a breadth-first and a depth-first implementation of the web spider will be created. The one which produces the most appropriate results will be used to collect the corpus. The web spider should save any web pages it retrieves, plus an additional information file containing the source URL and the date and time the page was retrieved.

Implementation

There are four separate classes required:

Main: Holds the entry function and joins all other classes together.
Downloader: Contains the download engine to facilitate the collection of web pages via HTTP.
Spider: The spidering algorithm itself.
Webpage: Contains a particular web page with additional metadata.

The methods for each of these classes are detailed below:

Webpage methods:

void addChild(Webpage) - Add the specified web page as a child of the current web page.
Webpage getChild(int) - Retrieve a child web page using its index number.
void removeChild(int); void removeChild(Webpage) - Remove a child web page using its index number or directly using the object reference.
int getChildCount() - Get the number of child web pages.
String getContent(); setContent(String) - Get/set the HTML text of the web page in string format.
byte[] getContentBytes(); setContentBytes(byte[]) - Get/set the web page as a byte array.
int getDepth(); setDepth(int) - Get/set the depth of the current web page.
String getURL(); setURL(String) - Get/set the URL of the current web page.
void savePage(String) - Save the current byte version of the web page to disk.

Downloader methods:

byte[] downloadPage(String) - Downloads the specified web page and returns it as a byte array.
String[] extractURLs(String) - Extracts all the URLs contained within the specified HTML text and returns them as a string array.

Spider methods:

void startBreadthSpider(String, String, int) - Initiates the breadth-first spider using a source URL, a save-to path and a maximum depth limit.
void startDepthSpider(String, String, int) - Initiates the depth-first spider using a source URL, a save-to path and a maximum depth limit.

To address unforeseen issues, it may be necessary to add different methods or alter the methods listed above. However, the general structure will remain the same.

Testing

The implementation of the depth-first spider was evaluated using various test sites. The depth-first spider was highly effective at downloading web pages, but it suffered from one major flaw. If a web page was a leaf node due to it being at the maximum depth for that branch, it would be

flagged as having been downloaded. Thus, if the spider encountered this page again further up the tree, it would not search its contents, because it had already been flagged as downloaded and processed. This issue was alleviated by also recording the depth at which a web page was first downloaded. If that depth was the maximum depth and the web page was re-encountered further up the tree, it would be parsed again for URLs but would not be saved to disk again.

Further testing revealed another problem with this implementation. If a page had been flagged as downloaded close to a leaf node and was then encountered further up the tree, it would not be processed, because its child nodes had already been processed. However, when it occurs further up the tree, there may be more child nodes below it to spider. This issue could have been alleviated by forcing the spider to continually retrieve all pages in its node list, even if they had previously been spidered. However, this would lead to an exponential increase in the number of pages downloaded or checked. Alternatively, the pages could be kept in memory and re-parsed.

The breadth-first implementation is more effective because it ensures all pages are parsed to their maximum depth. For example, in the breadth-first diagram above, the file step2.html contains a link to about.html. This will not be downloaded because about.html has already been parsed. However, the children of about.html will still be spidered to the maximum depth. In the depth-first approach, about.html would have been downloaded later on in the tree, which would have resulted in its children being spidered to a shallower depth.

Further Analysis and Improvements

Testing showed that the breadth-first implementation of the web spider was by far the most effective, so it will be used to collect the corpus.
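The breadth-first strategy chosen above can be sketched as follows. This is an illustrative sketch only: the real spider fetches pages over HTTP via the Downloader class, whereas here an in-memory map of page links stands in for downloadPage() and extractURLs(), and the crawl() method and its signature are hypothetical.

```java
import java.util.*;

// Illustrative breadth-first spider core. Links come from an in-memory
// map (page -> outgoing links) standing in for live HTTP downloads.
public class BreadthSpider {

    // Returns the pages that would be saved to disk: every reachable page
    // within maxDepth whose URL contains the mustContain filter string.
    public static List<String> crawl(Map<String, List<String>> links,
                                     String rootUrl, int maxDepth,
                                     String mustContain) {
        List<String> saved = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String[]> queue = new ArrayDeque<>(); // entries of [url, depth]

        queue.add(new String[] { rootUrl, "0" });
        visited.add(rootUrl);

        while (!queue.isEmpty()) {
            String[] entry = queue.poll();
            String url = entry[0];
            int depth = Integer.parseInt(entry[1]);

            // Save only pages matching the URL filter (e.g. "read_printable.asp").
            if (url.contains(mustContain)) {
                saved.add(url);
            }
            // Enqueue children unless we are at the maximum depth.
            if (depth < maxDepth) {
                for (String child : links.getOrDefault(url, List.of())) {
                    if (visited.add(child)) { // skip already-seen pages
                        queue.add(new String[] { child, String.valueOf(depth + 1) });
                    }
                }
            }
        }
        return saved;
    }
}
```

Because every page is dequeued at the shallowest depth at which it is reachable, its children are always explored to the maximum possible depth, which is exactly the property the depth-first version lacked.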
Ajit Weekly provides special printer-friendly article pages which are only used for proper articles and not for miscellaneous information pages on the web site. Using only the printer-friendly version therefore ensures that articles are downloaded without additional formatting or layout, which should reduce any post-processing that needs to be done on the articles. To ensure that the spider saves only the printer-friendly versions of articles, an extra parameter was introduced which restricts which pages are saved based on the content of the URL. To put this into effect for Ajit Weekly, only web pages whose URLs contain the string read_printable.asp will be saved (although all pages encountered will still be downloaded and processed).

Using the Spider

After the final version of the spider was tested and compiled, it was instructed to collect a corpus using the following command:

java -cp . spider.Main c:\spider\ajitweekly 500 read_printable.asp

The parameters are: source web site, local save-to path, maximum depth and "must contain" text. The spider collected 7,024 pages and reached a depth of 133 before the entire domain was spidered.

5.4. Unicode Conversion

The source data is not encoded in Unicode. For reasons of data interchange stability, it will be necessary to convert it to Unicode before it can be processed further.

The Problem

A font-encoding has no rules and simply treats Gurmukhi as an alphabet. This is contrary to rendering engines implementing Gurmukhi Unicode, which treat Gurmukhi as a syllable-centric script with strict enforcement of orthographic rules. Because a font-encoding does not enforce any rules, a completely identical syllable may be represented in multiple ways; in some cases there may be over ten identical ways to encode the same syllable. It is absolutely vital to ensure the corpus represents text in a normalised way, i.e. there should be only one way to represent a syllable. Having several identical pieces of text represented by differing underlying byte sequences makes analysis of the text much more difficult.

Take the hypothetical syllable:

ਸ੍ਰੇਂ [srēṁ]

In Unicode there is no ambiguity as to how this is encoded. In font-encodings there are at least six distinct ways of representing this syllable, because every diacritic is zero-spacing, which means the diacritics can be added in any order after the consonant and still appear correctly.

Unicode: ਸ + ੍ + ਰ + ੇ + ਂ

Font encoding: the subjoined ਰ glyph, the vowel sign and the nasal sign may follow ਸ in any of their six possible orders, all rendering identically.

This problem is exacerbated by the fact that it is nearly impossible to visually detect when the same non-spacing character is repeated. It is also difficult to detect when a larger non-spacing character overlaps a smaller one: for example, a syllable carrying both ੁ and ੂ renders almost identically to one carrying ੂ alone, with only the most visually apparent sign being seen.

These issues can cause a large number of errors when computationally processing Gurmukhi text. In terms of topic identification, for example, if several authors had created documents in the same category but with different typing orders, the machine-learning algorithm would be unable to recognise that many words with different underlying encodings are in fact the same. This can lead to significantly reduced accuracy when categorising documents. A comprehensive solution is required to fix these problems whilst also converting the text to Unicode.

Existing Tools

Unicodify was created by Lancaster University for collecting the EMILLE corpus and includes support for converting several fonts (AnmolLipi, Satluj and others) into Unicode (Hardie, 2004). Preliminary testing revealed that although a one-to-one conversion using Unicodify was fairly accurate, it did not include the error correction or rearrangement features necessary to normalise the text.

The Gurmukhi Unicode Conversion Application (GUCA) was developed by the author of this report and is released by the Punjabi Computing Resource Centre (Sidhu, 2004). GUCA was considerably more accurate than Unicodify, but it still did not include sufficient error correction. As a result, it will be necessary to implement an improved solution to ensure the corpus is accurate and normalised. GUCA is open source, which enables much of its code to be reused for this project.

Design

The programming language selected for the development of this program (called Metamorph) is C# on the .NET framework. Java was not used because Metamorph may make use of C# code from GUCA. C# is also cross-platform when used with Mono. GUCA uses a linear conversion algorithm, which is perfectly adequate for most text conversions. However, this is not likely to be the best solution for a script such as Gurmukhi, where analysis of whole syllables is required to fix errors.
The first task is to fully analyse the components and correct ordering of a Gurmukhi syllable. A syllable is composed of:

Consonant + Conjoined Consonant(s) + Vowel Sign + Nasal or Auxiliary Signs

Or, as a .NET regular expression:

(C)(N)*(HC(N)*)*(V)*(X)*

where C = consonant, N = pairin bindi [nukta], H = halant [virama], V = vowel sign and X = other signs; * indicates zero or more of the previous expression. An independent vowel is represented using a vowel bearer (treated as a consonant) and a vowel sign. The only compulsory component of a syllable is the consonant (the vowel [a] is inherent in a standalone consonant).
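As an illustration of this structure, the sketch below reorders the components of a single scrambled syllable into the (C)(N)*(HC(N)*)*(V)*(X)* order. It is a simplified sketch, not Metamorph's actual code: the character classification is approximate, and a virama is assumed to be glued to its conjoined consonant, as a single subjoined glyph would be in a font-encoding.

```java
import java.util.*;

// Sketch: normalise one syllable's components into the logical order
//   (C)(N)*(HC(N)*)*(V)*(X)*
// Assumes the input holds exactly one syllable whose components may be
// in any order.
public class SyllableReorder {

    // Rank: consonant < nukta < virama+consonant < vowel sign < other sign
    private static int rank(String token) {
        char c = token.charAt(0);
        if (c == '\u0A4D') return 2;                    // virama + conjoined consonant
        if (c == '\u0A3C') return 1;                    // nukta (pairin bindi)
        if ((c >= '\u0A3E' && c <= '\u0A42') ||
            (c >= '\u0A47' && c <= '\u0A48') ||
            (c >= '\u0A4B' && c <= '\u0A4C')) return 3; // vowel signs
        if (c == '\u0A01' || c == '\u0A02' ||
            c == '\u0A70' || c == '\u0A71') return 4;   // bindi, tippi, addak
        return 0;                                       // consonant / vowel bearer
    }

    public static String reorder(String syllable) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < syllable.length(); i++) {
            char c = syllable.charAt(i);
            if (c == '\u0A4D' && i + 1 < syllable.length()) {
                tokens.add(syllable.substring(i, i + 2)); // keep virama+consonant together
                i++;
            } else {
                tokens.add(String.valueOf(c));
            }
        }
        tokens.sort(Comparator.comparingInt(SyllableReorder::rank)); // stable sort
        return String.join("", tokens);
    }
}
```

Because the sort is stable, components that are already in a legal relative order (for example two conjoined consonants) keep that order.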

The next task is to separate the font-based text into syllables. Because vowel signs can come before the consonant, it is necessary to take this into account. This can be achieved using a specialised algorithm, as shown in appendix B.

Although Ajit Weekly only uses the DrChatrikWeb font, if designed properly, Metamorph could easily be extended to convert other fonts in the future. The vast array of differing fonts means that anyone wishing to create a large corpus in the future will require several converters. To address this issue, an intermediary encoding format (IEF) is required. Texts encoded in differing fonts (DrChatrikWeb, AnmolLipi, Satluj, and so on) are first converted to the single intermediary encoding; the normalisation algorithm is then applied exclusively to the intermediary encoding before the final conversion to Unicode.

For example, the character ƒ (0x0192) in AnmolLipi, ƒ (0x0192) in DrChatrikWeb and ù (0x00F9) in Satluj all represent the same text. Each is first mapped to the IEF sequence F128 + F174 + F17C, which is then converted to the Unicode sequence 0A28 + 0A42 + 0A70.

An ideal format in which to store information about mappings (i.e. which byte sequence is turned into which other byte sequence) is XML. Not only is it simple to use, but it allows future extensibility. Using XML files gives the advantage of being able to add new mappings without recompiling or reinstalling the program, and allows users to add their own customised mappings. See appendix C for details of the file format. This approach also allows for the introduction of font-to-font converters via the IEF. For details of the IEF, see appendix D.

Once a file has been converted to the IEF, it can then be converted into Unicode. For this to be done, Metamorph must classify each individual syllable component and then re-order the syllable based on the logical encoding rules of the Unicode Standard.
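A minimal sketch of the two-stage mapping is shown below. The real mappings live in XML files; here the single AnmolLipi example from the text (0192 → F128 F174 F17C → 0A28 0A42 0A70) is hard-coded in the test, and the class and method names are illustrative only.

```java
import java.util.*;

// Sketch of the two-stage conversion: font bytes -> IEF -> Unicode.
// A mapping is applied by always preferring the longest matching
// source sequence at each position; unmapped text passes through.
public class MappingConverter {

    public static String apply(String input, Map<String, String> mapping) {
        int maxLen = 0;
        for (String key : mapping.keySet()) maxLen = Math.max(maxLen, key.length());
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            boolean matched = false;
            // Try the longest candidate first so multi-character source
            // sequences win over their own prefixes.
            for (int len = Math.min(maxLen, input.length() - i); len > 0; len--) {
                String target = mapping.get(input.substring(i, i + len));
                if (target != null) {
                    out.append(target);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) out.append(input.charAt(i++)); // pass through unmapped text
        }
        return out.toString();
    }
}
```

Converting from a font to Unicode is then two calls: apply the font-to-IEF mapping, then the IEF-to-Unicode mapping; a font-to-font converter composes two font/IEF mappings instead.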

For example, a font-encoded text sequence for the syllable above might be ordered: vowel sign, consonant, nasal sign and conjoined consonant. When converted to Unicode, this should be in the order: consonant, conjoined consonant, vowel sign and nasal sign.

Implementation

There is only one class required for the core of the implementation, referred to as the ConversionEngine. Its methods are detailed below:

string[] ListMappings(string) - Returns an array of filenames of all mapping files in the specified directory.
MappingDetails LoadMapping(string) - Loads the specified mapping file.
string Convert(string, MappingDetails) - Converts the specified string to another encoding using the specified mapping file.
string ConvertToUnicode(string) - Converts the specified IEF-encoded string into Unicode.

The MappingDetails class contains information about the mappings obtained from an XML file. It contains metadata (author, copyright, etc.) and a list of one-to-one mappings.

A simple user interface was created that allows a user to enter text at the top, which is converted and shown at the bottom. In addition, an Options dialog allows users to alter settings such as font size and installed mappings.

Testing

This initial implementation was tested once a mapping file for DrChatrikWeb had been created. The conversion was as expected, but no error correction procedures were implemented, nor was there any easy way to convert the masses of HTML text automatically. To address these outstanding issues, a second iteration commenced.

Second Iteration Design

As mentioned earlier, it is crucial that the conversion utility also has some basic, but highly effective, error correction procedures. These are designed to repair very common (but fixable) errors that can occur in font-encodings. Metamorph should make the following corrections:

Automatic correction of nasal signs based on the accompanying vowel,
Correcting invalid combinations of a vowel bearer and vowel sign,
Selecting the most visible vowel sign if two or more vowel signs overlap,
Removing duplicated vowel signs,
Removing duplicated conjuncts which are not used in Gurmukhi.

The first correction feature (nasal sign selection) is only suitable for modern text. Archaic Gurmukhi text (pre-1950s) may not necessarily follow this convention, so it is not always applicable. Fortunately, all the text on Ajit Weekly is modern, being a few years old at most.

Metamorph also needs to be expanded to support conversion of HTML files. Without the ability to convert files automatically, it would take considerable time and effort to convert each individual text element. There are two approaches available to process HTML files:

Process as SGML,
Convert to XHTML and process as XML.

The first approach has the advantage that the HTML does not need to be altered before processing. However, this is offset by the fact that SGML is more difficult to process and the .NET framework does not offer an SGML parser. The second approach requires a utility for conversion into XML. A cross-platform application called HTML Tidy (Raggett, 2005) does this efficiently and is available free. Parsing valid XML is easier than parsing SGML, so this approach will be taken. It will be necessary for Metamorph to convert XML tags based on variables such as tag names and attribute values.
To ensure this is as versatile as possible, regular expressions will be supported when selecting the appropriate tags.

Implementation of Additional Features

The ConversionEngine class has been expanded to include an ErrorCorrection property, which causes certain checks and repair features to be activated. A BatchProcessor class has been introduced to handle the processing of large numbers of XML files. It features options to select tags based on regular expressions applied to tag names and attribute values.
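One of the corrections listed above, removing duplicated vowel signs, can be sketched with a regular expression that collapses immediately repeated identical signs, which render the same as a single sign. The character class used here is an approximation of the Gurmukhi dependent-sign ranges, not Metamorph's actual implementation.

```java
import java.util.regex.Pattern;

// Sketch of one error-correction step: collapse an immediately repeated
// identical dependent sign (vowel sign, nukta, nasal/auxiliary sign)
// into a single occurrence, since the repeats are visually undetectable.
public class DuplicateSignFix {
    private static final Pattern DUP =
        Pattern.compile("([\u0A01\u0A02\u0A3C\u0A3E-\u0A4D\u0A70\u0A71])\\1+");

    public static String fix(String text) {
        return DUP.matcher(text).replaceAll("$1");
    }
}
```

The same backreference pattern generalises to the duplicated-conjunct correction by widening the character class to cover subjoined sequences.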

It also has general options to select which mapping to use and to indicate how the files should be renamed. These new additions were extensively tested to ensure that the output text was as desired. The batch processor was tested to ensure that only the appropriate portions of each XML file were converted and that the output was valid XML.

Converting HTML Files to Unicode

Detecting the font used in a particular HTML tag is no trivial task. There are many ways to specify fonts: cascading style sheets (CSS), style attributes and font tags. Any embedded tags inherit the styling of their parent tags unless they specifically override it. Fortunately, in the case of Ajit Weekly, all formatting was done via CSS, so it was sufficient simply to check the class attribute of the font tag. This was simplified further by the fact that the spider only saved the print preview versions of articles.

The XML conversion facility created in Metamorph enables users to specify regular expressions to determine which XML tags to convert. For Ajit Weekly, a simple set of expressions was required:

Tag: ^font$
Attribute: class
Attribute value: ^drc

Metamorph then processed the files, converting the text in any tags that matched the regular expression queries above.
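The three-part match above (tag name, attribute name, attribute value) can be sketched as below. The class and method names are hypothetical; this shows only the selection test, not the XML traversal itself.

```java
import java.util.regex.Pattern;

// Sketch of the tag-selection test used when batch-converting XHTML:
// a tag's text is converted only if the tag name, attribute name and
// attribute value all match their configured regular expressions
// (for Ajit Weekly: ^font$, class, ^drc).
public class TagSelector {
    private final Pattern tagName, attrName, attrValue;

    public TagSelector(String tagRe, String attrRe, String valueRe) {
        tagName = Pattern.compile(tagRe);
        attrName = Pattern.compile(attrRe);
        attrValue = Pattern.compile(valueRe);
    }

    public boolean shouldConvert(String tag, String attr, String value) {
        return tagName.matcher(tag).find()
            && attrName.matcher(attr).find()
            && attrValue.matcher(value).find();
    }
}
```

Anchored patterns such as ^font$ match the whole tag name, while an open-ended prefix pattern such as ^drc matches any class name in the drc family.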


More information

Survey of Language Computing in Asia 2005

Survey of Language Computing in Asia 2005 Survey of Language Computing in Asia 2005 Sarmad Hussain Nadir Durrani Sana Gul Center for Research in Urdu Language Processing National University of Computer and Emerging Sciences www.nu.edu.pk www.idrc.ca

More information

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 25 Tutorial 5: Analyzing text using Python NLTK Hi everyone,

More information

Structured documents

Structured documents Structured documents An overview of XML Structured documents Michael Houghton 15/11/2000 Unstructured documents Broadly speaking, text and multimedia document formats can be structured or unstructured.

More information

Issues in Indic Language Collation

Issues in Indic Language Collation Issues in Indic Language Collation Cathy Wissink Program Manager, Windows Globalization Microsoft Corporation I. Introduction As the software market for India i grows, so does the interest in developing

More information

DESIGNING A DIGITAL LIBRARY WITH BENGALI LANGUAGE S UPPORT USING UNICODE

DESIGNING A DIGITAL LIBRARY WITH BENGALI LANGUAGE S UPPORT USING UNICODE 83 DESIGNING A DIGITAL LIBRARY WITH BENGALI LANGUAGE S UPPORT USING UNICODE Rajesh Das Biswajit Das Subhendu Kar Swarnali Chatterjee Abstract Unicode is a 32-bit code for character representation in a

More information

1. Introduction 2. TAMIL LETTER SHA Character proposed in this document About INFITT and INFITT WG

1. Introduction 2. TAMIL LETTER SHA Character proposed in this document About INFITT and INFITT WG Dated: September 14, 2003 Title: Proposal to add TAMIL LETTER SHA Source: International Forum for Information Technology in Tamil (INFITT) Action: For consideration by UTC and ISO/IEC JTC 1/SC 2/WG 2 Distribution:

More information

Andrew Glass and Shriramana Sharma. anglass-at-microsoft-dot-com jamadagni-at-gmail-dot-com November-2

Andrew Glass and Shriramana Sharma. anglass-at-microsoft-dot-com jamadagni-at-gmail-dot-com November-2 Proposal to encode 1107F BRAHMI NUMBER JOINER (REVISED) Andrew Glass and Shriramana Sharma anglass-at-microsoft-dot-com jamadagni-at-gmail-dot-com 1. Background 2011-vember-2 In their Brahmi proposal L2/07-342

More information

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS 82 CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS In recent years, everybody is in thirst of getting information from the internet. Search engines are used to fulfill the need of them. Even though the

More information

The Unicode Standard Version 12.0 Core Specification

The Unicode Standard Version 12.0 Core Specification The Unicode Standard Version 12.0 Core Specification To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Overview. Introduction. Introduction XML XML. Lecture 16 Introduction to XML. Boriana Koleva Room: C54

Overview. Introduction. Introduction XML XML. Lecture 16 Introduction to XML. Boriana Koleva Room: C54 Overview Lecture 16 Introduction to XML Boriana Koleva Room: C54 Email: bnk@cs.nott.ac.uk Introduction The Syntax of XML XML Document Structure Document Type Definitions Introduction Introduction SGML

More information

Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd.

Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd. Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd. This paper was presented at the ICADL conference December 2001 by Spectrum

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

L2/ Proposal to encode archaic vowel signs O OO for Kannada. 1. Thanks. 2. Introduction

L2/ Proposal to encode archaic vowel signs O OO for Kannada. 1. Thanks. 2. Introduction L2/14-004 Proposal to encode archaic vowel signs O OO for Kannada Shriramana Sharma, jamadagni-at-gmail-dot-com, India 2013-Dec-31 1. Thanks I thank Srinidhi of Tumkur, Karnataka, for alerting me to these

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

Implementing Web Content

Implementing Web Content Implementing Web Content Tonia M. Bartz Dr. David Robins Individual Investigation SLIS Site Redesign 6 August 2006 Appealing Web Content When writing content for a web site, it is best to think of it more

More information

Activity Report at SYSTRAN S.A.

Activity Report at SYSTRAN S.A. Activity Report at SYSTRAN S.A. Pierre Senellart September 2003 September 2004 1 Introduction I present here work I have done as a software engineer with SYSTRAN. SYSTRAN is a leading company in machine

More information

The Unicode Standard Version 6.1 Core Specification

The Unicode Standard Version 6.1 Core Specification The Unicode Standard Version 6.1 Core Specification To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers

More information

The XML Metalanguage

The XML Metalanguage The XML Metalanguage Mika Raento mika.raento@cs.helsinki.fi University of Helsinki Department of Computer Science Mika Raento The XML Metalanguage p.1/442 2003-09-15 Preliminaries Mika Raento The XML Metalanguage

More information

Content Enrichment. An essential strategic capability for every publisher. Enriched content. Delivered.

Content Enrichment. An essential strategic capability for every publisher. Enriched content. Delivered. Content Enrichment An essential strategic capability for every publisher Enriched content. Delivered. An essential strategic capability for every publisher Overview Content is at the centre of everything

More information

Keyboards for inputting Chinese Language: A study based on US Patents

Keyboards for inputting Chinese Language: A study based on US Patents From the SelectedWorks of Umakant Mishra April, 2005 Keyboards for inputting Chinese Language: A study based on US Patents Umakant Mishra Available at: https://works.bepress.com/umakant_mishra/11/ Keyboard

More information

Introduction to Tools for IndoWordNet and Word Sense Disambiguation

Introduction to Tools for IndoWordNet and Word Sense Disambiguation Introduction to Tools for IndoWordNet and Word Sense Disambiguation Arindam Chatterjee, Salil Rajeev Joshi, Mitesh M. Khapra, Pushpak Bhattacharyya { arindam, salilj, miteshk, pb }@cse.iitb.ac.in Department

More information

Informatics 1: Data & Analysis

Informatics 1: Data & Analysis Informatics 1: Data & Analysis Lecture 9: Trees and XML Ian Stark School of Informatics The University of Edinburgh Tuesday 11 February 2014 Semester 2 Week 5 http://www.inf.ed.ac.uk/teaching/courses/inf1/da

More information

097B Ä DEVANAGARI LETTER GGA 097C Å DEVANAGARI LETTER JJA 097E Ç DEVANAGARI LETTER DDDA 097F É DEVANAGARI LETTER BBA

097B Ä DEVANAGARI LETTER GGA 097C Å DEVANAGARI LETTER JJA 097E Ç DEVANAGARI LETTER DDDA 097F É DEVANAGARI LETTER BBA ISO/IEC JTC1/SC2/WG2 N2934 L2/05-082 2005-03-30 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation еждународная организация

More information

QDA Miner. Addendum v2.0

QDA Miner. Addendum v2.0 QDA Miner Addendum v2.0 QDA Miner is an easy-to-use qualitative analysis software for coding, annotating, retrieving and reviewing coded data and documents such as open-ended responses, customer comments,

More information

Annotation Science From Theory to Practice and Use Introduction A bit of history

Annotation Science From Theory to Practice and Use Introduction A bit of history Annotation Science From Theory to Practice and Use Nancy Ide Department of Computer Science Vassar College Poughkeepsie, New York 12604 USA ide@cs.vassar.edu Introduction Linguistically-annotated corpora

More information

MythoLogic: problems and their solutions in the evolution of a project

MythoLogic: problems and their solutions in the evolution of a project 6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. MythoLogic: problems and their solutions in the evolution of a project István Székelya, Róbert Kincsesb a Department

More information

CA Productivity Accelerator 12.1 and Later

CA Productivity Accelerator 12.1 and Later CA Productivity Accelerator 12.1 and Later Localize Content Localize Content Once you have created content in one language, you might want to translate it into one or more different languages. The Developer

More information

DOWNLOAD OR READ : URDU HINDI DICTIONARY IN DEVNAGRI SCRIPT PDF EBOOK EPUB MOBI

DOWNLOAD OR READ : URDU HINDI DICTIONARY IN DEVNAGRI SCRIPT PDF EBOOK EPUB MOBI DOWNLOAD OR READ : URDU HINDI DICTIONARY IN DEVNAGRI SCRIPT PDF EBOOK EPUB MOBI Page 1 Page 2 urdu hindi dictionary in devnagri script urdu hindi dictionary in pdf urdu hindi dictionary in devnagri script

More information

The Unicode Standard Version 11.0 Core Specification

The Unicode Standard Version 11.0 Core Specification The Unicode Standard Version 11.0 Core Specification To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers

More information

Midterm 1 Review Sheet CSS 305 Sp 06

Midterm 1 Review Sheet CSS 305 Sp 06 This is a list of topics that we have covered so far. This is not all inclusive of every detail and there may be items on the exam that are not explicitly listed here, but these are the primary topics

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: Designing Methodologies of Tamil Language in Web Services C.

More information

Chapter 7. Representing Information Digitally

Chapter 7. Representing Information Digitally Chapter 7 Representing Information Digitally Learning Objectives Explain the link between patterns, symbols, and information Determine possible PandA encodings using a physical phenomenon Encode and decode

More information

SDMX self-learning package No. 5 Student book. Metadata Structure Definition

SDMX self-learning package No. 5 Student book. Metadata Structure Definition No. 5 Student book Metadata Structure Definition Produced by Eurostat, Directorate B: Statistical Methodologies and Tools Unit B-5: Statistical Information Technologies Last update of content December

More information

Survey of Language Computing in Asia 2005

Survey of Language Computing in Asia 2005 Survey of Language Computing in Asia 2005 Sarmad Hussain Nadir Durrani Sana Gul Center for Research in Urdu Language Processing National University of Computer and Emerging Sciences www.nu.edu.pk www.idrc.ca

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

XML 2 APPLICATION. Chapter SYS-ED/ COMPUTER EDUCATION TECHNIQUES, INC.

XML 2 APPLICATION. Chapter SYS-ED/ COMPUTER EDUCATION TECHNIQUES, INC. XML 2 APPLIATION hapter SYS-ED/ OMPUTER EDUATION TEHNIQUES, IN. Objectives You will learn: How to create an XML document. The role of the document map, prolog, and XML declarations. Standalone declarations.

More information

UniTerm Formats and Terminology Exchange

UniTerm Formats and Terminology Exchange Wolfgang Zenk UniTerm Formats and Terminology Exchange Abstract This article presents UniTerm, a typical representative of terminology management systems (TMS). The first part will highlight common characteristics

More information

PROPOSALS FOR MALAYALAM AND TAMIL SCRIPTS ROOT ZONE LABEL GENERATION RULES

PROPOSALS FOR MALAYALAM AND TAMIL SCRIPTS ROOT ZONE LABEL GENERATION RULES PROPOSALS FOR MALAYALAM AND TAMIL SCRIPTS ROOT ZONE LABEL GENERATION RULES Publication Date: 23 November 2018 Prepared By: IDN Program, ICANN Org Public Comment Proceeding Open Date: 25 September 2018

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 23 Hierarchical Memory Organization (Contd.) Hello

More information

Title: Application to include Arabic alphabet shapes to Arabic 0600 Unicode character set

Title: Application to include Arabic alphabet shapes to Arabic 0600 Unicode character set Title: Application to include Arabic alphabet shapes to Arabic 0600 Unicode character set Action: For consideration by UTC and ISO/IEC JTC1/SC2/WG2 Author: Mohammad Mohammad Khair Date: 17-Dec-2018 Introduction:

More information

Survey of Language Computing in Asia 2005

Survey of Language Computing in Asia 2005 Survey of Language Computing in Asia 2005 Sarmad Hussain Nadir Durrani Sana Gul Center for Research in Urdu Language Processing National University of Computer and Emerging Sciences www.nu.edu.pk www.idrc.ca

More information

Orange3-Textable Documentation

Orange3-Textable Documentation Orange3-Textable Documentation Release 3.0a1 LangTech Sarl Dec 19, 2017 Contents 1 Getting Started 3 1.1 Orange Textable............................................. 3 1.2 Description................................................

More information

Introduction to Text Mining. Aris Xanthos - University of Lausanne

Introduction to Text Mining. Aris Xanthos - University of Lausanne Introduction to Text Mining Aris Xanthos - University of Lausanne Preliminary notes Presentation designed for a novice audience Text mining = text analysis = text analytics: using computational and quantitative

More information

PROFESSIONAL TUTORIAL. Trinity Innovations 2010 All Rights Reserved.

PROFESSIONAL TUTORIAL. Trinity Innovations 2010 All Rights Reserved. PROFESSIONAL TUTORIAL Trinity Innovations 2010 All Rights Reserved www.3dissue.com PART ONE Converting PDFs into the correct JPEG format To create a new digital edition from a PDF we are going to use the

More information

CID-Keyed Font Technology Overview

CID-Keyed Font Technology Overview CID-Keyed Font Technology Overview Adobe Developer Support Technical Note #5092 12 September 1994 Adobe Systems Incorporated Adobe Developer Technologies 345 Park Avenue San Jose, CA 95110 http://partners.adobe.com/

More information

XDS An Extensible Structure for Trustworthy Document Content Verification Simon Wiseman CTO Deep- Secure 3 rd June 2013

XDS An Extensible Structure for Trustworthy Document Content Verification Simon Wiseman CTO Deep- Secure 3 rd June 2013 Assured and security Deep-Secure XDS An Extensible Structure for Trustworthy Document Content Verification Simon Wiseman CTO Deep- Secure 3 rd June 2013 This technical note describes the extensible Data

More information

Parser Design. Neil Mitchell. June 25, 2004

Parser Design. Neil Mitchell. June 25, 2004 Parser Design Neil Mitchell June 25, 2004 1 Introduction A parser is a tool used to split a text stream, typically in some human readable form, into a representation suitable for understanding by a computer.

More information

Ranking in a Domain Specific Search Engine

Ranking in a Domain Specific Search Engine Ranking in a Domain Specific Search Engine CS6998-03 - NLP for the Web Spring 2008, Final Report Sara Stolbach, ss3067 [at] columbia.edu Abstract A search engine that runs over all domains must give equal

More information

Preserving Non-essential Information Related to the Presentation of a Language Instance. Terje Gjøsæter and Andreas Prinz

Preserving Non-essential Information Related to the Presentation of a Language Instance. Terje Gjøsæter and Andreas Prinz Preserving Non-essential Information Related to the Presentation of a Language Instance Terje Gjøsæter and Andreas Prinz Faculty of Engineering and Science, University of Agder Serviceboks 509, NO-4898

More information

WYSIWON T The XML Authoring Myths

WYSIWON T The XML Authoring Myths WYSIWON T The XML Authoring Myths Tony Stevens Turn-Key Systems Abstract The advantages of XML for increasing the value of content and lowering production costs are well understood. However, many projects

More information

Categorizing Migrations

Categorizing Migrations What to Migrate? Categorizing Migrations A version control repository contains two distinct types of data. The first type of data is the actual content of the directories and files themselves which are

More information

Isolated Handwritten Words Segmentation Techniques in Gurmukhi Script

Isolated Handwritten Words Segmentation Techniques in Gurmukhi Script Isolated Handwritten Words Segmentation Techniques in Gurmukhi Script Galaxy Bansal Dharamveer Sharma ABSTRACT Segmentation of handwritten words is a challenging task primarily because of structural features

More information

Installation BEFORE INSTALLING! Minimum System Requirements

Installation BEFORE INSTALLING! Minimum System Requirements Installation BEFORE INSTALLING! NOTE: It is recommended that you quit all other applications before running this program. NOTE: Some virus detection programs can be set to scan files on open. This setting

More information