Corpus Collection and Topic Identification for Punjabi


Summary

The aim of this project is to create a corpus of Punjabi news topics and then create a topic identification algorithm that can be applied to the corpus. This has involved the creation of:

- a web spider to automatically collect large numbers of articles,
- a font-encoding to Unicode converter for Gurmukhi text,
- a script to convert XHTML files to TEI Lite files,
- a loss-less Unicode Gurmukhi to ASCII converter,
- a simple Punjabi word stemmer,
- a script to convert TEI Lite files to WEKA-compatible ARFF files.

This report details how an extensive set of tools was designed to create a normalised, three-million-word corpus. It also details how a topic identification algorithm was used to select features from the corpus that could be used with machine-learning tools to correctly categorise articles. Tools were developed using Java (for the web spider) and .NET C# (all other tools and scripts).

Acknowledgments

I would like to thank:

- Katja Markert, for the many meetings she has had with me, and the delays she has had to put up with!
- Eric Atwell, my assessor.
- My eight housemates, who kept me sane despite their constant distractions.

1. Contents
2. Introduction
   2.1. Objectives
   2.2. Minimum Requirements
   2.3. Deliverables
   2.4. Schedule
3. Background
   3.1. Punjabi Language (Scripts; Morphology)
   3.2. Gurmukhi Script (Alphabet; Miscellaneous Signs; Conjuncts; Unicode; Other Encodings)
   3.3. Corpora (Representativeness; Utilising Corpora; Collecting Data; Representing Data; Computing Difficulties in Creating a Corpus; Existing Tools for Punjabi)
   3.4. Text Categorisation and Topic Identification
4. Methodology
5. Creating a Corpus
   5.1. Pipeline
   5.2. Data Sources (EMILLE Corpus; TDIL ISCII Corpus; Websites; Conclusion)
   5.3. Data Collection (Designing a Web Spider; Implementation; Testing; Further Analysis and Improvements; Using the Spider)
   5.4. Unicode Conversion (The Problem; Existing Tools; Design; Implementation; Testing; Second Iteration Design; Implementation of Additional Features; Converting HTML Files to Unicode)
   5.5. Storage Format (Design; Implementation; Corpus Statistics)
Topic Identification Algorithm (Pipeline; Design; Conversion to ARFF; Stemming; Spelling Variations; Word Exclusions; Implementation)
Evaluation (Corpus; Appropriateness of Pages Collected by the Spider; Representativeness of the Corpus; Validity of TEI Files; Accuracy of Unicode Conversion; Topic Identification Algorithm)
Conclusion
References
Appendix
   A. Reflection
   B. Syllable Separation Algorithm
   C. Example Mapping File
   D. IEF Code Chart
   E. Ajit Weekly Categories
   F. TEI Sample File
   G. Punjabi Word Stop List
   H. Example ARFF File
   I. WX Notation
   J. J48 Decision Tree Results
   K. Naïve Bayes Results
   L. Ajit Weekly

A CD-ROM with tools and files created during this project is included. Read the accompanying readme.txt file for further details.

2. Introduction

The aim of this project is to collect a corpus of various Punjabi news topics, and to develop and evaluate an algorithm for automatic topic identification for Punjabi.

2.1. Objectives

The objectives of this project are:

- Studying ways of collecting and encoding Gurmukhi text.
- Investigating natural language processing techniques when applied to Punjabi.
- Learning about topic identification and adapting its use for Gurmukhi text.

2.2. Minimum Requirements

The minimum requirements for this project are:

- Background NLP work on corpora, text categorisation, Punjabi and Gurmukhi.
- A spider to collect a corpus from internet resources.
- Corpus conversion and annotation.
- A basic topic identification algorithm (based on unigrams only).
- A report evaluating the accuracy of the topic identification algorithm.

Possible enhancements:

- Expansion to cover related languages such as Hindi, Bengali and Gujarati.
- Expansion to cover other Punjabi scripts such as Shahmukhi.
- Extensive morphological pre-processing for the topic identification algorithm.

2.3. Deliverables

The deliverables for this project are:

- A corpus of Punjabi news topics. This corpus is not necessarily going to be available for redistribution.
- A topic identification algorithm.
- A report documenting the project.

2.4. Schedule

October-November 2005:  Background research
December 2005:          Initial prototyping of web spider
January 2006:           Finalise web spider
January-February 2006:  Corpus collection and annotation
March 2006:             Develop topic identification algorithm
April 2006:             Evaluate algorithm
April-May 2006:         Finalise report

A cross indicates that the deadline for the task was missed. A tick indicates that the deadline was achieved.

3. Background

3.1. Punjabi Language

The Punjabi (or Panjabi) language originates from the Punjab areas of both India and Pakistan. It is an Indo-European language spoken in all its dialects¹ by over one hundred million people (Gordon, 2005). It is the official language of the Indian state of Punjab and is the language most widely spoken in the Punjab province of Pakistan, even though it has no official state patronage there. Uniquely for an Indo-Aryan language, Punjabi is tonal in nature, with a high, level and low tone (Grierson, 1927). It is similar to, and shares much grammar with, the neighbouring Hindustani language (the vernacular form of Hindi and Urdu).

Scripts

Punjabi can be written in a number of different scripts. The eastern variant, or Indian Punjabi, is written in the Gurmukhi script. The western variant, or Pakistani Punjabi, is written in the Nasta'liq style of the Perso-Arabic alphabet, known as Shahmukhi. Punjabi may also be written in Devanagari or even Latin:

Gurmukhi:   ਪੰਜਾਬੀ
Devanagari: पंजाबी
Shahmukhi:  پنجابی
Latin:      Panjābī

This project will concentrate on Punjabi text in the Gurmukhi script.

Morphology

Morphology is the study of the structure of words in a language and how they are altered and created. Punjabi is an agglutinative language and words are derived by adding affixes to words (Bhatia, 1993). By studying the morphology of words and by using stemming and lemmatisation, it is easier to analyse and annotate text.

Lemmatisation is used to reduce a set of words down to their lexeme or root. It enables more efficient processing of text because words such as kicks, kicked and kicking are all reduced to a single word: kick. Algorithms are then applied directly to the lexeme, which may give more accurate results than applying them to each individual word.

Like other Indian languages, Punjabi assigns gender to inanimate objects in addition to treating males as masculine and females as feminine.

¹ This includes both Eastern and Western Punjabi (Lahnda), and Siraiki.

There are no rules to determine whether inanimate objects are feminine or masculine. However, most nouns ending in ੀ [ī] are feminine and most nouns ending in ਾ [ā] are masculine. There are exceptions to this rule which can only be determined through experience or by using a dictionary.

Plurals are formed by changing the ending of words. Masculine nouns ending in ਾ [ā] change into ੇ [ē] to form the plural. Feminine nouns ending in [a] or ੀ [ī] change to ਾਂ [āṁ], and those ending in ਾ [ā] change to ਵਾਂ [vāṁ] (Kalra, 2003).

There are two forms of adjectives: variable and invariable. Variable adjectives end in ਾ [ā] for masculine nouns or ੀ [ī] for feminine nouns. They inflect based on the noun type, although this is not always the case (Kalra, 2003). For example, the root ਵੱਡ [vaḍḍ], meaning big, has the following forms:

Masculine singular: ਵੱਡਾ [vaḍḍā]
Masculine plural:   ਵੱਡੇ [vaḍḍē]
Feminine singular:  ਵੱਡੀ [vaḍḍī]
Feminine plural:    ਵੱਡੀਆਂ [vaḍḍīāṁ]

Prefixes are used predominantly on nouns, and suffixes are used on both nouns and verbs (Bhatia, 1993). For example, the word ਬੰਸ [baṃs], meaning family, can be converted to mean the whole family by adding the prefix ਸਰ- [sar-], meaning whole:

ਸਰ- [sar-] (whole) + ਬੰਸ [baṃs] (family) → ਸਰਬੰਸ [sarbaṃs] (whole family)

Words can be formed by using both prefixes and suffixes:

ਖ਼ਬਰ [ḵẖabar] (news) + -ੀ [-ī] (-ness) → ਖ਼ਬਰੀ [ḵẖabarī] (awareness)
ਬੇ- [bē-] (un-) + ਖ਼ਬਰੀ [ḵẖabarī] (awareness) → ਬੇਖ਼ਬਰੀ [bēḵẖabarī] (unawareness, ignorance)

Negation is indicated by using prefixes. The prefixes ਬੇ- [bē-], ਨਾ- [nā-], ਅ- [a-] and ਅੰ- [aṃ-] are all used, depending on the form and origin of the word.

Verbs can be formed from nouns by adding suffixes such as -ਨਾ [-nā] and -ਣਾ [-ṇā]. For example:

ਬੋਲ [bōl] (word) + -ਣਾ [-ṇā] → ਬੋਲਣਾ [bōlṇā] (to speak)
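The report later describes building a simple Punjabi word stemmer; the inflections above suggest how one could work. The following is a minimal sketch, not the project's actual implementation: a longest-match suffix stripper whose suffix list is illustrative only (drawn from the endings shown above) and whose class name is hypothetical.

```java
import java.util.List;

// Sketch of a longest-match suffix-stripping stemmer for Gurmukhi.
// The suffix list is illustrative, taken from the inflections in the
// text; a usable stemmer would need a much fuller list.
public class GurmukhiStemmer {
    // Longest suffixes first, so -īāṁ is stripped before -āṁ or -ī.
    private static final List<String> SUFFIXES = List.of(
            "\u0A40\u0A06\u0A02", // ੀਆਂ [-īāṁ] feminine plural
            "\u0A3E\u0A02",       // ਾਂ  [-āṁ]
            "\u0A47",             // ੇ   [-ē]  masculine plural
            "\u0A40",             // ੀ   [-ī]  feminine singular
            "\u0A3E");            // ਾ   [-ā]  masculine singular

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Only strip when something is left after removal.
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }
}
```

Under this scheme all four adjective forms of ਵੱਡ [vaḍḍ] above reduce to the same root, which is exactly the collapsing effect stemming is meant to achieve.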

3.2. Gurmukhi Script

The Gurmukhi script is the main script used to write the Punjabi language and is the only script extensively taught for the sole purpose of writing Punjabi. It is the official script of the Indian state of Punjab and is the basis for work on this project.

The Gurmukhi script is an abugida that descended from the Brahmi script of ancient India (Kalra, 2003). The script was standardised by the second Sikh guru, Angad Dev, in the sixteenth century. The name Gurmukhi literally means "from the mouth of the Guru" (Gill, 1996).

An abugida is a script composed of a series of consonants which include an inherent vowel. In the case of Gurmukhi, the inherent vowel is altered using diacritics known as vowel signs. As such, the basic characters represent syllables and not consonants. For example, in the Latin script, the letter K simply represents the consonant K. In Gurmukhi there is no basic sign for the letter K. Instead there is the letter ਕ [ka], which can be modified into other syllables by attaching a vowel sign such as ੂ [ū] to become ਕੂ [kū], or ੀ [ī] to become ਕੀ [kī]. The inherent vowel, as indicated above, is [a] and is not pronounced at the end of a word.

Alphabet

Gurmukhi consists of thirty-five distinct characters, of which three are vowel sign bearers (Gill, 1996). Gurmukhi characters do not have a one-to-one mapping with the standard Latin alphabet, and therefore a special transliteration scheme is used. The selected transliteration scheme complies with the conventions recommended in ISO 15919:2001 (Stone, 2004). However, for reasons of clarity, the final unpronounced [a] is not transliterated.

ੳ (-)    ਅ [a]    ੲ (-)    ਸ [sa]   ਹ [ha]
ਕ [ka]   ਖ [kha]  ਗ [ga]   ਘ [gha]  ਙ [ṅa]
ਚ [ca]   ਛ [cha]  ਜ [ja]   ਝ [jha]  ਞ [ña]
ਟ [ṭa]   ਠ [ṭha]  ਡ [ḍa]   ਢ [ḍha]  ਣ [ṇa]
ਤ [ta]   ਥ [tha]  ਦ [da]   ਧ [dha]  ਨ [na]
ਪ [pa]   ਫ [pha]  ਬ [ba]   ਭ [bha]  ਮ [ma]
ਯ [ya]   ਰ [ra]   ਲ [la]   ਵ [va]   ੜ [ṛa]

The following letters are used to represent sounds not present in the Punjabi language and are used in loan words (Kalra, 2003). These are created by placing a Pairin Bindi (literally, "with a dot in the foot") onto existing, similar-sounding consonants (Unicode Consortium, 2003).

ਸ਼ [śa]  ਕ਼ [qa]  ਖ਼ [ḵẖa]  ਗ਼ [ġa]  ਜ਼ [za]  ਫ਼ [fa]  ਲ਼ [ḻa]

There are two forms of vowels used in Gurmukhi: independent vowels and dependent vowel signs. Independent vowels are constructed using the first three characters of the alphabet, known as vowel bearers. With the exception of ਅ [a], they do not represent anything on their own and cannot be used without additional vowel signs. They are used to represent vowel sounds where using a vowel sign is not suitable, such as at the beginning of a word. In cases where the inherent vowel needs to be altered (for example, from [a] to [ī]), a vowel sign is attached to the consonant. The following table lists the Punjabi name, independent form, dependent form and transliteration for each vowel:

Muktā      ਅ   (none)  [a]
Kannā      ਆ   ਾ       [ā]
Sihārī     ਇ   ਿ       [i]
Bihārī     ਈ   ੀ       [ī]
Auṅkaṛ     ਉ   ੁ       [u]
Dulaiṅkaṛ  ਊ   ੂ       [ū]
Lānvāṁ     ਏ   ੇ       [ē]
Dulānvāṁ   ਐ   ੈ       [ai]
Hōṛā       ਓ   ੋ       [ō]
Kanauṛā    ਔ   ੌ       [au]

In addition to the consonants and vowel signs, there are several special characters used in Gurmukhi.

Miscellaneous Signs

Gurmukhi uses two signs to indicate nasalisation: Tippi ੰ [ṃ] and Bindi ਂ [ṁ]. Essentially these two signs have the same function but are used in different settings. Tippi is used with the inherent vowel [a], the independent and dependent forms of [i], and the dependent forms of [u] and [ū]. Bindi is used in the other cases (Gill, 1996). The sound often used in pronouncing Tippi and Bindi is similar to the n in the English suffix "ing".

The Adhak sign ੱ is placed before a consonant to indicate that it is a geminate (reinforced or doubled). The effect is thus:

ਗਡੀ [gaḍī]  →  ਗੱਡੀ [gaḍḍī]
ਸਿਖੀ [sikhī]  →  ਸਿੱਖੀ [sikkhī]

Conjuncts

Conjuncts are used to represent consonant clusters, i.e. groupings of more than one consonant. There are four main conjuncts used in modern Gurmukhi, of which the first two are by far the most common:

ਰ [ra]  →  ੍ਰ
ਹ [ha]  →  ੍ਹ
ਵ [va]  →  ੍ਵ
ਯ [ya]  →  ੍ਯ

They attach to existing consonants and replace the inherent vowel. For example:

ਸਰੀ [sarī]  →  ਸ੍ਰੀ [srī]

For other consonant clusters, it is left to the reader to determine when the inherent vowel is dropped. In some specialised situations, such as a dictionary, a sign known as Halant or Virama is used to explicitly kill the vowel. Whether ਸ੍ਰੀ [srī] is written with a visible Halant or with a subjoined consonant, the two renderings are identical in terms of pronunciation and meaning. The use of the Halant or Virama is a concept borrowed from other Indic scripts and is only rarely seen in Gurmukhi.

Unicode

The Unicode Standard 4.0 includes seventy-seven code points for Gurmukhi in the Gurmukhi block, from U+0A00 to U+0A7F. This includes all the characters mentioned above, in addition to some archaic and specialised characters. Gurmukhi also uses the characters Danda (U+0964) and Double Danda (U+0965) from the Devanagari block to delimit sentences.

As encoded in Unicode, Gurmukhi is classified as a complex script and requires special rendering technology to be usable. Gurmukhi is supported by the main modern text rendering engines, including Uniscribe (Microsoft Windows), ICU (Linux and other software) and ATSUI (Mac OS X). Unicode requires Gurmukhi text to be encoded in logical as opposed to visual order. It is then rearranged by rendering software to appear in the correct order.
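The block layout described above is easy to test programmatically. The following sketch (class name hypothetical) checks whether a code point falls in the Gurmukhi block; Java's standard library exposes the same information through Character.UnicodeBlock.

```java
// Sketch: range test for the Gurmukhi block (U+0A00..U+0A7F).
// The Danda (U+0964) deliberately falls outside it, since it is
// borrowed from the Devanagari block, as noted in the text.
public class GurmukhiBlock {
    public static boolean isGurmukhi(int codePoint) {
        return codePoint >= 0x0A00 && codePoint <= 0x0A7F;
    }
}
```

A check like this is useful when validating converted text: every character of a cleaned Gurmukhi article should either pass this test or be whitespace, punctuation, or one of the Devanagari dandas.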

Logical order versus visual order:

ਕ [ka] (U+0A15) + ਿ [i] (U+0A3F)  →  ਕਿ [ki]

Unicode does not assign separate code points for conjoined consonants. Instead it uses a special character known as Virama to induce conjoined-consonant behaviour:

ਸ [sa] (U+0A38) + ਰ [ra] (U+0A30) + ੀ [ī] (U+0A40)  →  ਸਰੀ [sarī]
ਸ [sa] (U+0A38) + ੍ Virama (U+0A4D) + ਰ [ra] (U+0A30) + ੀ [ī] (U+0A40)  →  ਸ੍ਰੀ [srī]

Finally, although Ura, Aira and Iri are encoded in the Gurmukhi sub-range, using these characters with vowel signs to make independent vowels is not recommended. The pre-composed independent vowels that are encoded should be used instead.

Other Encodings

Although Unicode is now the de facto standard for storing text on modern computers, its use for Gurmukhi is limited (although constantly increasing). In the past, the ISCII encoding scheme was used for Gurmukhi, and it formed the basis for Gurmukhi support in Unicode. Conversion from ISCII to Unicode is relatively straightforward due to the similarities in encoding.

The predominant encoding for Gurmukhi is the use of fonts such as AnmolLipi and DrChatrikWeb. These function by masking ASCII characters so they appear as Gurmukhi characters, and each font has a different mapping. For example, under the AnmolLipi font, the Latin character A would appear as ਅ. Conversely, under the DrChatrikWeb font it would appear as ੳ. These fonts are used because they require no complex rendering (rendering support for Unicode Gurmukhi was scant until recently). This project aims to standardise on Unicode because it is an internationally recognised standard that is supported by a large number of operating systems.
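The font-masking scheme implies that conversion to Unicode is, at its core, a per-font lookup table. The sketch below uses only the two mappings cited above (the Latin A under each font); it is not the project's Metamorph tool. A real converter needs a complete table per font, handling of multi-character sequences, and reordering logic for pre-base vowel signs such as Sihari.

```java
import java.util.Map;

// Sketch of table-driven font-encoding conversion. Only the two
// mappings cited in the text are included; real fonts remap nearly
// every ASCII character.
public class FontConverter {
    static final Map<Character, Character> ANMOLLIPI = Map.of('A', '\u0A05');    // A -> ਅ
    static final Map<Character, Character> DRCHATRIKWEB = Map.of('A', '\u0A73'); // A -> ੳ

    public static String toUnicode(String text, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            out.append(table.getOrDefault(c, c)); // unmapped characters pass through
        }
        return out.toString();
    }
}
```

The same source string therefore converts differently depending on which font's table is applied, which is precisely why per-font mapping files are needed.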

3.3. Corpora

In its simplest form, a corpus (from Latin, meaning "body") is a collection of more than one text (McEnery, Wilson, 2001). It can comprise both monolingual and multilingual content. Corpora may be either annotated or unannotated. An annotated corpus includes metadata (information about the data), whereas an unannotated corpus contains merely the raw text. Some multilingual corpora are formatted so that related words are aligned throughout different sets of languages. These are known as parallel aligned corpora and fall under the umbrella of annotated corpora.

Representativeness

In creating a corpus, it is necessary for it to be representative of the language or the task at hand. It is preferable that a broad sample of text is used so that the final corpus accurately depicts the variety in the language. Obviously, not all utterances or acceptable sequences of text will be present, but a corpus based on representative samples is likely to have a wide and balanced mix of language. However, the use of corpora is countered by linguists such as Noam Chomsky, who comment that they can never be fully representative of real language (McEnery, Wilson, 2001). Chomsky believed that, by their very nature, corpora can never explain the inventiveness of language; many perfectly valid sentences will never be included while some invalid sentences will.

To fulfil the requirements of this project, a monolingual sample of news articles collected using a web spider will be required. In addition, if suitable text is found in existing corpora, it should be used too. To be representative, the articles must be modern and from differing authors.

Utilising Corpora

Corpora provide an ideal body of text that linguists can use to analyse the nuances of human language. Electronic corpora (now almost exclusively the only corpora in widespread use) can be analysed extensively by computer programs.
Corpora are relevant because they allow linguists (and others interested in language) to analyse a vast body of text using complicated algorithms, without having to conduct real-life surveys. A corpus can be used to provide large amounts of information on a language's grammar, spelling, conventions and morphology. Complex algorithms can be applied to extract higher-level information about the content or meaning of the text, rather than statistics on the text itself. This project will be utilising data from a corpus to categorise text.

Collecting Data

Traditionally, corpus data was either collated by hand or typed up for the specific purpose of including it in a corpus. Nowadays, corpus creation is considerably more automated. The use of the internet has allowed extensive corpora to be created with considerably less effort.

In the context of major world languages such as English, Russian, Arabic and Chinese, simple unannotated corpora can be collected with comparative ease because there are standardised encodings for the scripts used. This is not the case for Punjabi. In addition to the many scripts used, there are also multiple encodings for the Gurmukhi script. Until recently, the only way of representing Gurmukhi text on web pages was by using fonts that mask ASCII characters and show them as Gurmukhi characters. Different fonts have different mappings to ASCII characters, and no search engine indexes this as Gurmukhi text. Since the introduction of Windows XP with its Unicode support for Gurmukhi, increasing numbers of websites using Unicode have appeared. However, their presence is minuscule compared to the vast array of websites encoded using proprietary fonts. The creators of the EMILLE corpus encountered problems with the unavailability of standardised text and with considerable amounts of text in images (Singh, 2000). Fortunately, the use of images for Gurmukhi text has largely subsided in recent years.

Representing Data

In the past, different corpora would use different storage formats. In recent years there has been a move to standardise the formats used for corpora, spearheaded by the Text Encoding Initiative (TEI). The TEI uses SGML (Standard Generalised Markup Language) and, increasingly, XML (Extensible Markup Language). It provides DTDs which can be used to validate the conformity of the markup used. The TEI markup is quite complex when completely implemented. However, a stripped-down version known as TEI Lite P4 is available for use.

A subset of TEI Lite XML is likely to be used to store any corpus data collected for this project because it is simple to use, and the more advanced features of the standard TEI format are not needed.

Computing Difficulties in Creating a Corpus

Creating a large corpus can be very technically challenging. There is the initial need to find sources of data (websites, newspapers, books or speech), and then to automate a way to collect it all. In the case of extracting data from websites, a web robot or spider is required. Once the data has been collected, it must be cleaned (removing non-body text such as navigation elements or advertisements), converted to Unicode (or another appropriate encoding), and then the required text needs to be extracted and placed in a TEI file. It is relatively simple to do such tasks by hand, but when collecting a corpus of thousands or even hundreds of pages of text, it is simply too time-consuming to be economical.

Existing Tools for Punjabi

Existing corpora for Punjabi are limited. There exists an ISCII-encoded, unannotated corpus (created by the Indian Ministry of Information Technology) which has approximately three million words.² It includes a varied array of topics, including some news articles. The EMILLE corpus, created by Lancaster University, includes over fifteen million Punjabi words from a variety of sources, all encoded in Unicode. Of these, three million are in the Shahmukhi script, which makes them unsuitable for this project. Both of these corpora may be suitable for use in experimenting with topic identification. Their suitability will be discussed later.

The International Institute of Information Technology, India,³ has created a morphological analyser for Punjabi that runs on Unix-based systems. It takes a list of Punjabi words in Roman WX notation and returns the root word with additional morphological information. Roman WX has a one-to-one relationship with ISCII and should not present a problem in terms of converting encodings.

There are a number of high-quality multi- and monolingual Punjabi dictionaries available in print form. However, the presence of dictionaries of any quality in electronic form is limited. There are a number of basic dictionaries (or, more appropriately, word lists) available online, but there is nothing that compares to the breadth and quality of print dictionaries.

² This figure was calculated by estimating the number of words in the 14 MB corpus from a sample word count in a 200 KB file.
³ Available from

3.4. Text Categorisation and Topic Identification

Automatic text categorisation is the process of assigning a document to one or more categories based on its contents. It is a topic in natural language processing that has seen a significant increase in interest.

There are several approaches that are commonly employed when categorising text. One such approach uses decision trees. This method gives weightings to words and the number of times they occur, and then uses the weightings and a decision tree to categorise documents. This is a simple approach that can be highly effective (Manning, Schütze, 2000).

Maximum entropy is a technique used to calculate probability distributions from data. This can be applied to text categorisation and is an area of active research. Much like the use of decision trees, a maximum entropy model is applied to a document whose words have been weighted. A set of pre-tagged training data is used to estimate suitable constraints. These constraints are then used to estimate the probability of untagged data belonging to a particular category (Nigam et al., 1999). Maximum entropy modelling works by observing the features of the training data and formulating constraints based on the original data. It assumes nothing about the unknown data and aims to formulate a model which is uniform and factually consistent (Berger, 1996).

The k-nearest-neighbour classification is a relatively simple method used in natural language processing. It classifies a document by finding the most similar document (the nearest neighbour) in a training set and assigning its category to the new document (Manning, Schütze, 2000).

All the machine learning algorithms described above require some sort of pre-processing before text can be categorised. It is this pre-processing, or feature selection, that is crucial in correctly categorising text. Simply using all the words in the differing documents would not only be less accurate, but painfully slow.
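The nearest-neighbour idea can be sketched in a few lines. This toy version (all names hypothetical) uses raw word overlap as the similarity measure; practical systems instead weight terms and use a measure such as cosine similarity over weighted vectors.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy 1-nearest-neighbour classifier: assign the category of the
// training document that shares the most words with the input document.
public class NearestNeighbour {
    public static String classify(Set<String> doc, Map<Set<String>, String> training) {
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<Set<String>, String> entry : training.entrySet()) {
            Set<String> overlap = new HashSet<>(doc);
            overlap.retainAll(entry.getKey()); // words shared with this training document
            if (overlap.size() > bestOverlap) {
                bestOverlap = overlap.size();
                best = entry.getValue();
            }
        }
        return best;
    }
}
```

Even this crude similarity measure illustrates why feature selection matters: without stop-word removal, function words shared by every document would dominate the overlap counts.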
Techniques such as stemming (reducing all words to their root form), stop-lists (removing common words) and spelling unification (treating similarly spelled words as one) can all increase the accuracy of a machine learning algorithm that is attempting to categorise a document. There are also different ways that terms can be weighted. Some text categorisers may use complex algorithms to weight words, whilst others may just indicate whether a word is present or not in a particular document.

WEKA, created by the University of Waikato, implements many different machine learning algorithms (Witten, Frank, 2005). It is often used in text categorisation work to test the effectiveness of different algorithms.

4. Methodology

A software development methodology is required to standardise the development of any tools for this project. Three methodologies were evaluated for use (Bennett et al., 2002):

- Waterfall: the traditional life cycle with requirements analysis, design, construction, testing and installation stages.
- Waterfall with iteration: as above, but with the ability to return to any previous stage and alter subsequent stages as a result of changes in the development process.
- Unified software development process (USDP): comprises four distinct phases: inception, elaboration, construction and transition. Different workflows are concentrated on in different phases, but the workflows are constantly updated as time passes (Kruchten, 2001).

None of the methodologies listed is entirely satisfactory for this particular project. They are all geared towards the creation of an information system solution as their main product. The aim of this project is primarily to create a corpus and secondarily to implement a topic identification algorithm. As such, it was decided that the waterfall model with iteration (when required) would be the most appropriate method for the overall project. It allows a structured development process to occur, with the advantage of being able to go back if required and enhance or change the solution.

5. Creating a Corpus

5.1. Pipeline

Converting the bare HTML files to TEI Lite requires several individual, but automated, steps. Together they form a pipeline of processes, each stage producing the input for the next:

1. Web Spider: font-based HTML file (plus an information file)
2. HTML Tidy: font-based XHTML file
3. Metamorph: Unicode XHTML file
4. TEI Script: XML TEI Lite corpus file

The steps involved in each stage will be explained later.

5.2. Data Sources

As discussed previously, a corpus of categorised Punjabi texts will be created so that a topic identification algorithm can be applied to it later. In addition, existing data from two corpora will be evaluated:

- the EMILLE corpus,
- the TDIL ISCII corpus.

These will be evaluated to ensure they are encoded accurately, contain news articles and have the appropriate categories required. Websites will also be reviewed to see which are the most suitable to extract data from for our corpus.

EMILLE Corpus

The EMILLE corpus was initially thought to be a good base to extract data from because it contained several news articles in Punjabi. However, a further examination revealed the following problems:

- No articles were appropriately tagged with a genre or topic, which is necessary for the topic identification component of this project.
- Articles were not properly separated. Instead, large groups of articles were all stored as one big block of text, so it would have been difficult to split them up.
- There were huge inconsistencies and errors in the Unicode encoding.

By far the predominant reason that EMILLE was not used was the inconsistencies in the text encoding. In one sample, dotted circles indicate major errors in the Unicode text caused by an incorrect conversion from the source text: a particular consonant has consistently been mis-converted, and there has been no rearrangement of the components of a syllable (see section 3.2). In another sample, the character Tippi is always encoded as a Bindi, and geminate consonants are not encoded using Adhak but instead as consonant clusters using Virama; these errors are likely to stem from an incorrect conversion from ISCII to Unicode. A third sample shows considerable corruption that occurred when converting the text to Unicode.

TDIL ISCII Corpus

The ISCII-encoded corpus from TDIL was inappropriate because it was by no means a predominantly news-based corpus, nor did it have any markup on the text to indicate genre or topic. There were only a few news pages, which rendered it unsuitable for use in the topic identification portion of this project.

Websites

The internet has emerged as a phenomenal international corpus with a wide variety of data. There were several possible choices available for use. The main features looked for were:

- a large news archive,
- news categorisation,
- separated news articles,
- a print preview ability (which simplifies the removal of superfluous elements such as navigation bars).

A few well-known news sites were evaluated:

Ajit Weekly (DrChatrikWeb)
  Advantages: pages categorised into forty categories; very large news archives; separated news articles; print preview version.

Ajit Jalandhar (Satluj)
  Advantages: large news archives; pages categorised into geographic locations.
  Disadvantages: encoded in Satluj font; no print preview version.

Quami Ekta (DrChatrikWeb and Unicode)
  Advantages: some pages encoded in Unicode; most pages are categorised; separated news articles; print preview version.
  Disadvantages: small collection of news.

Sanjh Savera (DrChatrikWeb)
  Advantages: large news collection.
  Disadvantages: no print preview version; no categorisation of text.

Amritsar Times (DrChatrikWeb)
  Advantages: large news collection.
  Disadvantages: no print preview version; no categorisation of text; no uniform design aspect to facilitate data extraction.

5abi.com (Unicode)
  Advantages: most pages encoded in Unicode.
  Disadvantages: no print preview version; no categorisation of text.

The most suitable choices were Quami Ekta and Ajit Weekly. Quami Ekta was considered inappropriate because it did not contain anywhere near as many articles as Ajit Weekly, nor were its articles categorised in a way that would enable easy unification with the Ajit Weekly categories.

Conclusion

It was a rather clear-cut decision that the only way to create a suitable corpus for this project was to collect text from a large news website. The EMILLE corpus had deficiencies in encoding, article separation and categorisation. The TDIL corpus did not comprise the appropriate content matter required for this project. The Ajit Weekly website was selected as a suitable site from which data could be extracted, for the reasons stated above.

5.3. Data Collection

To facilitate data collection, it would be necessary to automate the collection of web pages. Creating a large corpus manually is not feasible because of the considerable amount of time it would take. There are several web spiders or web crawlers available online. However, most are geared towards generating search indexes. The ones geared towards corpora work, such as WebCorp (RDUES, 2005) and BootCaT (Baroni et al., 2005), were not suitable because they concentrated on random web page collection based on Google keywords. They also did not have features to restrict saved pages based on URL parameters. It was decided that, for maximum customisability, the best option would be to create a web spider suitable for this task.

Designing a Web Spider

Two separate development languages were considered: Java and C#. Both are very similar syntactically, and many of the class sets also have similarities. Both have high-level classes that simplify downloading using the HTTP protocol. Although there was no clear-cut superior language for the task, it was decided that Java would be the most appropriate language because of its cross-platform support and large user base.

There are two types of web spider that could be used: a depth-first spider or a breadth-first spider (Eddy and Haasch, 1996). The diagram below illustrates how a depth-first recursive spidering algorithm operates for a depth of 3 (0 being the root).

[Diagram: depth-first traversal of a site rooted at index.html, with pages step1.html, step2.html, step3.html, about.html and extra.html visited in the numbered order 1 to 8.]

1. Download and parse the first link in the root.
2. Download and parse the first link in child 1.
3. Download the first link in child 2.
4. Download the next link in child 2. In this case, index.html has already been downloaded, so download the next link after that.
5. Download the next link in child 2. In this case, all links have been downloaded.
6. Download and parse the next link in child 1. In this case, all links have been downloaded.
7. Download and parse the next link in the root. In this case, about.html has already been downloaded, so download the next link after that.
8. Continue.

The diagram below illustrates a breadth-first implementation, where the spider downloads all linked pages on a web page at a particular depth before moving on to the next depth level:

[Diagram: a breadth-first traversal over index.html, step1.html, step2.html, step3.html, about.html, extra.html and control.html, retrieving all links at each depth level in turn.]

In this diagram one can see how all links are retrieved at the next depth (unless the file has already been downloaded). Both a breadth-first and a depth-first implementation of the web spider will be created. The one which produces the most appropriate results will be used to collect the corpus. The web spider should save any web pages it retrieves, plus an additional information file containing the source URL and the date and time the page was retrieved.

Implementation

There are four separate classes required:

Main: Holds the entry function and joins all other classes together.
Downloader: Contains the download engine to facilitate the collection of web pages via HTTP.
Spider: The spidering algorithm itself.
Webpage: Contains a particular web page with additional metadata.

The methods for each of these classes are detailed below:

Webpage methods:

void addChild(Webpage) - Add the specified web page as a child of the current web page.
Webpage getChild(int) - Retrieve a child web page using its index number.
void removeChild(int); void removeChild(Webpage) - Remove a child web page using its index number or directly using the object reference.
int getChildCount() - Get the number of child web pages.
String getContent(); setContent(String) - Get/set the HTML text of the web page in string format.
byte[] getContentBytes(); setContentBytes(byte[]) - Get/set the web page as a byte array.
int getDepth(); setDepth(int) - Get/set the depth of the current web page.
String getURL(); setURL(String) - Get/set the URL of the current web page.
void savePage(String) - Save the current byte version of the web page to disk.

Downloader methods:

byte[] downloadPage(String) - Downloads the specified web page and returns it as a byte array.
String[] extractURLs(String) - Extracts all the URLs contained within the specified HTML text and returns them as a string array.

Spider methods:

void startBreadthSpider(String, String, int) - Initiates the breadth-first spider using a source URL, a save-to path and a maximum depth limit.
void startDepthSpider(String, String, int) - Initiates the depth-first spider using a source URL, a save-to path and a maximum depth limit.

To address unforeseen issues, it may be necessary to add different methods or alter the methods listed above. However, the general structure will remain the same.

Testing

The implementation of the depth-first spider was evaluated using various test sites. The depth-first spider was highly effective at downloading web pages, but it suffered from one major flaw. If a web page was a leaf node due to it being at the maximum depth for that branch, it would be

flagged as having been downloaded. Thus, if the spider encountered this page again further up the tree, it would not search its contents, because it had already been flagged as downloaded and processed. This issue was alleviated by also recording the depth at which a web page was first downloaded. If that depth was the maximum depth and the web page was re-encountered further up the tree, it would be parsed again for URLs but would not be saved to disk again.

Further testing revealed another problem with this implementation. If a page had been flagged as downloaded close to a leaf node and was then encountered further up the tree, it would not be processed, because its child nodes had already been processed. However, when it occurs further up the tree, there may be more child nodes below it to spider. This issue could have been alleviated by forcing the spider to continually retrieve all pages in its node list, even if they had previously been spidered. However, this would lead to an exponential increase in the number of pages downloaded or checked. Alternatively, the pages could be kept in memory and re-parsed.

The breadth-first implementation is more effective because it ensures all pages are parsed to their maximum depth. For example, in the breadth-first diagram above, the file step2.html contains a link to about.html. This will not be downloaded because about.html has already been parsed. However, the children of about.html will still be spidered to the maximum depth. In the depth-first approach, about.html would have been downloaded later on in the tree, which would have resulted in its children being spidered to a shallower depth.

Further Analysis and Improvements

Testing showed that the breadth-first implementation of the web spider was by far the most effective, so it will be used to collect the corpus.
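The breadth-first strategy chosen above can be sketched as follows. This is an illustrative sketch only: the real spider fetches pages over HTTP via the Downloader class, whereas here an in-memory map of page links stands in for downloadPage() and extractURLs(), and the crawl() method and its signature are hypothetical.

```java
import java.util.*;

// Illustrative breadth-first spider core. Links come from an in-memory
// map (page -> outgoing links) standing in for live HTTP downloads.
public class BreadthSpider {

    // Returns the pages that would be saved to disk: every reachable page
    // within maxDepth whose URL contains the mustContain filter string.
    public static List<String> crawl(Map<String, List<String>> links,
                                     String rootUrl, int maxDepth,
                                     String mustContain) {
        List<String> saved = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String[]> queue = new ArrayDeque<>(); // entries of [url, depth]

        queue.add(new String[] { rootUrl, "0" });
        visited.add(rootUrl);

        while (!queue.isEmpty()) {
            String[] entry = queue.poll();
            String url = entry[0];
            int depth = Integer.parseInt(entry[1]);

            // Save only pages matching the URL filter (e.g. "read_printable.asp").
            if (url.contains(mustContain)) {
                saved.add(url);
            }
            // Enqueue children unless we are at the maximum depth.
            if (depth < maxDepth) {
                for (String child : links.getOrDefault(url, List.of())) {
                    if (visited.add(child)) { // skip already-seen pages
                        queue.add(new String[] { child, String.valueOf(depth + 1) });
                    }
                }
            }
        }
        return saved;
    }
}
```

Because every page is dequeued at the shallowest depth at which it is reachable, its children are always explored to the maximum possible depth, which is exactly the property the depth-first version lacked.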
Ajit Weekly provides special printer-friendly article pages which are only used for proper articles and not for miscellaneous information pages on the web site. Using only the printer-friendly version therefore ensures that articles are downloaded without additional formatting or layout, which should reduce any post-processing that needs to be done on the articles. To ensure that the spider saves only the printer-friendly versions of articles, an extra parameter was introduced which restricts which pages are saved based on the content of the URL. To put this into effect for Ajit Weekly, only web pages whose URLs contain the string read_printable.asp will be saved (although all pages encountered will still be downloaded and processed).

Using the Spider

After the final version of the spider was tested and compiled, it was instructed to collect a corpus using the following command:

java -cp . spider.Main c:\spider\ajitweekly 500 read_printable.asp

The parameters are: source web site, local save-to path, maximum depth and "must contain" text. The spider collected 7,024 pages and reached a depth of 133 before the entire domain was spidered.

5.4. Unicode Conversion

The source data is not encoded in Unicode. For reasons of data interchange stability, it will be necessary to convert it to Unicode before it can be processed further.

The Problem

A font-encoding has no rules and simply treats Gurmukhi as an alphabet. This is contrary to rendering engines implementing Gurmukhi Unicode, which treat Gurmukhi as a syllable-centric script with strict enforcement of orthographic rules. Because a font-encoding does not enforce any rules, a completely identical syllable may be represented in multiple ways; in some cases there may be over ten identical ways to encode the same syllable. It is absolutely vital to ensure the corpus represents text in a normalised way, i.e. there should be only one way to represent a syllable. Having several identical pieces of text represented by differing underlying byte sequences makes analysis of the text much more difficult.

Take the hypothetical syllable:

ਸ੍ਰੇਂ [srēṁ]

In Unicode there is no ambiguity as to how this is encoded. In font-encodings there are at least six distinct ways of representing this syllable, because every diacritic is zero-spacing, which means the diacritics can be added in any order after the consonant and still appear correctly.

Unicode: ਸ + ੍ + ਰ + ੇ + ਂ

Font encoding: the subjoined ਰ glyph, the vowel sign and the nasal sign may follow ਸ in any of their six possible orders, all rendering identically.

This problem is exacerbated by the fact that it is nearly impossible to visually detect when the same non-spacing character is repeated. It is also difficult to detect when a larger non-spacing character overlaps a smaller one: for example, a syllable carrying both ੁ and ੂ renders almost identically to one carrying ੂ alone, with only the most visually apparent sign being seen.

These issues can cause a large number of errors when computationally processing Gurmukhi text. In terms of topic identification, for example, if several authors had created documents in the same category but with different typing orders, the machine-learning algorithm would be unable to recognise that many words with different underlying encodings are in fact the same. This can lead to significantly reduced accuracy when categorising documents. A comprehensive solution is required to fix these problems whilst also converting the text to Unicode.

Existing Tools

Unicodify was created by Lancaster University for collecting the EMILLE corpus and includes support for converting several fonts (AnmolLipi, Satluj and others) into Unicode (Hardie, 2004). Preliminary testing revealed that although a one-to-one conversion using Unicodify was fairly accurate, it did not include the error correction or rearrangement features necessary to normalise the text.

The Gurmukhi Unicode Conversion Application (GUCA) was developed by the author of this report and is released by the Punjabi Computing Resource Centre (Sidhu, 2004). GUCA was considerably more accurate than Unicodify, but it still did not include sufficient error correction. As a result, it will be necessary to implement an improved solution to ensure the corpus is accurate and normalised. GUCA is open source, which enables much of its code to be reused for this project.

Design

The programming language selected for the development of this program (called Metamorph) is C# on the .NET framework. Java was not used because Metamorph may make use of C# code from GUCA. C# is also cross-platform when used with Mono. GUCA uses a linear conversion algorithm, which is perfectly adequate for most text conversions. However, this is not likely to be the best solution for a script such as Gurmukhi, where analysis of whole syllables is required to fix errors.
The first task is to fully analyse the components and correct ordering of a Gurmukhi syllable. A syllable is composed of:

Consonant + Conjoined Consonant(s) + Vowel Sign + Nasal or Auxiliary Signs

Or, as a .NET regular expression:

(C)(N)*(HC(N)*)*(V)*(X)*

where C = consonant, N = pairin bindi [nukta], H = halant [virama], V = vowel sign and X = other signs; * indicates zero or more of the previous expression. An independent vowel is represented using a vowel bearer (treated as a consonant) and a vowel sign. The only compulsory component of a syllable is the consonant (the vowel [a] is inherent in a standalone consonant).
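As an illustration of this structure, the sketch below reorders the components of a single scrambled syllable into the (C)(N)*(HC(N)*)*(V)*(X)* order. It is a simplified sketch, not Metamorph's actual code: the character classification is approximate, and a virama is assumed to be glued to its conjoined consonant, as a single subjoined glyph would be in a font-encoding.

```java
import java.util.*;

// Sketch: normalise one syllable's components into the logical order
//   (C)(N)*(HC(N)*)*(V)*(X)*
// Assumes the input holds exactly one syllable whose components may be
// in any order.
public class SyllableReorder {

    // Rank: consonant < nukta < virama+consonant < vowel sign < other sign
    private static int rank(String token) {
        char c = token.charAt(0);
        if (c == '\u0A4D') return 2;                    // virama + conjoined consonant
        if (c == '\u0A3C') return 1;                    // nukta (pairin bindi)
        if ((c >= '\u0A3E' && c <= '\u0A42') ||
            (c >= '\u0A47' && c <= '\u0A48') ||
            (c >= '\u0A4B' && c <= '\u0A4C')) return 3; // vowel signs
        if (c == '\u0A01' || c == '\u0A02' ||
            c == '\u0A70' || c == '\u0A71') return 4;   // bindi, tippi, addak
        return 0;                                       // consonant / vowel bearer
    }

    public static String reorder(String syllable) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < syllable.length(); i++) {
            char c = syllable.charAt(i);
            if (c == '\u0A4D' && i + 1 < syllable.length()) {
                tokens.add(syllable.substring(i, i + 2)); // keep virama+consonant together
                i++;
            } else {
                tokens.add(String.valueOf(c));
            }
        }
        tokens.sort(Comparator.comparingInt(SyllableReorder::rank)); // stable sort
        return String.join("", tokens);
    }
}
```

Because the sort is stable, components that are already in a legal relative order (for example two conjoined consonants) keep that order.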

The next task is to separate the font-based text into syllables. Because vowel signs can come before the consonant, it is necessary to take this into account. This can be achieved using a specialised algorithm, as shown in appendix B.

Although Ajit Weekly only uses the DrChatrikWeb font, if designed properly, Metamorph could easily be extended to convert other fonts in the future. The vast array of differing fonts means that anyone wishing to create a large corpus in the future will require several converters. To address this issue, an intermediary encoding format (IEF) is required. Texts encoded in differing fonts (DrChatrikWeb, AnmolLipi, Satluj, and so on) are first converted to the single intermediary encoding; the normalisation algorithm is then applied exclusively to the intermediary encoding before the final conversion to Unicode.

For example, the character ƒ (0x0192) in AnmolLipi, ƒ (0x0192) in DrChatrikWeb and ù (0x00F9) in Satluj all represent the same text. Each is first mapped to the IEF sequence F128 + F174 + F17C, which is then converted to the Unicode sequence 0A28 + 0A42 + 0A70.

An ideal format in which to store information about mappings (i.e. which byte sequence is turned into which other byte sequence) is XML. Not only is it simple to use, but it allows future extensibility. Using XML files gives the advantage of being able to add new mappings without recompiling or reinstalling the program, and allows users to add their own customised mappings. See appendix C for details of the file format. This approach also allows for the introduction of font-to-font converters via the IEF. For details of the IEF, see appendix D.

Once a file has been converted to the IEF, it can then be converted into Unicode. For this to be done, Metamorph must classify each individual syllable component and then re-order the syllable based on the logical encoding rules of the Unicode Standard.
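A minimal sketch of the two-stage mapping is shown below. The real mappings live in XML files; here the single AnmolLipi example from the text (0192 → F128 F174 F17C → 0A28 0A42 0A70) is hard-coded in the test, and the class and method names are illustrative only.

```java
import java.util.*;

// Sketch of the two-stage conversion: font bytes -> IEF -> Unicode.
// A mapping is applied by always preferring the longest matching
// source sequence at each position; unmapped text passes through.
public class MappingConverter {

    public static String apply(String input, Map<String, String> mapping) {
        int maxLen = 0;
        for (String key : mapping.keySet()) maxLen = Math.max(maxLen, key.length());
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            boolean matched = false;
            // Try the longest candidate first so multi-character source
            // sequences win over their own prefixes.
            for (int len = Math.min(maxLen, input.length() - i); len > 0; len--) {
                String target = mapping.get(input.substring(i, i + len));
                if (target != null) {
                    out.append(target);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) out.append(input.charAt(i++)); // pass through unmapped text
        }
        return out.toString();
    }
}
```

Converting from a font to Unicode is then two calls: apply the font-to-IEF mapping, then the IEF-to-Unicode mapping; a font-to-font converter composes two font/IEF mappings instead.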

For example, a font-encoded text sequence for the syllable above might be ordered: vowel sign, consonant, nasal sign and conjoined consonant. When converted to Unicode, this should be in the order: consonant, conjoined consonant, vowel sign and nasal sign.

Implementation

There is only one class required for the core of the implementation, referred to as the ConversionEngine. Its methods are detailed below:

string[] ListMappings(string) - Returns an array of filenames of all mapping files in the specified directory.
MappingDetails LoadMapping(string) - Loads the specified mapping file.
string Convert(string, MappingDetails) - Converts the specified string to another encoding using the specified mapping file.
string ConvertToUnicode(string) - Converts the specified IEF-encoded string into Unicode.

The MappingDetails class contains information about the mappings obtained from an XML file. It contains metadata (author, copyright, etc.) and a list of one-to-one mappings.

A simple user interface was created that allows a user to enter text at the top, which is converted and shown at the bottom. In addition, an Options dialog allows users to alter settings such as font size and installed mappings.

Testing

This initial implementation was tested once a mapping file for DrChatrikWeb had been created. The conversion was as expected, but no error correction procedures were implemented, nor was there any easy way to convert the masses of HTML text automatically. To address these outstanding issues, a second iteration commenced.

Second Iteration Design

As mentioned earlier, it is crucial that the conversion utility also has some basic, but highly effective, error correction procedures. These are designed to repair very common (but fixable) errors that can occur in font-encodings. Metamorph should make the following corrections:

Automatic correction of nasal signs based on the accompanying vowel,
Correcting invalid combinations of a vowel bearer and vowel sign,
Selecting the most visible vowel sign if two or more vowel signs overlap,
Removing duplicated vowel signs,
Removing duplicated conjuncts which are not used in Gurmukhi.

The first correction feature (nasal sign selection) is only suitable for modern text. Archaic Gurmukhi text (pre-1950s) may not necessarily follow this convention, so it is not always applicable. Fortunately, all the text on Ajit Weekly is modern, being a few years old at most.

Metamorph also needs to be expanded to support conversion of HTML files. Without the ability to convert files automatically, it would take considerable time and effort to convert each individual text element. There are two approaches available to process HTML files:

Process as SGML,
Convert to XHTML and process as XML.

The first approach has the advantage that the HTML does not need to be altered before processing. However, this is offset by the fact that SGML is more difficult to process and the .NET framework does not offer an SGML parser. The second approach requires a utility for conversion into XML. A cross-platform application called HTML Tidy (Raggett, 2005) does this efficiently and is available free. Parsing valid XML is easier than parsing SGML, so this approach will be taken. It will be necessary for Metamorph to convert XML tags based on variables such as tag names and attribute values.
To ensure this is as versatile as possible, regular expressions will be supported when selecting the appropriate tags.

Implementation of Additional Features

The ConversionEngine class has been expanded to include an ErrorCorrection property, which causes certain checks and repair features to be activated. A BatchProcessor class has been introduced to handle the processing of large numbers of XML files. It features options to select tags based on regular expressions applied to tag names and attribute values.
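One of the corrections listed above, removing duplicated vowel signs, can be sketched with a regular expression that collapses immediately repeated identical signs, which render the same as a single sign. The character class used here is an approximation of the Gurmukhi dependent-sign ranges, not Metamorph's actual implementation.

```java
import java.util.regex.Pattern;

// Sketch of one error-correction step: collapse an immediately repeated
// identical dependent sign (vowel sign, nukta, nasal/auxiliary sign)
// into a single occurrence, since the repeats are visually undetectable.
public class DuplicateSignFix {
    private static final Pattern DUP =
        Pattern.compile("([\u0A01\u0A02\u0A3C\u0A3E-\u0A4D\u0A70\u0A71])\\1+");

    public static String fix(String text) {
        return DUP.matcher(text).replaceAll("$1");
    }
}
```

The same backreference pattern generalises to the duplicated-conjunct correction by widening the character class to cover subjoined sequences.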

It also has general options to select which mapping to use and to indicate how the files should be renamed. These new additions were extensively tested to ensure that the output text was as desired. The batch processor was tested to ensure that only the appropriate portions of each XML file were converted and that the output was valid XML.

Converting HTML Files to Unicode

Detecting the font used in a particular HTML tag is no trivial task. There are many ways to specify fonts: cascading style sheets (CSS), style attributes and font tags. Any embedded tags inherit the styling of their parent tags unless they specifically override it. Fortunately, in the case of Ajit Weekly, all formatting was done via CSS, so it was sufficient simply to check the class attribute of the font tag. This was simplified further by the fact that the spider only saved the print preview versions of articles.

The XML conversion facility created in Metamorph enables users to specify regular expressions to determine which XML tags to convert. For Ajit Weekly, a simple set of expressions was required:

Tag: ^font$
Attribute: class
Attribute value: ^drc

Metamorph then processed the files, converting the text in any tags that matched the regular expression queries above.
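The three-part match above (tag name, attribute name, attribute value) can be sketched as below. The class and method names are hypothetical; this shows only the selection test, not the XML traversal itself.

```java
import java.util.regex.Pattern;

// Sketch of the tag-selection test used when batch-converting XHTML:
// a tag's text is converted only if the tag name, attribute name and
// attribute value all match their configured regular expressions
// (for Ajit Weekly: ^font$, class, ^drc).
public class TagSelector {
    private final Pattern tagName, attrName, attrValue;

    public TagSelector(String tagRe, String attrRe, String valueRe) {
        tagName = Pattern.compile(tagRe);
        attrName = Pattern.compile(attrRe);
        attrValue = Pattern.compile(valueRe);
    }

    public boolean shouldConvert(String tag, String attr, String value) {
        return tagName.matcher(tag).find()
            && attrName.matcher(attr).find()
            && attrValue.matcher(value).find();
    }
}
```

Anchored patterns such as ^font$ match the whole tag name, while an open-ended prefix pattern such as ^drc matches any class name in the drc family.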


More information

Survey of Language Computing in Asia 2005

Survey of Language Computing in Asia 2005 Survey of Language Computing in Asia 2005 Sarmad Hussain Nadir Durrani Sana Gul Center for Research in Urdu Language Processing National University of Computer and Emerging Sciences www.nu.edu.pk www.idrc.ca

More information

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 25 Tutorial 5: Analyzing text using Python NLTK Hi everyone,

More information

Structured documents

Structured documents Structured documents An overview of XML Structured documents Michael Houghton 15/11/2000 Unstructured documents Broadly speaking, text and multimedia document formats can be structured or unstructured.

More information

Issues in Indic Language Collation

Issues in Indic Language Collation Issues in Indic Language Collation Cathy Wissink Program Manager, Windows Globalization Microsoft Corporation I. Introduction As the software market for India i grows, so does the interest in developing

More information

DESIGNING A DIGITAL LIBRARY WITH BENGALI LANGUAGE S UPPORT USING UNICODE

DESIGNING A DIGITAL LIBRARY WITH BENGALI LANGUAGE S UPPORT USING UNICODE 83 DESIGNING A DIGITAL LIBRARY WITH BENGALI LANGUAGE S UPPORT USING UNICODE Rajesh Das Biswajit Das Subhendu Kar Swarnali Chatterjee Abstract Unicode is a 32-bit code for character representation in a

More information

1. Introduction 2. TAMIL LETTER SHA Character proposed in this document About INFITT and INFITT WG

1. Introduction 2. TAMIL LETTER SHA Character proposed in this document About INFITT and INFITT WG Dated: September 14, 2003 Title: Proposal to add TAMIL LETTER SHA Source: International Forum for Information Technology in Tamil (INFITT) Action: For consideration by UTC and ISO/IEC JTC 1/SC 2/WG 2 Distribution:

More information

Andrew Glass and Shriramana Sharma. anglass-at-microsoft-dot-com jamadagni-at-gmail-dot-com November-2

Andrew Glass and Shriramana Sharma. anglass-at-microsoft-dot-com jamadagni-at-gmail-dot-com November-2 Proposal to encode 1107F BRAHMI NUMBER JOINER (REVISED) Andrew Glass and Shriramana Sharma anglass-at-microsoft-dot-com jamadagni-at-gmail-dot-com 1. Background 2011-vember-2 In their Brahmi proposal L2/07-342

More information

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS 82 CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS In recent years, everybody is in thirst of getting information from the internet. Search engines are used to fulfill the need of them. Even though the

More information

The Unicode Standard Version 12.0 Core Specification

The Unicode Standard Version 12.0 Core Specification The Unicode Standard Version 12.0 Core Specification To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Overview. Introduction. Introduction XML XML. Lecture 16 Introduction to XML. Boriana Koleva Room: C54

Overview. Introduction. Introduction XML XML. Lecture 16 Introduction to XML. Boriana Koleva Room: C54 Overview Lecture 16 Introduction to XML Boriana Koleva Room: C54 Email: bnk@cs.nott.ac.uk Introduction The Syntax of XML XML Document Structure Document Type Definitions Introduction Introduction SGML

More information

Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd.

Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd. Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd. This paper was presented at the ICADL conference December 2001 by Spectrum

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

L2/ Proposal to encode archaic vowel signs O OO for Kannada. 1. Thanks. 2. Introduction

L2/ Proposal to encode archaic vowel signs O OO for Kannada. 1. Thanks. 2. Introduction L2/14-004 Proposal to encode archaic vowel signs O OO for Kannada Shriramana Sharma, jamadagni-at-gmail-dot-com, India 2013-Dec-31 1. Thanks I thank Srinidhi of Tumkur, Karnataka, for alerting me to these

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

Implementing Web Content

Implementing Web Content Implementing Web Content Tonia M. Bartz Dr. David Robins Individual Investigation SLIS Site Redesign 6 August 2006 Appealing Web Content When writing content for a web site, it is best to think of it more

More information

Activity Report at SYSTRAN S.A.

Activity Report at SYSTRAN S.A. Activity Report at SYSTRAN S.A. Pierre Senellart September 2003 September 2004 1 Introduction I present here work I have done as a software engineer with SYSTRAN. SYSTRAN is a leading company in machine

More information

The Unicode Standard Version 6.1 Core Specification

The Unicode Standard Version 6.1 Core Specification The Unicode Standard Version 6.1 Core Specification To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers

More information

The XML Metalanguage

The XML Metalanguage The XML Metalanguage Mika Raento mika.raento@cs.helsinki.fi University of Helsinki Department of Computer Science Mika Raento The XML Metalanguage p.1/442 2003-09-15 Preliminaries Mika Raento The XML Metalanguage

More information

Content Enrichment. An essential strategic capability for every publisher. Enriched content. Delivered.

Content Enrichment. An essential strategic capability for every publisher. Enriched content. Delivered. Content Enrichment An essential strategic capability for every publisher Enriched content. Delivered. An essential strategic capability for every publisher Overview Content is at the centre of everything

More information

Keyboards for inputting Chinese Language: A study based on US Patents

Keyboards for inputting Chinese Language: A study based on US Patents From the SelectedWorks of Umakant Mishra April, 2005 Keyboards for inputting Chinese Language: A study based on US Patents Umakant Mishra Available at: https://works.bepress.com/umakant_mishra/11/ Keyboard

More information

Introduction to Tools for IndoWordNet and Word Sense Disambiguation

Introduction to Tools for IndoWordNet and Word Sense Disambiguation Introduction to Tools for IndoWordNet and Word Sense Disambiguation Arindam Chatterjee, Salil Rajeev Joshi, Mitesh M. Khapra, Pushpak Bhattacharyya { arindam, salilj, miteshk, pb }@cse.iitb.ac.in Department

More information

Informatics 1: Data & Analysis

Informatics 1: Data & Analysis Informatics 1: Data & Analysis Lecture 9: Trees and XML Ian Stark School of Informatics The University of Edinburgh Tuesday 11 February 2014 Semester 2 Week 5 http://www.inf.ed.ac.uk/teaching/courses/inf1/da

More information

097B Ä DEVANAGARI LETTER GGA 097C Å DEVANAGARI LETTER JJA 097E Ç DEVANAGARI LETTER DDDA 097F É DEVANAGARI LETTER BBA

097B Ä DEVANAGARI LETTER GGA 097C Å DEVANAGARI LETTER JJA 097E Ç DEVANAGARI LETTER DDDA 097F É DEVANAGARI LETTER BBA ISO/IEC JTC1/SC2/WG2 N2934 L2/05-082 2005-03-30 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation еждународная организация

More information

QDA Miner. Addendum v2.0

QDA Miner. Addendum v2.0 QDA Miner Addendum v2.0 QDA Miner is an easy-to-use qualitative analysis software for coding, annotating, retrieving and reviewing coded data and documents such as open-ended responses, customer comments,

More information

Annotation Science From Theory to Practice and Use Introduction A bit of history

Annotation Science From Theory to Practice and Use Introduction A bit of history Annotation Science From Theory to Practice and Use Nancy Ide Department of Computer Science Vassar College Poughkeepsie, New York 12604 USA ide@cs.vassar.edu Introduction Linguistically-annotated corpora

More information

MythoLogic: problems and their solutions in the evolution of a project

MythoLogic: problems and their solutions in the evolution of a project 6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. MythoLogic: problems and their solutions in the evolution of a project István Székelya, Róbert Kincsesb a Department

More information

CA Productivity Accelerator 12.1 and Later

CA Productivity Accelerator 12.1 and Later CA Productivity Accelerator 12.1 and Later Localize Content Localize Content Once you have created content in one language, you might want to translate it into one or more different languages. The Developer

More information

DOWNLOAD OR READ : URDU HINDI DICTIONARY IN DEVNAGRI SCRIPT PDF EBOOK EPUB MOBI

DOWNLOAD OR READ : URDU HINDI DICTIONARY IN DEVNAGRI SCRIPT PDF EBOOK EPUB MOBI DOWNLOAD OR READ : URDU HINDI DICTIONARY IN DEVNAGRI SCRIPT PDF EBOOK EPUB MOBI Page 1 Page 2 urdu hindi dictionary in devnagri script urdu hindi dictionary in pdf urdu hindi dictionary in devnagri script

More information

The Unicode Standard Version 11.0 Core Specification

The Unicode Standard Version 11.0 Core Specification The Unicode Standard Version 11.0 Core Specification To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers

More information

Midterm 1 Review Sheet CSS 305 Sp 06

Midterm 1 Review Sheet CSS 305 Sp 06 This is a list of topics that we have covered so far. This is not all inclusive of every detail and there may be items on the exam that are not explicitly listed here, but these are the primary topics

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: Designing Methodologies of Tamil Language in Web Services C.

More information

Chapter 7. Representing Information Digitally

Chapter 7. Representing Information Digitally Chapter 7 Representing Information Digitally Learning Objectives Explain the link between patterns, symbols, and information Determine possible PandA encodings using a physical phenomenon Encode and decode

More information

SDMX self-learning package No. 5 Student book. Metadata Structure Definition

SDMX self-learning package No. 5 Student book. Metadata Structure Definition No. 5 Student book Metadata Structure Definition Produced by Eurostat, Directorate B: Statistical Methodologies and Tools Unit B-5: Statistical Information Technologies Last update of content December

More information

Survey of Language Computing in Asia 2005

Survey of Language Computing in Asia 2005 Survey of Language Computing in Asia 2005 Sarmad Hussain Nadir Durrani Sana Gul Center for Research in Urdu Language Processing National University of Computer and Emerging Sciences www.nu.edu.pk www.idrc.ca

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

XML 2 APPLICATION. Chapter SYS-ED/ COMPUTER EDUCATION TECHNIQUES, INC.

XML 2 APPLICATION. Chapter SYS-ED/ COMPUTER EDUCATION TECHNIQUES, INC. XML 2 APPLIATION hapter SYS-ED/ OMPUTER EDUATION TEHNIQUES, IN. Objectives You will learn: How to create an XML document. The role of the document map, prolog, and XML declarations. Standalone declarations.

More information

UniTerm Formats and Terminology Exchange

UniTerm Formats and Terminology Exchange Wolfgang Zenk UniTerm Formats and Terminology Exchange Abstract This article presents UniTerm, a typical representative of terminology management systems (TMS). The first part will highlight common characteristics

More information

PROPOSALS FOR MALAYALAM AND TAMIL SCRIPTS ROOT ZONE LABEL GENERATION RULES

PROPOSALS FOR MALAYALAM AND TAMIL SCRIPTS ROOT ZONE LABEL GENERATION RULES PROPOSALS FOR MALAYALAM AND TAMIL SCRIPTS ROOT ZONE LABEL GENERATION RULES Publication Date: 23 November 2018 Prepared By: IDN Program, ICANN Org Public Comment Proceeding Open Date: 25 September 2018

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 23 Hierarchical Memory Organization (Contd.) Hello

More information

Title: Application to include Arabic alphabet shapes to Arabic 0600 Unicode character set

Title: Application to include Arabic alphabet shapes to Arabic 0600 Unicode character set Title: Application to include Arabic alphabet shapes to Arabic 0600 Unicode character set Action: For consideration by UTC and ISO/IEC JTC1/SC2/WG2 Author: Mohammad Mohammad Khair Date: 17-Dec-2018 Introduction:

More information

Survey of Language Computing in Asia 2005

Survey of Language Computing in Asia 2005 Survey of Language Computing in Asia 2005 Sarmad Hussain Nadir Durrani Sana Gul Center for Research in Urdu Language Processing National University of Computer and Emerging Sciences www.nu.edu.pk www.idrc.ca

More information

Orange3-Textable Documentation

Orange3-Textable Documentation Orange3-Textable Documentation Release 3.0a1 LangTech Sarl Dec 19, 2017 Contents 1 Getting Started 3 1.1 Orange Textable............................................. 3 1.2 Description................................................

More information

Introduction to Text Mining. Aris Xanthos - University of Lausanne

Introduction to Text Mining. Aris Xanthos - University of Lausanne Introduction to Text Mining Aris Xanthos - University of Lausanne Preliminary notes Presentation designed for a novice audience Text mining = text analysis = text analytics: using computational and quantitative

More information

PROFESSIONAL TUTORIAL. Trinity Innovations 2010 All Rights Reserved.

PROFESSIONAL TUTORIAL. Trinity Innovations 2010 All Rights Reserved. PROFESSIONAL TUTORIAL Trinity Innovations 2010 All Rights Reserved www.3dissue.com PART ONE Converting PDFs into the correct JPEG format To create a new digital edition from a PDF we are going to use the

More information

CID-Keyed Font Technology Overview

CID-Keyed Font Technology Overview CID-Keyed Font Technology Overview Adobe Developer Support Technical Note #5092 12 September 1994 Adobe Systems Incorporated Adobe Developer Technologies 345 Park Avenue San Jose, CA 95110 http://partners.adobe.com/

More information

XDS An Extensible Structure for Trustworthy Document Content Verification Simon Wiseman CTO Deep- Secure 3 rd June 2013

XDS An Extensible Structure for Trustworthy Document Content Verification Simon Wiseman CTO Deep- Secure 3 rd June 2013 Assured and security Deep-Secure XDS An Extensible Structure for Trustworthy Document Content Verification Simon Wiseman CTO Deep- Secure 3 rd June 2013 This technical note describes the extensible Data

More information

Parser Design. Neil Mitchell. June 25, 2004

Parser Design. Neil Mitchell. June 25, 2004 Parser Design Neil Mitchell June 25, 2004 1 Introduction A parser is a tool used to split a text stream, typically in some human readable form, into a representation suitable for understanding by a computer.

More information

Ranking in a Domain Specific Search Engine

Ranking in a Domain Specific Search Engine Ranking in a Domain Specific Search Engine CS6998-03 - NLP for the Web Spring 2008, Final Report Sara Stolbach, ss3067 [at] columbia.edu Abstract A search engine that runs over all domains must give equal

More information

Preserving Non-essential Information Related to the Presentation of a Language Instance. Terje Gjøsæter and Andreas Prinz

Preserving Non-essential Information Related to the Presentation of a Language Instance. Terje Gjøsæter and Andreas Prinz Preserving Non-essential Information Related to the Presentation of a Language Instance Terje Gjøsæter and Andreas Prinz Faculty of Engineering and Science, University of Agder Serviceboks 509, NO-4898

More information

WYSIWON T The XML Authoring Myths

WYSIWON T The XML Authoring Myths WYSIWON T The XML Authoring Myths Tony Stevens Turn-Key Systems Abstract The advantages of XML for increasing the value of content and lowering production costs are well understood. However, many projects

More information

Categorizing Migrations

Categorizing Migrations What to Migrate? Categorizing Migrations A version control repository contains two distinct types of data. The first type of data is the actual content of the directories and files themselves which are

More information

Isolated Handwritten Words Segmentation Techniques in Gurmukhi Script

Isolated Handwritten Words Segmentation Techniques in Gurmukhi Script Isolated Handwritten Words Segmentation Techniques in Gurmukhi Script Galaxy Bansal Dharamveer Sharma ABSTRACT Segmentation of handwritten words is a challenging task primarily because of structural features

More information

Installation BEFORE INSTALLING! Minimum System Requirements

Installation BEFORE INSTALLING! Minimum System Requirements Installation BEFORE INSTALLING! NOTE: It is recommended that you quit all other applications before running this program. NOTE: Some virus detection programs can be set to scan files on open. This setting

More information