Issues in Khmer Unicode 4.0

Javier Solá, Open Forum of Cambodia
Version /21/2004

Abstract

Khmer Unicode 4.0 changed the standard order of components in ways that make it incompatible with Unicode 3.0 and that introduce ambiguity into the standard order for modern Khmer, permitting different orders that produce the same graphic representation. This document examines these and other aspects of Khmer Unicode 4.0 and proposes a new standard order of components that, while still permitting old Khmer forms to be typed, eliminates the ambiguity and the backwards-compatibility problems.

1. Introduction

This document covers the following points.

One of the components of Khmer script is the Consonant Shifter, a character that modifies the sound of the consonant it relates to. In the Khmer Unicode 3.0 standard order of components this character was placed after the Base Consonant and after any subscript consonants (a second or third consonant within the orthographic syllable). In Unicode 4.0 the Consonant Shifter is placed after the Base Consonant but before any subscript consonants. Text written following the Unicode 3.0 rules is therefore no longer compatible with Unicode in its present form (version 4.0). This is the kind of situation that Unicode, in theory, tries to avoid. It should be corrected by accepting both placements in the next version of Unicode, so that the next version is compatible with both Unicode 3.0 and Unicode 4.0.

A construction that does not exist in Khmer has been included in the standard order of components: the Robat sign after a subscript consonant, something that never happens in Khmer.

Also, after all vowels and signs, a Khmer coeng consonant is included once again at the end of the standard order of components. This placement never occurs in modern Khmer, but was used in some special cases in old Khmer. Its inclusion leads to ambiguity, allowing users of modern Khmer to encode words in two different ways that produce the same representation. This complicates collation, searching and spell checking. A solution that still allows the coding of old Khmer is proposed.

The accompanying text of the Unicode standard describes, and gives examples of, the use of the ZERO WIDTH NON-JOINER character in a location not specified in the standard order of components. That location should be included.

2. Background

In Unicode 3.0, a Khmer orthographic syllable is considered to be of the form:

B {S}* {C} {V} {O}

Where (note 1):

B is a consonant or independent vowel
S is a subscript consonant or independent vowel sign
C is a consonant shifter
V is a dependent vowel
O is any other Khmer sign

In most cases this form agrees with the way Cambodians spell in their language.

Unicode 4.0, however, defines the standard order of components in a Khmer orthographic syllable, expressed in BNF, as:

B {R | C} {S {R}}* {{Z} V} {O} {S}

Where:

B is a base character (consonant character, independent vowel character, and so on)
R is a robat
C is a consonant shifter
S is a subscript consonant or independent vowel sign
V is a dependent vowel sign
Z is the zero width non-joiner
O is any other sign

Furthermore, the text (page 281) says that a Z (zero width non-joiner) can also be placed before the C (consonant shifter), making the final form for Unicode 4.0:

B {R | {Z} C} {S {R}}* {{Z} V} {O} {S}

Note 1. These are not the names or abbreviations that Unicode 3.0 gives to the components. To allow comparison with Unicode 4.0, the names and abbreviations used here for the Unicode 3.0 standard order of components are the same as those used in Unicode 4.0.
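As a rough illustration of how the two orders differ, the sketch below reduces a Khmer string to one letter per component and checks it against each order written as a regular expression. The function name `classify`, the set chosen for "other signs", the sample sequences (artificial, picked only to exercise the grammar) and the reading of robat and consonant shifter as alternatives in the first optional slot are assumptions of this sketch, not part of either standard's text.

```python
import re

COENG = 0x17D2   # KHMER SIGN COENG, introduces a subscript
ZWNJ = 0x200C    # ZERO WIDTH NON-JOINER
ROBAT = 0x17CC   # KHMER SIGN ROBAT
SHIFTERS = {0x17C9, 0x17CA}   # muusikatoan, triisap
# Assumption for this sketch: which signs count as "other signs" (O).
OTHER_SIGNS = {0x17C6, 0x17C7, 0x17C8, 0x17CB, 0x17CD, 0x17CE,
               0x17CF, 0x17D0, 0x17D1, 0x17DD}


def is_base(cp):
    """Consonants (U+1780..U+17A2) and independent vowels (U+17A3..U+17B3)."""
    return 0x1780 <= cp <= 0x17B3


def classify(text):
    """Return one letter per component: B, S, R, C, V, O, Z, or ? if unknown."""
    cps = [ord(ch) for ch in text]
    out = []
    i = 0
    while i < len(cps):
        cp = cps[i]
        if cp == COENG and i + 1 < len(cps) and is_base(cps[i + 1]):
            out.append("S")      # coeng + letter forms one subscript
            i += 2
            continue
        if is_base(cp):
            out.append("B")
        elif cp == ROBAT:
            out.append("R")
        elif cp in SHIFTERS:
            out.append("C")
        elif 0x17B6 <= cp <= 0x17C5:
            out.append("V")
        elif cp in OTHER_SIGNS:
            out.append("O")
        elif cp == ZWNJ:
            out.append("Z")
        else:
            out.append("?")
        i += 1
    return "".join(out)


# The two standard orders, written over component letters.
UNICODE_3_0 = re.compile(r"BS*C?V?O?")
UNICODE_4_0 = re.compile(r"B(?:R|Z?C)?(?:SR?)*(?:Z?V)?O?S?")

# ka + coeng + ta + muusikatoan: shifter after the subscript (3.0 position)
shifter_last = "\u1780\u17D2\u178F\u17C9"
# ka + muusikatoan + coeng + ta: shifter before the subscript (4.0 position)
shifter_first = "\u1780\u17C9\u17D2\u178F"

for s in (shifter_last, shifter_first):
    letters = classify(s)
    print(letters,
          bool(UNICODE_3_0.fullmatch(letters)),
          bool(UNICODE_4_0.fullmatch(letters)))
```

Under these assumptions, each of the two sample sequences is accepted by exactly one of the two orders, which is the incompatibility the following sections discuss.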

3. Coeng (subscript) consonants at the end of the syllable

Khmer Unicode order is based on Khmer spelling order, which is normally different from handwriting order. In spelling order, in modern Khmer, vowels are always placed after coeng (subscript) consonants, because it is the last coeng consonant whose sound is continued by the vowel.

Unicode 4.0 places coeng consonants between the base consonant and the vowel (their traditional location, as well as their Unicode 3.0 location), but it also includes a second placement at the very end of the standard order of components, in order to be compatible with old forms of Khmer. This leads to ambiguity in modern Khmer, as words with vowels and coeng consonants can be written in two different ways.

Following this rule, the word ក្តី could be spelled in two different ways, leading to identical representation:

The Unicode 3.0 way (Khmer spelling order): ka + coeng + ta + ii
Placing the vowel before the coeng consonant: ka + ii + coeng + ta

This, of course, leads to extreme difficulty for searching, collation and spelling algorithms.

The ambiguity could be resolved, while still allowing rare old forms, by permitting the final coeng consonant only when it is preceded by a ZWJ (ZERO WIDTH JOINER) character. Old forms remain possible, but no mistakes or ambiguities can arise in modern Khmer. With this change, an old form that uses a coeng consonant after a vowel should use the ZWJ character, as in:

ទង to + a + nikahit + ZWJ + coeng + ngo [ both (= ទង )]

The standard order of components would change from

B {R | C} {S {R}}* {{Z} V} {O} {S}

to

B {R | C} {S {R}}* {{Z} V} {O} {ZJ S}

where ZJ stands for ZERO WIDTH JOINER.
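The ambiguity can be seen directly at the code point level. The sketch below builds the two spellings of the example word, checks whether Python's `unicodedata` normalization makes them equal, and then tests component-letter strings against the Unicode 4.0 order and the order proposed in this section. The letter J for ZWJ (ZJ in the text), the helper name `accepted`, and the reading of robat and consonant shifter as alternatives in the first slot are conventions of this sketch only.

```python
import re
import unicodedata

KA, TA, II, COENG = "\u1780", "\u178F", "\u17B8", "\u17D2"

spelling_order = KA + COENG + TA + II     # ka + coeng + ta + ii
vowel_first = KA + II + COENG + TA        # ka + ii + coeng + ta

# Same intended word, different code point sequences.
print([f"U+{ord(c):04X}" for c in spelling_order])
print([f"U+{ord(c):04X}" for c in vowel_first])

# Normalization does not merge the two spellings into one.
same_after_nfc = (unicodedata.normalize("NFC", spelling_order)
                  == unicodedata.normalize("NFC", vowel_first))
print(same_after_nfc)

# Orders over component letters (B, R, C, S, V, O, Z; J = ZWJ).
UNICODE_4_0 = re.compile(r"B(?:R|C)?(?:SR?)*(?:Z?V)?O?S?")
PROPOSED = re.compile(r"B(?:R|C)?(?:SR?)*(?:Z?V)?O?(?:JS)?")


def accepted(order, letters):
    """True if the whole component-letter string fits the given order."""
    return order.fullmatch(letters) is not None


# "BSV" is the spelling-order form, "BVS" the vowel-first form.
for letters in ("BSV", "BVS", "BVJS"):
    print(letters, accepted(UNICODE_4_0, letters), accepted(PROPOSED, letters))
```

With the proposed order, a string comparison or sort key only ever sees one modern spelling, while the old form remains expressible through the explicit ZWJ.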

4. Robat after a coeng (subscript) consonant

The Unicode 4.0 book says about robat:

"The Khmer sign robat historically corresponds to the Devanagari repha, a representation of syllable-initial r-. However, the Khmer script can treat the initial r- in the same way as the other initial consonants, namely, a consonant character ro and as many subscript consonant signs as necessary. There are old loan words from Sanskrit and Pali including robat, but in some of them, the robat is not pronounced and is preserved in a fossilized spelling. Because robat is a distinct sign from the consonant character ro, the Unicode Standard encodes U+17CC KHMER SIGN ROBAT while it treats the Devanagari repha as a part of a ligature without encoding it. The authoritative Chuon Nath dictionary sorts robat as if it were a base consonant character, just as the repha is sorted in scripts that use it. The consonant over which robat resides is then sorted as if it were a subscript."

Examples of consonant clusters beginning with ro and robat:

ចរ ro + aa + co + ro + coeng + sa + ii [rè'crsei] king hermit
យ qa + aa + yo + robat [paqrya] civilized ( រយ, qa + aa + ro + coeng + yo)
ពតមន po + ta + robat + mo + aa + no [pmqdtmè'n] news

Robat is used for loan words that have been taken from Pali and Sanskrit. It is interesting to look at the complete list of words in the authoritative Chuon Nath dictionary that use robat:

កកដ ទយស ពណន វសទពណ អធម កប រ ទយយធម ពតមន វសគ : អនយ កណ ទសធម ពណន សកដមគ អន យកធម កបស ធម ពពណន សងខតធម អយត ធម គភ នយកធម ពធបកខយធម សមបណ អឃ ឆកមវចរស គ នយយនកធម ពយធធម សគ : អជ ន ជងឃមគ នវរណធម ម ទបព សពជញ អថ ជតធម បញច ពណ មគ សពងគ អសងខតធម តបធម បរប រណ មគ ស ពជញ ឃ ត យត ន បបធម យត ធម សទធ ថ ថ ទកខ ពត ប ណម មពណ សជ វធម យន ទដ ធម ប ព លអកពណ សពណ ឧនម គ ទគត ប ពទស វណ ស គ ទ គម ប ពនមត វបរ មធម ស គ ទជ ន បកខរពស វបយយ ធម ទពល ពណ វបយស ហមពណ

We can see very quickly that none of them has a coeng consonant in the same orthographic syllable as the robat (nor a superscript vowel).

Also, when រ appears in Chuon Nath, it is never followed by two subscript consonants. The words in this dictionary that include រ with a subscript consonant are:

ករ នស យចរយ ពទធ ចរយ វឌ ចរយ អនយតរ ថយ ករ សពទ បព ជ ចរយ មងគល ទពចរយ ទធ ចរយ អ ទ រយ ករ បចឆ ចរយ មហ ច រយ សរ ចរយ គនថចរនចរយ បដ រយ ម ហស រយ ស រយ រយ ជតកចរយ តរ ថយ ពរ ឡន ពទ រយ មករ វរ មន ស រយកន ម រ រយក សរ ពស ទរ ភក ពទ រយ វរ មត អនសចរយ

This is probably due to the fact that in most words that include two coeng consonants, the second one is (only 10 exceptions to this rule in all of Khmer).

If words with sound combinations similar to the one in the English word Arthritis (R + TH + R) were to be brought into Khmer (this would require រ ថ or another similar combination), we have to assume that they would be written using the modern Khmer form រ ថ and not robat, as in ថ.

All this shows that there are no words in Khmer that include an orthographic syllable combining robat and a coeng consonant, nor is there a reasonable possibility of such words being created. The second robat in the Unicode 4.0 standard order of components, after the first (or second) coeng consonant, is therefore unnecessary: it describes something that does not exist in Khmer and that is not desirable to permit. The standard order of components should change from:

B {R | C} {S {R}}* {{Z} V} {O} {ZJ S}

to:

B {R | C} {S}* {{Z} V} {O} {ZJ S}

5. Placement of the consonant shifter in the standard order of components

The change in the location of the consonant shifter (CS) in the standard order of components from Unicode 3.0 to Unicode 4.0 has broken the Unicode standard for Khmer, making specifications and fonts written for Unicode 3.0 incompatible with Unicode 4.0.

In Unicode 3.0 the CS was placed after the base consonant and coeng consonant, in a location that fits spelling order. Khmer speakers always think of the CS in this position because, until the coeng consonant has been written, they do not know where the CS will (physically) be placed. There are no cases in the Chuon Nath dictionary in which the CS is combined with two coeng consonants.

B {S}* {C} {V} {O} - Unicode 3.0 (C stands for consonant shifter)

Unicode 4.0 has moved the CS to a location before the coeng consonant. By doing this, it has rendered all specifications and files written before Unicode 4.0 incompatible with Unicode:

B {R | C} {S}* {{Z} V} {O} {S} - Unicode 4.0

From a compatibility point of view, the next version of Unicode should accept the CS in both positions, in order to be backwards compatible with Unicode 3.0 and 4.0, leading to:

B {R | C} {S}* {C} {{Z} V} {O} {ZJ S} (note 2)

Leaving the technical standards discussion aside, and as far as the Khmer language is concerned, the question of which place the CS should occupy is a complicated one. When the CS accompanies a base consonant and a coeng consonant, it sometimes affects the base consonant, as in the cases of ស ស ង and ម ក គក, and in other words it affects the coeng consonant, as in បន (note 3). In the case of the word បន (ប + ន in Unicode 3.0), if the CS was placed before (ប + ន ), it would affect ន, and it would not have to be a, but a (only can affect ន, and only will it shift correctly to the subscript form). In other words, the wrong character would have to be written in order to obtain the correct glyph (or all fonts would need to be re-developed and re-distributed, which is what this standard tries to avoid).

Note 2. There are never two coeng consonants in combination with a consonant shifter in the same orthographic syllable.

Note 3. This is independent of the fact that the user will probably think about placing the CS after writing the base and the coeng consonant (because at that point, if the CS changes to the subscript form, he knows where to write it).
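The compatibility argument of this section can be sketched as a small acceptance table. The code below works on component-letter strings (B, R, C, S, V, O, Z, with J standing for ZWJ) and reports which orders accept a syllable with the shifter after or before the subscript. The dictionary `ORDERS`, the label "both positions" and the function name are hypothetical names for this sketch; the Unicode 4.0 entry uses the simplified form quoted in this section, with robat and shifter read as alternatives in the first slot.

```python
import re

# Orders from this section, written over component letters.
ORDERS = {
    "Unicode 3.0": re.compile(r"BS*C?V?O?"),
    "Unicode 4.0": re.compile(r"B(?:R|C)?S*(?:Z?V)?O?S?"),
    "both positions": re.compile(r"B(?:R|C)?S*C?(?:Z?V)?O?(?:JS)?"),
}


def accepting_orders(letters):
    """Names of the orders whose grammar accepts the whole letter string."""
    return [name for name, rx in ORDERS.items() if rx.fullmatch(letters)]


# Base + subscript + shifter + vowel, then base + shifter + subscript + vowel.
for letters in ("BSCV", "BCSV"):
    print(letters, accepting_orders(letters))
```

One design point worth noting: read literally, the combined order also admits a shifter in both slots of the same syllable ("BCSC"). A validator built on it might want to reject that case explicitly.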

The solution of allowing the CS in two places seems to be the most correct one in both technical and orthographic terms. Of course, allowing the same character in two different locations leads to ambiguity in the cases in which the base consonant and the coeng consonant belong to the same series, but the number of cases is small, and this seems a minor problem compared to having to use the wrong character in some cases. Again, maybe there is a better solution.

6. Zero width non-joiner

Besides the use of the ZERO WIDTH NON-JOINER to avoid consonant-vowel ligatures, where it is placed in the location indicated by the standard order of components, page 282 of the Unicode 4.0 book says:

"If either muusikatoan or triisap needs to keep its superscript shape (as an exception to the general rule where other superscripts typically force the alternative subscript glyph for either character), U+200C ZERO WIDTH NON-JOINER should be inserted before the consonant shifter to show the normal glyph or a consonant shifter when the general rule requires the alternative glyph. In such cases, U+200C ZERO WIDTH NON-JOINER is inserted before the vowel sign."

If we integrate this in the standard order of components, it will give us:

B {R | {{Z} C}} {S}* {{Z} C} {{Z} V} {O} {ZJ S}

7. Conclusion

In several cases it is necessary to modify the standard order of components for Khmer Unicode: because the standard has been broken, because the attempt to include old Khmer forms has led to ambiguity in the order of characters that should be accepted, or because comments in the text do not fit the sequence.

The final order of components should be:

B {R | {{Z} C}} {S}* {{Z} C} {{Z} V} {O} {ZJ S}

Where:

B is a base character (consonant or independent vowel character)
R is a robat
C is a consonant shifter
S is a subscript consonant or independent vowel sign
V is a dependent vowel sign
Z is the zero width non-joiner
ZJ is the zero width joiner
O is any other sign
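For readers who want to experiment with the final order, the sketch below translates the brace notation mechanically into a Python regular expression over component letters and tries a few letter strings against it. The function name `bnf_to_regex`, the use of J for ZJ, and the `|` written between robat and the consonant shifter (read here as alternatives) are assumptions of this sketch, and the translation only handles the small notation used in this document.

```python
import re


def bnf_to_regex(order):
    """Translate '{x}' to an optional group and '{x}*' to a repeated group."""
    rx = order.replace("ZJ", "J").replace(" ", "")
    rx = rx.replace("}*", ")*")    # repeated groups first
    rx = rx.replace("}", ")?")     # remaining closers are optional groups
    rx = rx.replace("{", "(?:")
    return rx


FINAL_ORDER = "B {R | {{Z} C}} {S}* {{Z} C} {{Z} V} {O} {ZJ S}"
FINAL = re.compile(bnf_to_regex(FINAL_ORDER))

samples = {
    "BSCV": "shifter after the subscript (Unicode 3.0 position)",
    "BCSV": "shifter before the subscript (Unicode 4.0 position)",
    "BZCSV": "ZWNJ before the shifter",
    "BVJS": "old form: ZWJ + subscript after the vowel",
    "BVS": "trailing subscript without ZWJ",
    "BSRV": "robat after a subscript",
}
for letters, meaning in samples.items():
    verdict = "accepted" if FINAL.fullmatch(letters) else "rejected"
    print(f"{letters:6} {verdict:9} {meaning}")
```

Generating the pattern from the notation, rather than writing it by hand, keeps the check in step with whatever order is finally adopted; only the string in `FINAL_ORDER` would need to change.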
