L2/09-358R Introduction. Recommendation. Background. Rub El Hizb Symbol. For discussion at UTC and by experts. No action is requested.

Similar documents
Introduction. Requests. Background. New Arabic block. The missing characters

Standardizing the order of Arabic combining marks

Proposed Draft: Unicode Technical Report #53 UNICODE ARABIC MARK ORDERING ALGORITHM

Proposal for changes to ArabicShaping.txt to allow machine generation of Arabic fonts and glyphs. A. Generating Arabic glyphs from the Schematic Name

Proposal to encode fourteen Pakistani Quranic marks

The right hehs for Arabic script orthographies of Sorani Kurdish and Uighur

Character Set Supported by Mehr Nastaliq Web beta version

ARABIC LETTER BEH WITH SMALL ARABIC LETTER MEEM ABOVE MBEH (BEH WITH MEEM ABOVE) Represents mba sound

The Unicode Standard Version 6.1 Core Specification

The Unicode Standard Version 6.0 Core Specification

Nafees Nastaleeq v1.02 beta

Title: Application to include Arabic alphabet shapes to Arabic 0600 Unicode character set

Blending Content for South Asian Language Pedagogy Part 2: South Asian Languages on the Internet

Nafees Nastaleeq v1.01 beta

Proposal to encode three Arabic characters for Arwi

ISO/IEC JTC/SC2/WG Universal Multiple Octet Coded Character Set (UCS)

Andrew Glass and Shriramana Sharma. anglass-at-microsoft-dot-com jamadagni-at-gmail-dot-com November-2

ISO/IEC JTC1/SC2/WG2 N3734 L2/09-144R

L2/ ISO/IEC JTC 1/SC 2/WG 2 PROPOSAL SUMMARY FORM TO ACCOMPANY SUBMISSIONS FOR ADDITIONS TO THE REPERTOIRE OF ISO/IEC

Comments on the Proposals to Encode Tamil Symbols and Fractions by ICTA Sri Lanka. The document

Multilingual mathematical e-document processing

Font Features for Lateef

2. Requester's name: Urdu and Regional Language Software Development Forum, Ministry of Science and Technology, Government of Pakistan

Two distinct code points: DECIMAL SEPARATOR and FULL STOP

Bengali Script: Formation of the Reph and Yaphala, and use of the ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER

A. Administrative. B. Technical General

L2/11-033R 1 Introduction

Unicode in Education. Adil Allawi Technical Director

JTC1/SC2/WG2 N

The Unicode Standard Version 6.1 Core Specification

A. Administrative. B. Technical General ISO/IEC JTC1/SC2/WG2 N1948

Proposal to encode productive Arabic-script modifier marks

Transliteration of Tamil and Other Indic Scripts. Ram Viswanadha Unicode Software Engineer IBM Globalization Center of Competency, California, USA

ZWJ requests that glyphs in the highest available category be used; ZWNJ requests that glyphs in the lowest available category be used.

Arabic Text Segmentation

Yes. Form number: N2652-F (Original ; Revised , , , , , , , )

ISO/IEC JTC 1/SC 2/WG 2 PROPOSAL SUMMARY FORM TO ACCOMPANY SUBMISSIONS 1

****This proposal has not been submitted**** ***This document is displayed for initial feedback only*** ***This proposal is currently incomplete***

A Comparative Study of PDF Generation Methods:

Nastaleeq: A challenge accepted by Omega

Proposal to encode Devanagari Sign High Spacing Dot

Proposal to Encode Oriya Fraction Signs in ISO/IEC 10646

UNICODE IDEOGRAPHIC VARIATION DATABASE

The proposer gratefully acknowledges the help of Jony Rosenne in preparing this proposal.

The Unicode Standard Version 10.0 Core Specification

MONGOLIAN SCRIPT IN UNICODE SAN JOSE 2018

Microsoft Excel 2000 Charts

L.S.A.T. Auto-Analysis User Guide

The Unicode Standard Version 6.0 Core Specification

097B Ä DEVANAGARI LETTER GGA 097C Å DEVANAGARI LETTER JJA 097E Ç DEVANAGARI LETTER DDDA 097F É DEVANAGARI LETTER BBA

Introduction. Acknowledgements

Center for Language Engineering Al-Khawarizmi Institute of Computer Science University of Engineering and Technology, Lahore

Rendering in Dzongkha

ISO/IEC JTC 1/SC 2/WG 2 PROPOSAL SUMMARY FORM TO ACCOMPANY SUBMISSIONS FOR ADDITIONS TO THE REPERTOIRE OF ISO/IEC

University of Waterloo. CS251 Final Examination. Spring 2008

1. Introduction 2. TAMIL DIGIT ZERO JTC1/SC2/WG2 N Character proposed in this document About INFITT and INFITT WG

Information Retrieval of Text with Diacritics

ISO/IEC JTC1/SC2/WG2 N2641

Kannada 2. L2/ Representation of Jihvamuliya and Upadhmaniya in Kannada Srinidhi

Ꞑ A790 LATIN CAPITAL LETTER A WITH SPIRITUS LENIS ꞑ A791 LATIN SMALL LETTER A WITH SPIRITUS LENIS

5c. Are the character shapes attached in a legible form suitable for review?

ZWJ/ZWNJ. behavior under Indic scripts with special reference to chillu, conjuncts, etc in Malayalam. Rajeev J Sebastian Rachana Akshara Vedi

Web Content Accessibility Guidelines 2.0 Checklist

1 Action items resulting from the UTC meeting of December 1 3, 1998

Others Symbols, Additional characters proposed to Unicode. Azzeddine Lazrek

UC Irvine Unicode Project

ISO/IEC JTC 1/SC 2/WG 2/N2789 L2/04-224

1. Question about how to choose a suitable program. 2. Contrasting and analyzing fairness of these two programs

ISO/IEC JTC 1/SC 2/WG 2 Proposal summary form N2652-F accompanies this document.

Comments on responses to objections provided in N2661

Form number: N2352-F (Original ; Revised , , , , , , ) N2352-F Page 1 of 7

Proposal to Encode Some Outstanding Early Cyrillic Characters in Unicode

Web Content Accessibility Guidelines 2.0 level AA Checklist

Because of these dispositions, Ireland changed its vote to Yes, leaving only one Negative vote (Japan)

Internationalized Domain Names Variant Issues Project

transcribing Urdu or Arabic words. Accordingly, the KHHA and GHHA should be considered atomic, as Tibetan TSA, TSHA, and DZA are.

Joiners (ZWJ/ZWNJ) with Semantic content for words in Indian subcontinent languages

Web Content Accessibility Guidelines 2.0 Checklist

FAO Web Content Accessibility Guidelines

WCAG 2.0 Checklist. Perceivable Web content is made available to the senses - sight, hearing, and/or touch. Recommendations

Use of ZWJ/ZWNJ with Mongolian Variant Selectors and Vowel Separator SOURCE: Paul Nelson and Asmus Freytag STATUS: Proposal

Learn how to use bold, italic, and underline to emphasize important text.

A. Administrative. B. Technical -- General

Web Content Accessibility Guidelines (WCAG) 2.0 Statement of Compliance

INTERNATIONALIZATION IN GVIM

VPAT Web Content Accessibility Guidelines 2.0 level AA

ISO/IEC JTC1/SC2/WG2. Universal Multiple-Octet Coded Character Set (UCS) - ISO/IEC Secretariat: ANSI

Proposal on Handling Reph in Gurmukhi and Telugu Scripts

Structure Vowel signs are used in a manner similar to that employed by other Brahmi-derived scripts. Consonants have an inherent /a/ vowel sound.

JTC1/SC2/WG2 N4945R

The Unicode Standard Version 11.0 Core Specification

A B Ɓ C D Ɗ Dz E F G H I J K Ƙ L M N Ɲ Ŋ O P R S T Ts Ts' U W X Y Z

Bopomofo Ruby. implement and requirement. Bobby Tung. Founder of Wanderer Digital Publishing Inc. Consultant of Taiwan Digital Publishing Forum

Proposal to encode ADLAM LETTER APOSTROPHE for ADLaM script

ISO/IEC JTC1/SC2/WG2 N4210

Google 1 April A Generalized Unified Character Code: Western European and CJK Sections

Writing Domain Names in Different Arabic Script Based Languages

Polygons and Angles: Student Guide

DATE: Time: 12:28 AM N SupportN2621 Page: 1 of 9 ISO/IEC JTC 1/SC 2/WG 2

1 ISO/IEC JTC1/SC2/WG2 N

Transcription:

L2/09-358R 2009-10-28 Title: Action: Authors: Discussion document for polishing Koranic support in Unicode For discussion at UTC and by experts. No action is requested. Roozbeh Pournader Date: 2009-10-28 Introduction Although we are almost there, Unicode still has a few ambiguities and missing characters for properly encoding the modern day representation of Koranic text. Experts have different opinions on how this should be fixed. This document tries to point to the existing problems, as of Unicode 5.2, and make some recommendations. Recommendation Change the properties of U+06DE ARABIC START OF RUB EL HIZB from that of a combining mark to a normal spacing symbol (properties could be copied from U+06E9 ARABIC PLACE OF SAJDAH); Encode four missing characters (three open tanweens and one combining version of small waw); Suggest a solution for missing carriers and horizontally-stacking harakat in the text of the standard (for example, recommend the use of U+0640 ARABIC TATWEEL and U+00A0 NO-BREAK SPACE as carriers). Background At a recent talk by Thomas Milo at the Unicode and Internationalization Conference 33, titled The Unicodebased Koran: a Conflict Between Calligraphic Tradition and Computer Typography brought this issue back into the author s agenda. At the talk, Thomas Milo explained some of the existing problems, and proposed his solution based on his ideal Arabic character model. Unfortunately, changing the Unicode Arabic model for basic characters, like U+0621 ARABIC LETTER HAMZA, may be impossible this late in the standardization process. The author is using the opportunity to bring into attention the other issues he has found with Unicode Koranic support, and also try to explain the still-existing point of frustration of people who try to encode Koranic text. Rub El Hizb Symbol To help the devout Muslims recite the whole Koran during the holy month of Ramadan, the Koran is divided into thirty parts. These parts, called juz,(جزء) are different from the traditional 114 surahs :(سورة) they divide the Koran into roughly equal amounts of text, so the devout can spend almost equal amount of time each day for their recitations. For further help timing, each juz is further divided into two hizbs,(حزب) and each hizb into four rub -al-hizbs الحزب).(ربع In printed Korans, this sectioning is typically specified in the margins, with a marker inside running text to point the exact location of one rub -al-hizb ending and another starting (see Figure 1). The Unicode character U+06DE ARABIC START OF RUB EL HIZB is that in-text marker. In printed Korans, is appears in running text by itself, usually adjacent to an end-of-ayah marker. The rendering behavior is the same as U+06E9: the character just sits there and does not interact with other adjacent text. Discussion document for polishing Koranic support in Unicode Page 1 of 5

But in Unicode, U+06DE is specified to be a combing character. The author believes this to be a simple mistake, since it was encoded next to U+06DD ARABIC END OF AYAH (which was originally a combining character too). Looking at the reference glyph for U+06DE somehow confirms this too: there is a dotted circle in the glyph, but there is no way something could fit in there. Figure 1. Sample from a Koran printed in Iran. The character U+06DE ARABIC START OF RUB EL HIZB is visible at the beginning of line -2, after an End of Ayah marker. The large decoration at the left margin of the page helps readers find the inline marker easier. The text in the center of the decoration reads half of hizb 36 الحزب ( ٣٦,(نصف indicating that the reader is at approximately 71/120 of the whole text of the Koran. Missing characters For encoding the most common modern form of published Koranic text, four characters are missing. Three are open forms of tanween, which mark a pronunciation difference with normal tanweens, already encoded at U+064B..064D. The other is a combining companion of U+06E5 ARABIC SMALL WAW, needed for encoding some words (compare to the pair U+06E6 ARABIC SMALL YEH and U+06E7 ARABIC SMALL HIGH YEH). The three open tanweens have been proposed at least twice, by Thomas Milo in L2/01-325 and by Jonathan Kew in L2/02-275. There is no indication that the first document was ever on the UTC table, and although the second document appears on the agenda for UTC #92, there is no mention of it in the minutes. These may not have ever been discussed in the UTC. There is also a third document which appears like a Unicode proposal but was never submitted for consideration of the committee. It s Arabeyes s Proposal to add four Arabic characters to the BMP of the UCS and fix four characters written by Mohammad Yousif and Nadim Shaikli, publicly available at http://arabeyes.org/~nadim/tmp/unicode_quran_prop.pdf The open tanweens point toward a different pronunciation of tanween in Koranic text. While a normal tanween would be pronounced as [an], [in], or [un] with a clear [n] sound (called izhār), an open tanween will hint toward either nasalization of the [n] sound (called ikhfā ), or its total disappearance, geminating the next consonant (called idqām). Discussion document for polishing Koranic support in Unicode Page 2 of 5

Figure 2. Symbol legend in Arabic language, appearing at the end of a Koran published in Iran. A normal tanween is shown first, saying it marks izhār.(إظهار) Then an open tanween is shown,.(إخفاء) ikhfā and (إدغام) saying it marks idqām Figure 3. Examples of the three open tanween characters as opposed to normal tanweens, from L2/02-275 (green = fathatan, blue = dammatan, purple = kasratan, ignore red). Note that intelligent Koranic software, in order to determine the open shape of the Kasratan on the last line (if not provided), would need to skip over at least three characters that compose the marker: an End of Ayah character and two digits. Discussion document for polishing Koranic support in Unicode Page 3 of 5

Theoretically, the shape of the tanween could be determined by the next pronounced consonant, and advanced reciters even learn the rules to determine it themselves. But determination of the next pronounced consonant would need skipping over various characters to be able to determine the next pronounced consonant. The skipped-over characters may include spaces, unpronounced alefs (possibly with some combining marks), symbols like START OF RUB EL HIZB and PLACE OF SAJDA, END OF AYAH symbols with their pack of following digits, page breaks, or sura breaks. The author believes that while this is achievable in software, it is overkill for text rendering engines who want to reflect the actual textual content of Koranic text. The other missing character, ARABIC SMALL HIGH WAW, is exemplified in the Arabeyes document, although not as a new character. The character appears mid-word in Koranic text, as shown in Figure 4 (lefthand). U+06E5 ARABIC SMALL WAW cannot be used for this purpose, as it is a non-joining spacing character, and would result in the first part of the word getting disconnected from the second part. Figure 4. On the right side, there are two examples of spacing U+06E5 ARABIC SMALL WAW at the end of each word. On the left side, there is a combining version (applied to either a seen, or a tatweel, and followed by a small madda), not encoded yet. Also note the visual difference between ARABIC SMALL HIGH WAW (typically indicating a long [u:] sound) and a normal U+064C ARABIC DAMMA (typically indicating a short [u]). Missing carriers and horizontally-stacking harakat Modern printed versions of the Koran prefer keeping the original word skeletons the same as the oldest versions of Koran found, when there were no dots or harakat used. For example, the left-hand part of Figure 4, the word liyasu u, will probably be written as something like ليسؤوا in a modern Arabic orthography for modern text. Not changing the skeleton, is keeping with the tradition of writing the Koran, and some Muslim consider the original skeleton shapes the only authentic shapes, as the dots and the harakat are later additions and subject to interpretations and disagreements. (More examples can be seen in Figure 5.) But keeping the skeleton intact has required the fully-marked-up Korans to inserts marks in places where a mark usually doesn t appear in modern Arabic text. In the example word, after the seen, a reciter will read a damma, a small high waw, a (small high?) madda, a hamza (above?), and another damma before arriving at the next letter, a waw. There are three visual stacks of harakat visible. The first stack has the first damma, the second stack has the small high waw and madda, and the third stack consists of a hamza and the second damma. There are different approaches to encode that text: 1. Encode all the characters as combining marks over seen, and say that the three stacks happen because of stylistic reasons. (This will result in over-stacking and ineligibility in Koranic texts rendered by normal software, and may be unachievable because of the effect of normalization on Koranic text, since normalization may change the order of harakat). Sample string: <..., seen, damma, small high waw, small high madda, hamza above, damma, waw,...> Discussion document for polishing Koranic support in Unicode Page 4 of 5

2. Specify that for breaking stacks, a special character like U+034F COMBINING GRAPHEME JOINER should be used. (This will solve the normalization issue of case 1, but will still result in over-stacking in non-specialized sofware.) Sample string: <..., seen, damma, CGJ, small high waw, small high madda, CGJ, hamza above, damma, waw,...> 3. Change the behavior of characters like U+0621 ARABIC LETTER HAMZA etc. to support such dualjoining behavior in Koranic text, or encode new characters for hamza etc. with the new behavior. Sample string: <.., seen, damma, (dual-joining) high waw, small high madda, (dual-joining?) hamza, damma, waw,...> 4. Specify that a U+0640 ARABIC TATWEEL or U+00A0 NO-BREAK SPACE should be used to carry such harakat. (This is not semantically correct, but has the least changes of the existing Unicode Arabic model, and existing software could continue to display in an acceptable rendering.) Sample text: <..., seen, damma, tatweel, small high waw, small high madda, tatweel, hamza above, damma, waw,...> An example needing NBSP could be seen in Figure 5. Figure 5. On the right side, U+0670 ARABIC LETTER SUPERSCRIPT ALEF (a combining character), with an additional madda, is seen standing alone between a hamza and a waw-hamza (this would need the use of NBSP in the fourth approach). On the left side, a combining hamza is seen between a lam and an alef forming a ligature. If the fourth approach is taken, the text will be rendered as something like ل ای ة (acceptable) with normal Unicode-complying software, while Koranic software can create a ligature. Acknowledgments The author wishes to thank Thomas Milo, Adil Allawi, Behdad Esfahbod, and the Arabeyes free software community for fruitful discussions. Some samples are taken from papers and documents by Thomas Milo and Jonathan Kew. Discussion document for polishing Koranic support in Unicode Page 5 of 5