JTC1/SC2/WG2 N3915. Foreword for this preliminary proposal. 1. Introduction

Similar documents
ISO/IEC JTC1/SC2/WG2 N4078

Introduction. Acknowledgements

ꟸ A7F8 LATIN SUBSCRIPT SMALL LETTER S ꟹ A7F9 LATIN SUBSCRIPT SMALL LETTER T ꟺ A7FA LATIN LETTER SMALL CAPITAL TURNED M

5c. Are the character shapes attached in a legible form suitable for review?

ISO/IEC JTC1/SC2/WG2 N4210

ISO/IEC JTC1/SC2/WG2 N4445

1. Introduction. 2. Proposed Characters Block: Latin Extended-E Historic letters for Sakha (Yakut)

1. Introduction. 2. Encoding Considerations

5 U+1F641 SLIGHTLY FROWNING FACE

ISO/IEC JTC1/SC2/WG2 N4212

ISO/IEC JTC1/SC2/WG2 N2641

JTC1/SC2/WG2 N

5 U+1F641 SLIGHTLY FROWNING FACE

A feature of Teuthonista is the stacking of two characters to denote an intermediate sound, like p

B. Technical General 1. Choose one of the following: 1a. This proposal is for a new script (set of characters) Yes.

Proposal to Encode Combining Half Marks Used for Cyrillic Supralineation in Unicode

1.1 The digit forms propagated by the Dozenal Society of Great Britain

Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation

1. Introduction. 2. Encoding Considerations

ä + ñ ISO/IEC JTC1/SC2/WG2 N

JTC1/SC2/WG2 N4945R

JTC1/SC2/WG

If this proposal is adopted, the following three characters would exist: with the following properties (including a change from Ll to Lo for U+0294):

Reply to L2/10-327: Comments on L2/10-280, Proposal to Add Variation Sequences... 1

ZWJ requests that glyphs in the highest available category be used; ZWNJ requests that glyphs in the lowest available category be used.

Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation

Ꞑ A790 LATIN CAPITAL LETTER A WITH SPIRITUS LENIS ꞑ A791 LATIN SMALL LETTER A WITH SPIRITUS LENIS

Four other characters under ballot are correctly named. Giving x on its own for comparison, and showing the additions in plain and italic type:

1. Introduction. 2. Encoding Considerations

Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation

Fig. 2 of E.161 Fig. 3 of E.161 Fig. 4 of E.161

1. Introduction. Properties: 2E4E;DOUBLE HYPHEN;Pd;0;ON;;;;;N;;;;; Entry in LineBreak.TXT: 2E4E;BA # DOUBLE HYPHEN

Although the International Phonetic Alphabet deprecated the letters. Unicode Character Properties. Character properties are proposed here.

CYRILLIC LETTER OMEGA WITH TITLO

ISO/IEC JTC/1 SC/2 WG/2 N2312. ISO/IEC JTC/1 SC/2 WG/2 Universal Multiple-Octet Coded Character Set (UCS)

Doc Type: Working Group Document Title: Recommendations for encoding Xiàngqí game symbols Source: Michael Everson (

transcribing Urdu or Arabic words. Accordingly, the KHHA and GHHA should be considered atomic, as Tibetan TSA, TSHA, and DZA are.

Universal Multiple-Octet Coded Character Set

A U+A7A0 LATIN CAPITAL LETTER G WITH OBLIQUE STROKE B C U+A7A2 LATIN CAPITAL LETTER K WITH OBLIQUE STROKE D

097B Ä DEVANAGARI LETTER GGA 097C Å DEVANAGARI LETTER JJA 097E Ç DEVANAGARI LETTER DDDA 097F É DEVANAGARI LETTER BBA

PHOENICIAN NUMBER ONE. More recent research into Imperial Aramaic (N3273R2, L2/07-197R2, 2007-

ISO/TC46/SC4/WG1 N 240, ISO/TC46/SC4/WG1 N

Structure Vowel signs are used in a manner similar to that employed by other Brahmi-derived scripts. Consonants have an inherent /a/ vowel sound.

A. Administrative. B. Technical General

0CE2 Ä KANNADA VOWEL SIGN VOCALIC L 0CE3 Å KANNADA VOWEL SIGN VOCALIC LL 0CE4 Ç KANNADA DANDA 0CE5 É KANNADA DOUBLE DANDA

This document is to be used together with N2285 and N2281.

JTC1/SC2/WG2 N3514. Proposal to encode a SOCCER BALL symbol Karl Pentzlin, U+26xx SOCCER BALL

G U+20BD COMBINING MUSCOVY RUBLE SIGN J U+20BE COMBINING MUSCOVY DENGA SIGN H U+20BB IMPERIAL RUBLE SIGN K U+20BC IMPERIAL KOPECK SIGN

Proposal to encode ADLAM LETTER APOSTROPHE for ADLaM script

Sebastian Kempgen Features of the "Kliment Std" Font v. 5.0, 2018

the character FE0F; emoji style; FE0F; emoji style; FE0F; emoji style; FE0F; emoji style; FE0F; emoji style; FE0F; emoji style; FE0F; emoji style;

Request for encoding 1CF4 VEDIC TONE CANDRA ABOVE

ARABIC LETTER BEH WITH SMALL ARABIC LETTER MEEM ABOVE MBEH (BEH WITH MEEM ABOVE) Represents mba sound

JTC1/SC2/WG2 N3048. Form number: N2352-F (Original ; Revised , , , , , , ) Page 1 of 5

1. Introduction.

Information technology Database languages SQL Part 1: Framework (SQL/Framework)

5. Glyphs. The 40 basic Unifon letters as used for English phonemes are as follows:

Proposal to encode fourteen Pakistani Quranic marks

A. Administrative. B. Technical -- General

RomanCyrillic Std v. 7

German National Body comment on SC 2 N4052 Date: Document: WG2 N3592-Germany

ISO/IEC JTC 1/SC 35. User Interfaces. Secretariat: Association Française de Normalisation (AFNOR)

****This proposal has not been submitted**** ***This document is displayed for initial feedback only*** ***This proposal is currently incomplete***

A. Administrative. B. Technical General

Page 2 of 6 We therefore propose to unify characters with different glyph forms in different source dictionaries where the glyph variants are not used

Üù àõ [tai 2 l 6] (in older orthography Üù àõ»). Tai Le orthography is simple and straightforward:

Annex A: Proposal Summary Form Page 2 of 3

Proposal to Encode Phonetic Symbols with Retroflex Hook in the UCS

ISO/IEC JTC1/SC2/WG2 N xxxx L2/04-406

European Ordering Rules

Response to the revised "Final proposal for encoding the Phoenician script in the UCS" (L2/04-141R2)

The Unicode Standard Version 11.0 Core Specification

Information technology Coded graphic character set for text communication Latin alphabet

Introduction. Requests. Background. New Arabic block. The missing characters

Andrew Glass and Shriramana Sharma. anglass-at-microsoft-dot-com jamadagni-at-gmail-dot-com November-2

Keyboard Version 1.1 designed with Manual version 1.2, June Prepared by Vincent M. Setterholm, Logos Research Systems, Inc.

Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation

Transliteration of Tamil and Other Indic Scripts. Ram Viswanadha Unicode Software Engineer IBM Globalization Center of Competency, California, USA

ISO/IEC JTC 1/SC 35 User Interfaces Secretariat: AFNOR

Proposal to add U+2B95 Rightwards Black Arrow to Unicode Emoji

ISO/IEC JTC 1/SC 35 User Interfaces Secretariat: AFNOR

Information technology Biometric data interchange formats Part 2: Finger minutiae data

INTERNATIONAL STANDARD. Part 2: Arabic I anguage - Simplified transliteration

also represented by combnining vowel matras with ē, and ō: ayɯ, eyi, ayi;

Information technology Document Schema Definition Languages (DSDL) Part 8: Document Semantics Renaming Language (DSRL)

ISO/IEC JTC1/SC2/WG2 N3734 L2/09-144R

Information technology Database languages SQL Part 14: XML-Related Specifications (SQL/XML)

L2/ Universal Multiple-Octet Coded Character Set

Proposal to Add Four SENĆOŦEN Latin Charaters

SOURCE: U.S. National Body, UTC, and Script Encoding Initiative (UC Berkeley) ACTION: For consideration by WG2 DISTRIBUTION: ISO/IEC JTC1/SC2/WG

REGISTRATION NUMBER: G1: G2: G3: C0: C1: NAME Supplementary set for Latin-4 alternative with EURO SIGN

ISO/IEC JTC 1/SC 2/WG 2/N2789 L2/04-224

8Floats and footnotes

1. Question about how to choose a suitable program. 2. Contrasting and analyzing fairness of these two programs

Information technology Open Systems Interconnection The Directory Part 6: Selected attribute types

Title: Application to include Arabic alphabet shapes to Arabic 0600 Unicode character set

Information and documentation Romanization of Chinese

ISO/IEC JTC1/SC2/WG2. Universal Multiple-Octet Coded Character Set (UCS) - ISO/IEC Secretariat: ANSI

ISO/IEC JTC1/SC2/WG2 N 4346R

Standardizing the order of Arabic combining marks

Transcription:

JTC1/SC2/WG2 N3915 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Preliminary Proposal to enable the use of Combining Triple Diacritics in Plain Text (two possible solutions) Author: Karl Pentzlin Status: Individual Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Date: Foreword for this preliminary proposal This proposal is presented to the JTC1/SC2/WG2 October 2010 meeting in Busan, with the kind request to take a decision which of the two solutions presented here shall be followed in principle in the final proposal. (While there are hints that the solution B is already preferred by the UTC, this could not be confirmed from the minutes of the UTC meetings. This preliminary proposal intends to document the possible solutions publicly; a public documentation of the decision about the way to proceed, by a SC2/WG2 resolution, is welcomed.) 1. Introduction In several phonetic and dialectological transcription systems, diacritical marks are found which span over three base characters. However, there are only four forms found so far: conjoining bows above or below the base characters, a tilde below the base characters, and a roof (i.e. a circumflex) above the base characters. Two of them (the bows) already were proposed (thus, the examples on which the proposals were based are not repeated here [in this preliminary proposal; in the final version they are intended to be included]). The "bow above" (here shown as U+1AFD COMBINING TRIPLE INVERTED BREVE ABOVE) was proposed in L2/09-028 (N3571) "Proposal to encode additional characters for the Uralic Phonetic Alphabet" by Finnish and Irish NB as U+1DFB COMBINING TRIPLE INVERTED BREVE. The "bow below" (here shown as U+1AFD COMBINING TRIPLE BREVE BELOW) was proposed in L2/08-392 "Proposal to encode a combining diacritical mark for Low German dialect writing" as U+1DFD COMBINING TRIPLE BREVE BELOW. Regarding the third mark, an example is shown in Fig. 1. As there many projects to encode characters for the dialectology of several countries are ongoing, such triple marks are urgently needed. Fig. 3 shows an example from Slovenia, proving that the triple marks are really productive. To use such triple marks with the Latin script provides a challenge for rendering engines anyway. Here, two solutions are proposed: Preiiminary Proposal to enable the use of Combining Triple Diacritics within Plain Text Page 1 of 7

Solution A proposes to encode the combining triple marks as single characters, like it is done for combining double marks like U+035C...U+0362, U+1DCD, and U+1DFC. Solution B proposes to use building blocks, which can be used to create bows spanning over/below an arbitrary long sequence of base characters, using the mechanism provided by U+FE24...FE26 to accomplish continuous macrons over sequences of Coptic letters. In principle, both solutions can be accepted together, to use the triple diacritics for units with defined semantics (like in dialect writing/transcription), while to use the building blocks for prosody and other kinds of marking arbitrary sequences of characters. This has precedence in encoding U+FE22/FE23 COMBINING DOUBLE TILDE LEFT/RIGHT HALG in parallel to U+0360 COMBINING DOUBLE TILDE. In fact, accepting Solution A, together with U+FE27...FE2C from Solution B, provides the maximum of flexibility and usability together with a high degree of consistency. 2. Discussion of Solution A To have the triple diacritical marks encoded as characters of their own, has the advantage that the character identity is clearer from the beginning on. Semantic units, as the elements of dialectological writing/transcribing, are preserved. The triple marks, as proposed here, are to be combined with (i.e. to be input after) the first of the three base characters that they are to be applied to. This is in conformance with the use of the already encoded double marks, which are combined with the first of the two base characters that they are applied to. To the rendering engine, the new demands are marginal. Even now, with double diacritics spanning over two base characters, a rendering system of quality has to take into account the width of both base characters, and the placement of all combining marks which are attached to any of these two base characters. Thus, the rendering engine can generate its graphical output only when the second base character (which in fact follows the double mark) with its entire single combining marks is read. The introduction of triple combining characters extends this for one base character more, without introducing something totally new. To use the system of Canonical Combining Class Values in an optimal way, two new such values should be introduced (235 "Triple below" and 236 "Triple above", following the existing 233 "Double below" and 234 "Double above"). This would have the effect to get the triple marks "outermost", beyond the double marks which are the "outermost" diacritics until now (ignoring the special case of iota subscriptum). If any stability policy prevents this, the values 233 and 234 can be used for triple marks also, as no example of stacking double and triple marks together is known until now. 3. Discussion of Solution B This solution follows the principle outlined in the UTC Action Item 120-A87 (see L2/09/225, re L2/09-281 COMBINING TRIPLE INVERTED BREVE and other triple length combining marks" by Deborah Anderson), where is stated: " Action Item for Peter Constable (UTC Liaison to WG2): Triple marks should be handled by the mechanism associated with two part diacritics in the Combining Half Marks block at U+FE20." Preiiminary Proposal to enable the use of Combining Triple Diacritics within Plain Text Page 2 of 7

First, this solution requires the scope of the already encoded characters: U+FE20 COMBINING LIGATURE LEFT HALF U+FE26 COMBINING CONJOINING MACRON U+FE20 COMBINING LIGATURE LEFT HALF to be expanded: Any sequence of one base character with U+FE20 applied, one or more base characters each with U+FE26 applied, and one base character with U+FE21 applied, shall yield a ligature bow over the complete sequence of base characters, starting with the one to which U+FE20 is applied, and ending with the one to which U+FE21 is applied. Then, the same applies to sequences regarding the proposed U+FE27, U+FE2C, U+FE29, except that the ligature bow goes under the character sequence. Likewise, the same applies to sequences regarding the proposed U+FE28, U+FE2C, U+FE29, except that this yields a tilde under the character sequence. As both the ligature bow and the tilde end in a turn up, and as the sequences are to be rendered in a special way anyway determined by the starting part, the end parts for bow and tilde are unified in U+FE29. (Otherwise, either the obvious reservation for the left and right macron below halves had to be dropped, or the characters could not have been placed into the same block.) Also, this applies to sequences regarding the proposed U+FE2D, U+FE2F, U+FE2E, except that a circumflex instead a bow is applied over the sequence. The application of more than one U+FE2F achieves only one peak displayed regardless how much U+FE2F are contained in the sequence. (This shows one disadvantage of Solution B, compared with Solution A: Three different characters are needed to get the same graphical representation, which is achieved with Solution A with only one character.) The demands on the rendering engine are more extensive than for Solution A. Until now, the "building block principle" only duplicates the way two special characters may be represented (U+FE20/U+FE21 resemble U+0361, and U+FE22/U+FE23 resemble U+0360), or renders at straight lines which cause no real trouble if not applied in the correct order (i.e. a base character with U+FE24 is to be followed by one with U+FE26, and one with U+FE26 is to be followed by one either with U+FE26 again or with U+FE25). Solution B would extend this to arbitrary long bows that have to be computed after processing the whole character sequence until the correct end part has to be found, including doing all error processing if the expected parts of the sequence are not found, or if the required beginning part is missed. However, there are applications at least for the bows where it is desirable to apply bows of arbitrary length in plain text (see fig. 2). Also, copying and pasting base characters within such sequences may be more transparent to the user, as (at least when the base characters with all its combining diacritical marks are affected) carry their "bow part" with them. However, this may be a drawback when the triple marks are part of units of a dialectological writing/transliterating system. Preiiminary Proposal to enable the use of Combining Triple Diacritics within Plain Text Page 3 of 7

4. Proposed Characters Solution A The code points are based on the assumption that the new block Combining Diacritical Marks Extended-A at U+1AB0 U+1AFF, as contained in other proposals, is accepted. U+1AFC COMBINING TRIPLE CIRCUMFLEX ABOVE U+1AFD COMBINING TRIPLE INVERTED BREVE ABOVE U+1AFE COMBINING TRIPLE BREVE BELOW U+1AFF COMBINING TRIPLE TILDE BELOW Properties: 1AFC;COMBINING TRIPLE CIRCUMFLEX ABOVE;Mn;236;NSM;;;;;N;;;;; 1AFD;COMBINING TRIPLE INVERTED BREVE ABOVE;Mn;236;NSM;;;;;N;;;;; 1AFE;COMBINING TRIPLE BREVE BELOW;Mn;235;NSM;;;;;N;;;;; 1AFF;COMBINING TRIPLE TILDE BELOW;Mn;235;NSM;;;;;N;;;;; New Canonical Combining Class Values: 235 Triple below 236 Triple above Preiiminary Proposal to enable the use of Combining Triple Diacritics within Plain Text Page 4 of 7

5. Proposed Characters Solution B U+FE27 COMBINING LIGATURE BELOW LEFT HALF U+FE28 COMBINING TILDE BELOW LEFT HALF U+FE29 COMBINING LIGATURE OR TILDE BELOW RIGHT HALF U+FE2A, U+FE2B: reserved (for "combining conjoining macron below" halves, if needed) U+FE2C COMBINING CONJOINING MACRON BELOW U+FE2D COMBINING CONJOINING CIRCUMFLEX ASCENDER U+FE2E COMBINING CONJOINING CIRCIMFLE X DESCENDER U+FE2F COMBINING CONJOINING CIRCUMFLEX PEAK the conjoining of more than one of U+FE2F render in a larger single peak on the affected base characters Properties: FE27;COMBINING LIGATURE BELOW LEFT HALF;Mn;230;NSM;;;;;N;;;;; FE28;COMBINING TILDE BELOW LEFT HALF;Mn;230;NSM;;;;;N;;;;; FE29;COMBINING LIGATURE OR TILDE BELOW RIGHT HALF;Mn;230;NSM;;;;;N;;;;; FE2C;COMBINING CONJOINING MACRON;Mn;230;NSM;;;;;N;;;;; FE2D;COMBINING CONJOINING CIRCUMFLEX ASCENDER;Mn;230;NSM;;;;;N;;;;; FE2E;COMBINING CONJOINING CIRCUMFLEX DESCENDER;Mn;230;NSM;;;;;N;;;;; FE2F;COMBINING CONJOINING CIRCUMFLEX PEAK;Mn;230;NSM;;;;;N;;;;; Preiiminary Proposal to enable the use of Combining Triple Diacritics within Plain Text Page 5 of 7

6. Examples and Figures Fig. 1: Elliot's "Runes" (Manchester University Press, 1987), showing an example of a threeletter bind rune transliterated with a circumflex over the three letters "der". Thanks to Andrew West for providing this example. Fig. 2: V. E. Nash-Williams, "The Early Christian Monuments of Wales". Cardiff, 1950: showing bows upon letter sequences of different length, which denote some features of the text, rather than marking special phonetic units with distinctive semantics. Thanks to Andrew West for providing this example. Preiiminary Proposal to enable the use of Combining Triple Diacritics within Plain Text Page 6 of 7

Fig. 3: Excerpt from the PUA of the Font ZRCola.ttf used for the Obščeslavjanskij ingvističeskij atlas, showing a lot of transcription units using bows on three base letters (at EE0C, EE25...EE28, EE3D). Thanks to Peter Weiss for providing that font. Fig. 4: A specimen for the triple tilde below from: Wenker, Georg, et al.: Deutscher Sprachatlas auf der Grundlage des Sprachatlas des Deutschen Reichs, Marburg (Lahn) 1927-1956; introduction, p. 18. Preiiminary Proposal to enable the use of Combining Triple Diacritics within Plain Text Page 7 of 7